Web Harvesting

The National Library collects and archives a diverse range of domestic public online material. Online material is harvested with an automatic crawler or with the help of the publishers. If the National Library cannot harvest the material automatically but considers it important, it contacts the publisher, who either makes the harvest possible or submits the material by other means.

The materials can be found fully indexed in the library's internet archive. The archive will be accessible to the public on the legal deposit workstations located in the National Library, the Library of Parliament, and the Finnish Film Archive as well as in all deposit libraries in Finland.

Annual web harvest

A large-scale harvest of Finnish domains is conducted at least once a year with an automatic web crawler. The goal is to harvest as much online material as possible from Internet domains such as 'fi' and 'ax'. Other domestic webpages are also archived extensively.
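
As a rough illustration of the domain-based scope described above, the following Python sketch checks whether a URL's host falls under an in-scope top-level domain. The function name and the set of domains are assumptions made for this example; the library's actual scope rules are not documented here, and they also cover domestic pages outside these domains.

```python
from urllib.parse import urlsplit

# Assumption: scope is approximated by top-level domain only.
# The real harvest also includes domestic pages under other domains.
IN_SCOPE_TLDS = {"fi", "ax"}


def in_domain_scope(url: str) -> bool:
    """Return True if the URL's host ends in one of the in-scope TLDs."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
    if not host:
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in IN_SCOPE_TLDS
```

A check of this kind looks only at the host name, so a path such as /fi on a foreign host does not bring a page into scope.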

Thematic harvest

The purpose of thematic harvests is to archive online material relating to a particular subject or topical issue. These materials include:

  • Current materials relating to important national and state affairs (e.g., elections and official visits)

  • Materials that relate to other events and are in danger of disappearing from the internet soon after the event (e.g., major sports competitions, festivals and concerts)

  • Unexpected events in global politics, natural disasters and other similar events

  • Harvests which are conducted in cooperation with memory organisations and various research institutions

Link lists for the thematic harvest are compiled by the library staff. The materials will be catalogued in the National Bibliography (Fennica) and National Discography (Viola) where applicable.


Technical information

The National Library harvests webpages using the Heritrix web crawler. The primary task is to archive webpages, but other types of material are also archived (e.g., files available over FTP).

Harvesting is spread across many websites and time periods so that the load on any single web server remains small. Even our largest harvests have not generated a notable increase in load on the backbone network.
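
The idea of spreading requests over hosts and over time can be sketched as follows. This is a minimal illustration in Python, not a description of how Heritrix is configured at the library; the function name and the five-second per-host delay are assumptions chosen for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def schedule(urls, per_host_delay=5.0):
    """Assign each URL a start offset (in seconds) so that requests to
    the same host are at least `per_host_delay` seconds apart."""
    next_slot = defaultdict(float)  # host -> earliest free offset
    plan = []
    for url in urls:
        host = urlsplit(url).hostname
        plan.append((next_slot[host], url))
        next_slot[host] += per_host_delay
    # Stable sort by offset: different hosts share a time slot,
    # while a single host is visited only once per slot.
    plan.sort(key=lambda item: item[0])
    return plan
```

With three URLs on one host and one URL on another, the first request to each host is planned at offset 0, and the remaining requests to the busier host follow at 5 and 10 seconds.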


While harvesting, the National Library's robot identifies itself with the following data:

User-Agent: Mozilla/5.0 (compatible; heritrix/1.14.0+http://www.nationallibrary.fi/)
From: kk-webcrawler@helsinki.fi
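
For illustration, a request carrying these two headers could be built with Python's standard library as shown below. The helper name build_request is hypothetical; Heritrix sets these headers through its own configuration, so this is only a sketch of what the resulting request looks like.

```python
from urllib.request import Request

# Header values copied from the identification data above.
CRAWLER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; heritrix/1.14.0+http://www.nationallibrary.fi/)",
    "From": "kk-webcrawler@helsinki.fi",
}


def build_request(url: str) -> Request:
    """Build (but do not send) a request that identifies the crawler."""
    return Request(url, headers=CRAWLER_HEADERS)
```

Webmasters can look for the same User-Agent and From values in their server logs to recognise visits by the crawler.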

The National Library also searches for webpages located in Finland by scanning servers on the Internet and checking whether they publish webpages (HTTP, port 80). This search is carried out from the National Library host nwa5a.lib.helsinki.fi (IP 128.214.91.134).


The content of robots.txt files is usually taken into account while harvesting. However, the National Library may decide to archive material that is excluded by a robots.txt file if the material is considered an important part of the harvest in question.
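
The policy described above, honouring robots.txt by default with an explicit exception for important material, can be sketched with Python's standard urllib.robotparser module. The function may_fetch and its override flag are assumptions made for this example and do not correspond to a documented Heritrix or library setting.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str,
              override: bool = False) -> bool:
    """Return True if the URL may be harvested.

    `override` models the case where material is judged important
    enough to archive regardless of the robots.txt rules.
    """
    if override:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In this sketch the override is decided per call, which mirrors the text: the exception applies to specific material in a specific harvest, not to the crawler as a whole.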


Harvested files and the data communications of the harvesting process are archived as such in ARC or WARC format.
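
To give an idea of what a stored record looks like, the following Python sketch assembles a single WARC 1.0 response record by hand. It is a simplified illustration only: the crawler writes its own archive files, the function name is hypothetical, and optional header fields such as digests are left out.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Wrap a raw HTTP response (status line, headers, body) in a
    minimal WARC/1.0 'response' record."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {timestamp}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # Header block ends with an empty line; the record ends with two CRLFs.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return head + http_payload + b"\r\n\r\n"
```

Because the payload is the HTTP response exactly as received, the record preserves the data communication itself and not only the harvested file.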

All questions relating to web harvests can be directed to e-vapaa(at)helsinki.fi.

From site
URL : http://www.nationallibrary.fi/text/index/publishers/deposit/webharvesting.html