Web Harvesting

The National Library collects and archives a diverse range of domestic public on-line material. The material is harvested with an automatic crawler or with the help of publishers. If the National Library cannot harvest material automatically but considers it important, it contacts the publisher, who either makes the harvest possible or submits the material by other means.

The materials are fully indexed in the library's internet archive. The archive will be accessible to the public at the legal deposit workstations located in the National Library, the Library of Parliament, the Finnish Film Archive and all deposit libraries in Finland.

Annual Web Harvest

A large-scale harvest of Finnish domains is conducted at least once a year with an automatic web crawler. The goal is to harvest as much on-line material as possible from internet domains such as 'fi' and 'ax'. Other domestic web pages are also archived extensively.
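
As a rough illustration of domain-based scoping, the sketch below checks whether a URL's host name falls under a given top-level domain. The function name in_scope and the constant IN_SCOPE_TLDS are hypothetical, and the list holds only the two domains named above as examples. Since other domestic pages are archived as well, a real harvest scope would need more than this one test.

```python
from urllib.parse import urlsplit

# Hypothetical scope list: the text names 'fi' and 'ax' only as examples.
IN_SCOPE_TLDS = ("fi", "ax")


def in_scope(url, tlds=IN_SCOPE_TLDS):
    """Return True if the URL's host name ends in one of the given top-level domains."""
    host = urlsplit(url).hostname
    if host is None:
        # No network location at all (e.g. a mailto: link).
        return False
    host = host.rstrip(".").lower()
    # Match on a label boundary so that 'notfi' does not count as '.fi'.
    return any(host == tld or host.endswith("." + tld) for tld in tlds)
```

Matching on ".fi" rather than on the bare string "fi" keeps host names that merely end in those letters out of scope.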

Thematic Harvest

The purpose of thematic harvests is to archive on-line material relating to a particular subject or topical issue. These materials include:

  • current materials relating to important national and state affairs (e.g. elections and official visits)

  • materials relating to other events that are in danger of disappearing from the internet soon after the event (e.g. major sports competitions, festivals and concerts)

  • unexpected events in global politics, natural disasters and other similar events

  • materials harvested in cooperation with memory organisations and various research institutions.

Link lists for thematic harvests are compiled by library staff. Where applicable, the materials will be catalogued in the National Bibliography (Fennica) and the National Discography (Viola).


Technical Information

The National Library harvests web pages using the Heritrix web crawler. The main task is to archive web pages, but other types of material are also archived (e.g. files available over FTP).

Harvesting is spread across many web sites and over time so that the load on any single web server remains small. Even our largest harvests have not caused a notable increase in load on the backbone network.
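
One simple way to spread requests across sites is to order the URL list round-robin by host, so that consecutive requests rarely go to the same server. The sketch below is illustrative only and does not describe the library's actual crawler configuration. The function name interleave_by_host is hypothetical, and a real crawler would also wait between requests to the same host.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def interleave_by_host(urls):
    """Order URLs round-robin across hosts, preserving per-host order."""
    queues = OrderedDict()
    for url in urls:
        host = urlsplit(url).hostname or ""
        queues.setdefault(host, deque()).append(url)

    ordered = []
    while queues:
        # Take one URL from each host that still has work, then repeat.
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered
```

With three URLs on one host and one URL on each of two others, the single-URL hosts are visited between the first and second request to the busy host.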


While harvesting, the National Library's robot identifies itself with the following data:

User-Agent: Mozilla/5.0 (compatible; heritrix/1.14.0+http://www.nationallibrary.fi/)
From: kk-webcrawler@helsinki.fi
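
A site operator who wants to recognise these requests, for example in server logs, could compare the request headers with the values above. The following is a minimal sketch under that assumption. The function name is_library_crawler is hypothetical, and the check ignores the Heritrix version number because it may change between harvests. Request headers can be forged, so a match is only an indication.

```python
CRAWLER_FROM = "kk-webcrawler@helsinki.fi"


def is_library_crawler(headers):
    """Return True if the request headers match the published identification."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").lower()
    sender = normalized.get("from", "").strip().lower()
    agent_matches = "heritrix" in user_agent and "nationallibrary.fi" in user_agent
    return agent_matches or sender == CRAWLER_FROM
```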

The National Library also searches for web pages located in Finland by scanning servers on the internet and checking whether they publish web pages (HTTP, port 80). This search is carried out by the National Library host nwa5a.lib.helsinki.fi (IP 128.214.91.134).


The contents of robots.txt files are usually taken into account while harvesting. However, the National Library may decide to archive material that is excluded by a robots.txt file if the material is considered an important part of the harvest in question.
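
The policy described above, honouring robots.txt by default but allowing exceptions for important material, can be sketched with Python's standard urllib.robotparser module. The function name may_harvest and the override_prefixes parameter are hypothetical; how the library actually records its exceptions is not stated here.

```python
from urllib.robotparser import RobotFileParser


def may_harvest(robots_txt, user_agent, url, override_prefixes=()):
    """Return True if robots.txt permits the URL or an explicit override covers it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(user_agent, url):
        return True
    # Assumed mechanism: a curated list of URL prefixes judged important
    # enough to archive despite the robots.txt exclusion.
    return any(url.startswith(prefix) for prefix in override_prefixes)
```

Keeping the overrides as an explicit list makes each exception a deliberate, reviewable decision rather than a global switch.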


Harvested files and the data communications of the harvesting process are archived unmodified in ARC or WARC format.
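
To give a feel for the WARC format, the sketch below assembles a single, simplified WARC 1.0 record of type "resource" by hand: a version line, named header fields, a blank line, the payload and a closing pair of line breaks. It is an illustration only. The function name build_warc_resource_record is hypothetical, and a crawler would typically store full request and response records rather than this reduced form.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri, payload, content_type, when=None):
    """Return the bytes of one minimal WARC 1.0 'resource' record."""
    when = when or datetime.now(timezone.utc)
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        # Content-Length counts the payload bytes only.
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Every record ends with two CRLF pairs after the payload.
    return head + payload + b"\r\n\r\n"
```

Because each record carries its own length and identifier, records can be appended one after another into a single archive file.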

All questions relating to web harvesting can be directed to e-vapaa(at)helsinki.fi.

From site
URL : http://www.nationallibrary.fi/publishers/deposit/webharvesting.html