Improved content search for digitized newspaper material now in use

The National Library's oldest digitized newspaper material, from 1771 to 1914, is being reprocessed to significantly improve both its text recognition and the relevance of text searches. Optical character recognition (OCR) is a key factor in the quality and usability of digitized historical source material. The newspaper collection is in Finnish and Swedish. The better-quality material has been released in stages since the summer of 2021, and about 90% of the newspapers have been made available so far.

A selection of newspapers from 1915–18 that are important to research and society will also be re-released: Uusi Aura, Uusi Suometar, Helsingin Sanomat, Åbo Underrättelser, Västra Finland and Hufvudstadsbladet.

In total, almost 2.5 million pages of better-quality newspaper material will be available on digi.nationallibrary.fi. The National Library is also planning new text recognition for some of the newspapers published in the 1910s.

The relevance of content searches will improve significantly

Compared to the previous OCR results, the quality of the reprocessed material is significantly better, by 17 percentage points on average. The OCR was carried out with the help of the Transkribus platform, which was originally developed to recognize handwritten text but has also been applied successfully to printed material in the Horizon 2020-funded NewsEye project. The National Library is involved in the project.
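To make the "percentage points" figure concrete, the short Python sketch below contrasts a gain in percentage points (an absolute difference between two percentages) with a relative gain. The accuracy values 75% and 92% are hypothetical, chosen only so that the difference is 17 percentage points. The announcement does not state the underlying figures or the quality metric used, and the function names are illustrative.

```python
def percentage_point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_gain(old_pct: float, new_pct: float) -> float:
    """Improvement relative to the old value, expressed in percent."""
    return (new_pct - old_pct) / old_pct * 100


# Hypothetical OCR accuracy before and after reprocessing (not from the source).
old_accuracy = 75.0
new_accuracy = 92.0

pp = percentage_point_gain(old_accuracy, new_accuracy)
rel = relative_gain(old_accuracy, new_accuracy)

print(f"Gain: {pp:.1f} percentage points")   # Gain: 17.0 percentage points
print(f"Relative gain: {rel:.1f} %")         # Relative gain: 22.7 %
```

With these assumed values, the same improvement reads as 17 percentage points but as roughly 23% in relative terms, which is why the unit matters when comparing OCR results.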

The improvements in the text recognition of newspaper material target an older typeface, which Transkribus recognizes better than the previously used programs did.

The better-quality digital newspaper material has been released as part of the Digital Open Memory project. The work used the automatic text recognition model developed in the NewsEye project of the European Union's Horizon program and was carried out in collaboration with the European READ-COOP cooperative. The National Library is a member of both the NewsEye project and the READ-COOP cooperative. The NewsEye project (newseye.eu) has developed software tools for researching and using digitized historical newspapers. The READ-COOP cooperative (readcoop.eu) uses artificial intelligence to improve the usability of historical materials.

Read more in Tietolinja's article: https://tietolinja.kansalliskirjasto.fi/2021-2/2102-digi/

digi.nationallibrary.fi

Contact person

Minna Kaukonen
Mikkeli office
0504155450
National Library of Finland
Saimaankatu 6
50100 Mikkeli