1.1 Introduction to Historical Newspaper Analysis and Digitisation

1.1.3 From paper to digital

We have seen that digitising historical newspapers is an excellent asset for research and the digital humanities. We also know that powerful and efficient tools are available. But is it enough?

The answer is definitely no!

The process that leads to the availability of text extracted from historical newspapers is very complex and requires many processing steps. Scientific, methodological or technical problems may arise at each stage of this process. In most cases, you do not know what process the data has undergone, so you face a black box problem.

To analyse historical newspapers and obtain reliable results, it is essential to understand the processing the data has undergone and its impact on your research.

When we mention the potential biases that exist when using digitised historical newspaper data, we immediately think of the well-known problem of optical character recognition (OCR). We will come back to this. However, other problems arise before that stage. The first is related to an observation that is still valid today: not everything is digitised!

In a 2013 article, whose conclusions remain valid, Canadian historian Ian Milligan made the following observation:

"It all seems so orderly and comprehensive. Instead of firing up the microfilm reader to navigate the Globe and Mail or the Toronto Star, one needs only to log into online newspaper databases. A keyword search for a particular event, person, or cultural phenomenon, brings up a list of research findings. Previously impossible research projects can now be attempted. This process has fundamentally reshaped Canadian historical scholarship. We can see this in Canadian history dissertations. In 1998, a year with 67 dissertations, the Toronto Star was cited 74 times. However, it was cited 753 times in 2010, a year with 69 dissertations." (p. 1)


Obviously, the online availability of historical newspapers encourages their use in scientific publications. At first glance, this does not seem to raise any problem, yet the author draws two conclusions from his work:

"Firstly, online historical databases have profoundly shaped Canadian historiography. In a shift that is rarely – if ever – made explicit, Canadian historians have profoundly reacted to the availability of online databases. Secondly, historians need to understand how OCR works, in order to bring a level of methodological rigor to their work that use these sources." (p. 2)

If the documents are digitised, does that mean they are easily accessible?

After digitisation, the documents are usually made available in a digital library. But to be truly accessible, they need further processing. Automatic document indexing is a field of computer science and information science that uses computational methods to organise a collection of documents and thereby make it easier to search for content within that collection. The diversity of document types (textual, audiovisual, Web) gives rise to very different approaches, particularly in terms of data representation. Nevertheless, these approaches rest on a common set of techniques, such as feature extraction, clustering and quantisation.
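To make the idea of indexing more concrete, here is a minimal sketch of an inverted index, one common way of organising a text collection for search. It is an illustration only, not a description of how any particular digital library works: the page identifiers, the sample texts and the function names (tokenize, build_inverted_index, search) are all invented for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample collection: three invented newspaper pages.
documents = {
    "page-1": "The strike in Toronto ended on Monday",
    "page-2": "Toronto council debates the new tramway",
    "page-3": "Harvest reports from the western provinces",
}

index = build_inverted_index(documents)
print(sorted(search(index, "Toronto strike")))
```

Even in this small sketch, several choices are made on the user's behalf: how words are split, whether case matters, and whether all query terms must be present. Real systems make many more such choices, and users rarely see them.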

Indexing relies on a variety of techniques and on methodological and technical choices, but its purpose is always to facilitate the search for information. In a digital library of historical newspapers, this objective is difficult to achieve because users' needs differ widely. A person tracing their ancestors in historical newspapers will not run the same searches as a researcher building a scientific corpus.

Once again, the characteristics of the indexing and the workings of the search engine are unknown to the user. Depending on how the documents have been indexed and on the search strategy the user adopts, it is common to find that certain documents are, in practice, excluded from the results even though they rightfully belong there. We are again faced with a black box.
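The following sketch shows one way in which a relevant document can silently drop out of the results. It assumes a hypothetical collection in which one page contains an OCR error ("strlke" instead of "strike") and compares an exact-match search with an approximate one based on Python's standard difflib module. The sample pages, the function names and the 0.8 similarity threshold are assumptions chosen for illustration, and OCR errors are only one of several possible causes of exclusion.

```python
from difflib import SequenceMatcher

# Hypothetical pages; "strlke" on page-2 simulates an OCR misreading of "strike".
pages = {
    "page-1": "general strike declared in toronto",
    "page-2": "general strlke spreads to montreal",
    "page-3": "wheat prices rise again",
}


def exact_search(pages, term):
    """Return ids of pages that contain the term as an exact word."""
    return sorted(pid for pid, text in pages.items() if term in text.split())


def fuzzy_search(pages, term, threshold=0.8):
    """Return ids of pages with at least one word similar enough to the term."""
    hits = []
    for pid, text in pages.items():
        if any(
            SequenceMatcher(None, term, word).ratio() >= threshold
            for word in text.split()
        ):
            hits.append(pid)
    return sorted(hits)


print(exact_search(pages, "strike"))  # page-2 is missed
print(fuzzy_search(pages, "strike"))  # page-2 is recovered
```

With the exact strategy, only page-1 is returned; the approximate strategy also returns page-2. Neither result is wrong in itself, but a user who does not know which strategy the search engine applies cannot tell what is missing.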

Fortunately, platforms such as NewsEye allow very precise and customisable searches, which helps to limit search bias. Unit 2 of this course will focus on how digitised historical texts are stored and classified, and on how the search engine finds and ranks the results.


REFERENCES
  1. Milligan, I. (2013). Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010. https://doi.org/10.3138/chr.694