Beyond words: exploring historical document collections at scale

Earlier this month, I blogged about The National Archives’ collaboration with City, University of London to visualise data from our collections in new ways. This research took place as part of our digital experimentation workshops, hosted in partnership with universities.

In this second blog of the series, Dr Anne Alexander and I discuss how workshops with the University of Cambridge, UK aimed to understand large collections by enabling complex analysis of large-scale datasets.

At The National Archives, we have begun exploring new ways of opening up access to our large-scale collections for research and digital exploration. The scale of our digital collections is huge, spanning nearly 1,000 years of history. With new computational techniques now available, we are asking ourselves how we can enable researchers to best explore these collections.

By doing this, we hope to assist researchers in understanding and using our records in ways that they previously haven’t been able to, opening up more opportunities for pioneering and interdisciplinary research.

In collaboration with Cambridge Digital Humanities, researchers at The National Archives held two workshops to tackle this ambition.

We began with presentations mapping the terrain of current scholarship and state-of-the-art archival and technical approaches to understanding collections at scale.

Our second workshop, supported by Cambridge Big Data, followed this up by bringing together a multidisciplinary group of archivists, computer scientists and humanities scholars to examine the technical challenges of exploring and analysing historical document collections at scale.

Possible approaches

We were eager to consider a wide range of possible methods for understanding The National Archives’ collections at scale.

The potential for automating spatial and temporal readings of large volumes of documents was of particular interest to us. We wanted to understand how this technique might enrich and complement well-established lexical approaches rooted in Natural Language Processing methods, which dissolve document structures into ‘bags of words’.
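To illustrate what is lost when a document is dissolved into a ‘bag of words’, here is a minimal Python sketch. The two text snippets and the `bag_of_words` helper are hypothetical examples written for this post, not material from the workshops or from our collections. The sketch shows that two texts with different ordering and layout can produce identical bags, which is why spatial and temporal readings may add information that lexical approaches discard.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Two hypothetical snippets containing the same words in a different
# order and layout (for example, a margin note placed before or after
# the main text).
original = "The minister approved the draft.\nMargin note: the draft was revised."
reordered = "Margin note: the draft was revised. The minister approved the draft."

bag_a = bag_of_words(original)
bag_b = bag_of_words(reordered)

# The bags are equal even though the documents are laid out differently.
print(bag_a == bag_b)
print(bag_a.most_common(2))
```

Because only word counts survive, the position of the margin note and the order in which the text was composed cannot be recovered from the bag alone.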

Another area of interest for the group was the discovery and analysis of annotations. As the process of annotation can be defined both spatially (by the location of the annotation in relation to word, line or margin) and temporally (as an addition which takes place after the composition of the original text, or as a middle stage in the composition of the final text), this was an interesting starting point for further exploration.

In addition to layout features, we also explored ideas that might develop our understanding of the document as a process rather than an image – for example, could we develop tools to sort a large collection according to where its individual documents stand in timelines of editorial work and policy decision-making?

Finally, we also considered new ways to think about the archive itself and the challenges presented by clashes between the ‘filing cabinet’ structure of the existing catalogue and the networked nature of the digital data emerging both ‘inside’ the archive and ‘outside’ on the World Wide Web.

Based on the findings we have so far, we are now exploring funding opportunities to enable us to continue this work, with the aim of bridging the gap between cultural heritage organisations and academic researchers.

One of the underlying principles of our approach was to try to identify research problems at the intersection of archivists’, humanities researchers’ and computer scientists’ interactions with The National Archives’ collections. We want to find common ground for intellectual inquiry, rather than restricting any one of the partners to a ‘service’-oriented role in a future project.

This research forms part of a series of digital experimentation workshops hosted by The National Archives throughout the 2017-18 academic year, in collaboration with different universities. We kicked off with a blog on our data visualisation workshop series; do keep an eye on the blog for more updates on our digital research.


This blog was co-authored by Dr Eirini Goudarouli, Digital and Technology Research Lead at The National Archives, and Dr Anne Alexander, Co-ordinator, Digital Humanities Network at the Centre for Research in the Arts, Social Sciences and Humanities (CRASSH), University of Cambridge.
