Research Exchange: Preservation in digital research

Our Research Exchange series features interviews with researchers across The National Archives in which they discuss their new discoveries and their work’s potential for impact. The purpose of the series is to highlight and demystify the research we do and open it up to new ideas and discussion.

This month, to coincide with our Annual Digital Lecture, we have three blogs focused on themes related to our digital research: Preservation, Presentation and Interpretation.

This blog focuses on the theme of Preservation – how do we use digital tools to safeguard historical records? We’ve invited some of our researchers to speak about the following projects:

Tom Storrar and Michael Tobin on the Web Archive

The onset of the COVID-19 pandemic initiated a proliferation of rapidly changing online government content. The frequency of change and the significance of the content was unparalleled, and this persisted for 18 months.

As the official web archive of government, it is our responsibility to capture the government’s response to the pandemic. This has meant a dual effort of researching both advanced capture methods and analysis approaches.

In terms of capture, we have used novel, open-source capture software, combined with in-house processes such as data-driven archiving, change detection and cross-collection keyword capture, to increase our capacity and coverage. In analysing the collection, we have designed a database structure that will allow us to answer complex questions about it. This has involved creating a new conceptual model of the collection and experimenting with different algorithms, which has opened avenues for new ways to access the collection.
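To make the idea of change detection more concrete, here is a minimal sketch in Python. It is not the Web Archive team's actual process, which the text does not describe in detail. It simply assumes that a page counts as changed when a hash of its whitespace-normalised content differs from the hash stored at the last crawl. The function names and the example.gov.uk URLs are hypothetical.

```python
import hashlib
import re


def fingerprint(html):
    """Return a SHA-256 digest of the page text with whitespace collapsed."""
    normalised = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def pages_to_recapture(previous, current_pages):
    """List URLs whose content is new or differs from the stored fingerprint."""
    changed = []
    for url, html in current_pages.items():
        # A URL with no stored fingerprint is treated as changed.
        if previous.get(url) != fingerprint(html):
            changed.append(url)
    return sorted(changed)


# Hypothetical fingerprints stored after the previous crawl.
last_seen = {
    "https://example.gov.uk/guidance": fingerprint("<p>Stay at home</p>"),
    "https://example.gov.uk/contact": fingerprint("<p>Contact us</p>"),
}

# Hypothetical content fetched today.
fetched = {
    "https://example.gov.uk/guidance": "<p>Stay alert</p>",
    "https://example.gov.uk/contact": "<p>Contact   us</p>",
    "https://example.gov.uk/new-page": "<p>New</p>",
}

changed_urls = pages_to_recapture(last_seen, fetched)
print(changed_urls)
```

In this sketch the contact page is not flagged, because only its whitespace differs. A real crawler would probably need a more forgiving comparison, since many pages contain timestamps or other content that changes on every request.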

Dr Santhilata Kuppili Venkata on Email contextualisation

Email will be a resource of paramount importance to future historians. But unlike other born-digital collections, email archives pose unique challenges when it comes to making them searchable. The discovery tool developed as part of our project (eConDist) enables users to run context-based searches over the contents of emails.

Sarah VanSnick on environmental modelling in the Collection Care Department

We monitor and collect a lot of environmental condition data (temperature and relative humidity (RH)). This is to ensure conditions in our storage spaces are suitable for long-term archival storage. Environmental modelling helps inform our decision-making about the future of our building and collection.

What motivated you to begin this research?

[TS and MT]: In the early days of the COVID-19 outbreak it quickly became clear that this project was a huge undertaking and that we would require new approaches and tools. Some of these were in the pipeline already but were brought forward due to the unprecedented challenge.

[SKV]: For over two decades, organizations and individuals have used email as a primary communication channel, and email collections have increasingly become part of archival workflows. While many people in the archival profession have been wrangling with the vital pre-conditions for providing access to these collections (e.g. privacy), another important consideration is how end-users will actually engage with them.

Unlike other born-digital collections, email archives pose unique challenges when it comes to making them searchable. This is often overlooked since emails are currently rarely accessible due to privacy restrictions. Yet even where they are accessible (Enron Email Corpus, US presidential emails), providing access to researchers in a meaningful way can be challenging due to the inherently networked nature of email conversations. To address this issue, we have developed eConDist, a prototype tool that allows for a context-rich search of organizational email corpora.

[SV]: This project was needed to check that the changes we implemented in our building management system in 2013 have been providing the right environmental conditions for the collection over the longer term.

What surprising discoveries have you made?

[TS and MT]: Keyword special crawls were run over all sites in our database. This led to some sites which had been closed being captured again. Some of the URLs were now pointing at completely random sites!

[SKV]: There are two significant discoveries. The first was a context-based search tool that works on emails to find patterns or themes. This is a new way of searching emails that goes beyond keyword search. The second was a set of effective ways to incorporate human input to personalize the search.

[SV]: We discovered that our building and mechanical systems perform least well against the criteria we set not in the hottest years, but in the wettest! To understand the risks going forward we need to better understand the interaction between our building and extreme climate events.
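A comparison of this kind, performance against set criteria year by year, could be sketched as follows. This is an illustration only. The target bands and the readings are hypothetical, since the actual criteria and data are not given here, and the function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical target bands; the real criteria are not stated in the text.
TEMP_RANGE_C = (13.0, 20.0)
RH_RANGE_PCT = (35.0, 60.0)

# Hypothetical readings: (year, temperature in °C, relative humidity in %).
READINGS = [
    (2018, 18.5, 48.0),
    (2018, 19.8, 52.0),
    (2019, 17.0, 61.5),
    (2019, 16.5, 63.0),
    (2019, 18.0, 55.0),
    (2019, 17.5, 58.0),
]


def in_range(value, bounds):
    """Return True when value lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


def compliance_by_year(readings):
    """Return, per year, the fraction of readings with both values in band."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for year, temp_c, rh_pct in readings:
        totals[year] += 1
        if in_range(temp_c, TEMP_RANGE_C) and in_range(rh_pct, RH_RANGE_PCT):
            passes[year] += 1
    return {year: passes[year] / totals[year] for year in totals}


compliance = compliance_by_year(READINGS)
print(compliance)
```

In this made-up data the second year fails on humidity rather than temperature. Setting yearly figures like these alongside external weather records is one way a pattern such as "the wettest years, not the hottest" could show up.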

What makes this project important – for The National Archives and beyond?

[TS and MT]: As we all know, the ramifications of COVID-19 are huge. When the dust settles, it will be important to have a complete record of every aspect of the government’s public response to the pandemic and how it affected policy, public advice and everyday life.

As such, the collection will be an invaluable source for researchers. As with other departments at The National Archives, the UK Government Web Archive should be a pioneer in its field, and by developing innovative ways of capturing and investigating collections we will be in a position to share our learning with other web archives, contributing to developing practice in the field. We are also now looking into how we provide new ways of accessing content from the archive, which will open up the types of research that users can do on the material.

[SKV]: Emails are unlike many other types of digital data, representing conversations between two or more people, often in real-time, and in organizational settings in which other forms of communication may also take place, such as meetings and phone calls.

The immediacy of email messages as a historical source provides authenticity and adds a personal touch to historical events. Also, email clearly tracks the development of a conversation over time through the email thread in a way that older forms of information exchange did not, as well as generating a network of recipients and authors.

This complex nature of email means that search and discovery in such a dataset are not straightforward, but a discovery tool can also benefit from the additional information about timing and communication networks.

The very existence of email archives is encouraging archive users to expect more from these collections. Users want to explore the network properties of collections to search for patterns across multi-custodian archives, but they also want to understand specific constraints and decisions, such as why someone acted one way rather than another.

For these types of questions, detailed readings and interpretations of email are necessary. Thus, a collection like the Enron dataset only becomes viable as a historical source when placed within various forms of context. The above issues make our research important when looking at preservation and access.

[SV]: This environmental modelling project has demonstrated that, on the whole, the results we are achieving in the real world are in line with what our earlier model predicted and that our current environmental management strategies are sound. We have been one of the early adopters of seasonally adjusting our set point to take advantage of external conditions and shutting down our HVAC (heating, ventilation and air conditioning) system overnight and at weekends. The effectiveness of this sustainability strategy can help others in the cultural heritage sector make decisions about their own approach to environmental storage conditions.

What has been the biggest challenge of this project and what were the solutions?

[TS and MT]: We had a coming together of circumstances that made the work particularly challenging – we suddenly had to archive a greater variety of things, at a faster rate than we’d ever done, in order to capture the latest content before it changed.

This would have been challenging enough but the team had to deliver this while working remotely. We adopted lots of new things at this time, such as expanding our use of cloud computing and researching new and helpful tools, many of which didn’t work or led on to new ideas.

Often these tools were not mature, so we found that they would change without notice, or the websites we were trying to archive would change and the tools would just stop working! There were some dead ends that we needed to back out of and look at differently. The solutions were found in the knowledge across the team and the spirit of trial and error. It led to some interesting insights and ultimately a really high-quality collection.

[SKV]: Existing tools use NLP techniques such as named-entity recognition and topic modelling to enable name-based searches and keyword searches. They are useful for searching known keywords or establishing topic models, but they provide no functionality to explore and ‘deep-dive’ into particular topics that are not associated with clear and previously known keywords, or to understand individuals and their interactions around specific issues.

Nevertheless, existing tools are useful to redact sensitive personal data and to give a broad understanding of the nature of a collection. Because emails are informal, complex and loosely structured texts, keyword search alone is often not sufficient to establish context. People do not write with keywords, and the terms that we use to describe our search interest do not necessarily match how individuals were writing about it in their emails at the time. A keyword may be used in some contexts and a colloquial equivalent may be used at other times. As a result, we need a discovery tool that extracts the meaning of whole sentences and paragraphs. We have been quite successful in addressing these challenges through our research.

[SV]: Digital skills and knowledge have been the biggest challenge as I’m not a data scientist! To get up to speed, I did take some coding courses, which helped, and got some help from the coding community.

How will the digital methods used remain up to date over time?

[TS and MT]: That’s a very important question. Web archiving is ever-changing by nature. It is unlikely that approaches developed over the last 18 months will still be the best ones next year, or in some cases next month! As such, things will need to be tweaked and updated over time as both web technology and crawling technology change. The advances we’ve made in response to the pandemic represent a large step forward in our capacity and will provide a solid basis moving forward.

[SKV]: During our engagement with researchers interested in looking at email, a number of issues were raised that existing tools cannot fully address, such as:

Can we search multiple inboxes of different email accounts?
How can we search across multiple custodians for conceptually related information to develop explanations?
How can we identify particular events and the range of actors involved?
How can we scour email collections to find the sequence of exchanges that led to an important decision?

In practice, it can be difficult to find answers to these questions with existing search modalities, so in our project we sought to develop a discovery tool that helps users find better information for their queries.

Making use of advancements in NLP and deep learning, we developed the email contextualisation discovery tool (eConDist) to address these requirements. Our search algorithms combine the knowledge graph developed from the network properties of emails with content analysis.
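As a rough illustration of what "network properties" can mean in practice, the sketch below builds a simple sender-to-recipient graph from message headers using Python's standard `email` package. It is not eConDist's code, and the text does not say how the project's knowledge graph is built. The sample messages and the `build_edge_counts` helper are hypothetical.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical messages in RFC 5322 style: headers, blank line, body.
RAW_EMAILS = [
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Q3 forecast\n\nDraft attached.",
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Subject: Re: Q3 forecast\n\nLooks fine.",
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Board meeting\n\nAgenda below.",
]


def build_edge_counts(raw_emails):
    """Count messages for each (sender, recipient) pair found in the headers."""
    edges = Counter()
    for raw in raw_emails:
        msg = message_from_string(raw)
        senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
        recipients = [
            addr
            for _, addr in getaddresses(
                msg.get_all("To", []) + msg.get_all("Cc", [])
            )
        ]
        for sender in senders:
            for recipient in recipients:
                edges[(sender.lower(), recipient.lower())] += 1
    return edges


edges = build_edge_counts(RAW_EMAILS)
for (sender, recipient), count in sorted(edges.items()):
    print(sender, "->", recipient, count)
```

Edge counts like these show who wrote to whom and how often. Combining them with what the messages actually say is the part that needs content analysis.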

eConDist offers two kinds of search over an email collection. The first type of search is targeted at novice users who are new to a particular email collection. An AI algorithm developed with the help of attention models (BERT embedding models) interprets the overall meaning of the query and provides themes of emails that align with the general meaning of the query. This type of search allows users to gain high-level insights into collections which would, in turn, encourage them to refine their search using the second type of search by formulating queries more specific to a topic or concept, with greater knowledge of the context of who was involved and when. In this way, eConDist iteratively aids users to gain a holistic picture of a concept they are searching for. As every email collection is unique, the search algorithm needs to be trained on each collection to best support the range of possible users.
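The general shape of a meaning-based search, in which the query and each email are turned into vectors and emails are ranked by similarity, can be sketched as follows. To keep the example self-contained, a simple word-count vector stands in for the BERT embeddings mentioned above, so this toy version still matches on shared words rather than on meaning. The `embed`, `cosine` and `rank` functions and the sample emails are hypothetical and are not taken from eConDist.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, emails):
    """Return (score, subject) pairs, most similar to the query first."""
    query_vector = embed(query)
    scored = [
        (cosine(query_vector, embed(body)), subject)
        for subject, body in emails
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical (subject, body) pairs.
EMAILS = [
    ("Lunch", "Shall we get lunch on Friday"),
    ("Forecast", "The revenue forecast for next quarter looks weak"),
    ("Audit", "Auditors asked about the revenue numbers"),
]

ranking = rank("quarter revenue forecast", EMAILS)
for score, subject in ranking:
    print(f"{score:.2f}  {subject}")
```

Swapping `embed` for a sentence-embedding model would leave the ranking step unchanged, while letting emails that use different wording for the same idea score well.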

Another consideration was helping to prioritize the sender’s or recipient’s limited attention and working through the multiple concepts that can appear within a single email. The search algorithm ranks emails using noisy features and allows users to select sequentially from a ranked list. Innovative technology and the latest developments in NLP are used in this research.

[SV]: The Collection Care team are already exploring the next generation of possibilities for digital modelling of our storage environments. We are looking at BIM (Building Information Modelling) and digital twins as a way of modelling and managing our environmental data. Both of these technologies are used in other industries and could be adapted to our needs.

Where can we learn more about the project?

[TS and MT]: You can learn more about the UK Government Web Archive by visiting https://www.nationalarchives.gov.uk/webarchive/. Here are a couple of examples of the effort we have made during the pandemic:

Daily captures of Public Health England’s COVID-19 dashboard, which have resulted in a comprehensive archive of the data and of the evolution in how it was presented: https://webarchive.nationalarchives.gov.uk/ukgwa/*/https://coronavirus.data.gov.uk/

Similarly, but for the GOV.UK page: https://webarchive.nationalarchives.gov.uk/ukgwa/*/https://www.gov.uk/coronavirus

And please stay tuned – we are planning to release more information and data about our COVID-19 collection in the coming months.

[SKV]: The eConDist tool was developed as part of the AHRC US-UK partnership development funded project, ‘Historicizing the Dot.com Bubble and Contextualizing Email Archives’.

The National Archives collaborated with the following academic and research institutes:

Prof Stephanie Decker (PI), University of Bristol, UK
Dr Santhilata Kuppili Venkata (Co-I), The National Archives, UK
Dr Adam Nix (Co-I), University of Birmingham, UK
Prof David Kirsch (Co-I), University of Maryland, US

To learn more about this project, please visit our project page, AURA workshop 2 (2021), AURA workshop 3 (2021) and the Digital Archive Learning Exchange (DALE).

[SV]: I’ll be presenting some of this work at the Archives Supporting Environmental Sustainability event on 8 and 10 November 2021 [buy tickets].

To learn more about our digital research, please visit https://www.nationalarchives.gov.uk/about/our-research-and-academic-collaboration/.