Machine learning in the archives

At our site at Kew, and in deep storage in a salt mine in Cheshire, The National Archives holds a total of around 200 kilometres of paper documents spanning almost 1,000 years. Our collection grows each year and we are currently transitioning from a 30-year rule to a 20-year rule in terms of receiving new records from government departments. This means that, with regard to records we hold that are open to the public, we are notionally in the mid-1990s: at this time, printing and filing documents was still the norm and so a manual process of appraisal, selection and sensitivity review remains just about practical.

However, as we reach the late 1990s and the new millennium, the records that we take in from government departments will change dramatically, because by then computers had become the main way of working. This means that we will be taking in a different style of record (digital rather than paper), but more importantly it means a phenomenal increase in the number of records being accessioned. To put the transition from a paper world to a digital one into perspective, IBM estimates that 90% of the world’s data was created in the last two years alone[ref]http://www-07.ibm.com/in/solution/10_Key_Marketing_Trends_for_2017.html[/ref].

Every record taken in needs to be appraised, selected and to undergo a sensitivity review: if we are to manage the sheer volume of work, we will need the assistance of machines to support human reviewers. This is a huge, technology-enabled transformation in the way that we work; to adapt to this, we have begun developing the digital skills of our teams.

Machine learning hackathon

On 7 and 8 December 2017, 35 members of our digital teams took part in a machine learning hackathon. This event was designed to demystify something that most people assume is very complex: all the staff involved were beginners in machine learning and the event kicked off with introductions to algorithms and different techniques.

A digital skills workshop at The National Archives, March 2018

Different teams put their learning into practice and experimented with a wide range of data, tackling various problems faced in the preservation of and access to digital records.

Topic modelling[ref]http://www.matthewjockers.net/macroanalysisbook/lda/[/ref] was used to find key phrases in Discovery record descriptions, enabling innovative exploration of the catalogue, and to identify the subjects being discussed across Cabinet Papers. Other projects included a system that found the most important sentence in a news article in order to generate tweets automatically, while another team built a system to recognise computer code written in different programming languages, which is a major challenge for digital preservation.
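To give a flavour of what picking the most important sentence from an article can involve, here is a minimal sketch in Python. It is not the code written at the hackathon: it assumes a simple word-frequency heuristic, in which a sentence scores highly when it contains the words that occur most often across the whole article. The function names and the short stop-word list are invented for illustration.

```python
import re
from collections import Counter

# A small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "it", "that",
    "for", "on", "was", "with", "as", "by", "at", "are", "this", "be",
}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case the sentence and keep words that are not stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def most_important_sentence(text):
    """Return the sentence whose words are, on average, the most frequent
    across the whole text. Returns an empty string for empty input."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    # Count how often each non-stop word appears in the whole article.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words)

    return max(sentences, key=score)
```

A real system would need better sentence splitting and a fuller stop-word list, and would also have to check that the chosen sentence fits within the length limit of a tweet.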

Beyond the experimentation that took place during the hackathon, we also had a non-technical team that explored the ethical implications of the use of machine learning in archival practice.

Where next?

As a result of the hackathon, our digital teams not only developed new skills but also identified two areas for further investigation: automated recognition of coding languages, and topic modelling of catalogue descriptions.
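As a rough illustration of the first of these areas, the sketch below guesses the language of a code snippet by counting tell-tale patterns. This is a hand-written heuristic, not a machine learning model and not the approach taken by our teams: the `MARKERS` table and the `guess_language` function are hypothetical, and a trained classifier would learn such signals from labelled examples rather than have them listed by hand.

```python
import re

# Hypothetical, hand-picked marker patterns for three languages.
MARKERS = {
    "python": [r"^\s*def \w+\(.*\):", r"^\s*import \w+", r"\bself\b", r"^\s*elif\b"],
    "java": [
        r"\bpublic (?:static )?(?:class|void)\b",
        r"System\.out\.println",
        r"^\s*import java\.",
    ],
    "c": [r"#include\s*<\w+\.h>", r"\bprintf\(", r"\bint main\("],
}


def guess_language(source):
    """Return the language whose marker patterns match most often,
    or "unknown" if no pattern matches at all."""
    scores = {
        lang: sum(len(re.findall(p, source, flags=re.MULTILINE)) for p in patterns)
        for lang, patterns in MARKERS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Counting surface patterns breaks down quickly on short fragments and on languages that share syntax, which is part of why this was identified as a problem worth further investigation.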

In addition to this, we are co-supervising PhD projects on sensitivity review and understanding large-scale web data, and we are also experimenting with handwritten text recognition technology.

2 comments

David Matthew says:

Fri 8 Jun 2018 at 12:31 am

It is impossible for computers to do the job of humans on sensitivity reviews, as it depends on what the documents actually say and not on broad subjects, and you cannot do that without physically reading them. I think you are overestimating the number of digital-born records that the departments/agencies are going to transfer. Key descriptions in Discovery are so bland (a number have the word “General”) that you will not get to the bottom of the subjects of the records. The result will be the release of records that endanger the UK’s defence and will be in breach of data protection (even more than they already are).

Amanda says:

Wed 13 Jun 2018 at 11:33 am

We have been trying to tell organisations about this for years… they either don’t get it or choose not to.
