Machine learning in the archives
At our site at Kew, and in deep storage in a salt mine in Cheshire, The National Archives holds around 200 kilometres of paper documents spanning almost 1,000 years. Our collection grows each year, and we are currently transitioning from a 30-year rule to a 20-year rule for receiving new records from government departments. This means that, in terms of the records we hold that are open to the public, we are notionally in the mid-1990s: at that time, printing and filing documents was still the norm, so a manual process of appraisal, selection and sensitivity review remains just about practical.
However, as we reach the late 1990s and the new millennium, the records we take in from government departments will change dramatically, reflecting the period when computers became the main way of working. This means we will be taking in a different kind of record (digital rather than paper) but, more importantly, it means a phenomenal increase in the number of records being accessioned. To put the transition from a paper world to a digital one into perspective, IBM estimates that 90% of the world's data was created in the last two years alone [1].
Every record taken in needs to be appraised, selected and subjected to a sensitivity review: if we are to manage the sheer volume of work, we will need the assistance of machines to support human reviewers. This is a huge, technology-enabled transformation in the way we work; to adapt, we have begun developing the digital skills of our teams.
Machine learning hackathon
On 7 and 8 December 2017, 35 members of our digital teams took part in a machine learning hackathon. This event was designed to demystify something that most people assume is very complex: all the staff involved were beginners in machine learning and the event kicked off with introductions to algorithms and different techniques.
Different teams put their learning into practice and experimented with a wide range of data, tackling various problems faced in the preservation of and access to digital records.
Topic modelling [2] was used to find key phrases in Discovery record descriptions, enabling innovative exploration of the catalogue; it was also deployed to identify the subjects discussed across Cabinet Papers. Other projects included a system that found the most important sentence in a news article to generate automated tweets, while another team built a system to recognise computer code written in different programming languages – a major challenge for digital preservation.
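To give a flavour of what "finding key phrases in record descriptions" can involve, here is a minimal sketch using TF-IDF term weighting. This is not the hackathon teams' actual code or method (topic modelling proper, such as LDA, models documents as mixtures of topics); TF-IDF is a simpler, related technique that surfaces terms distinctive to each description. The sample descriptions are invented for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-scoring terms for each document using TF-IDF.

    Terms that occur in every document get an IDF of zero, so common words
    like "in" are pushed to the bottom of the ranking.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    results = []
    for tokens in tokenised:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return results

# Toy stand-ins for catalogue descriptions (invented examples).
descriptions = [
    "correspondence concerning railway construction in kent",
    "maps of railway routes and stations in yorkshire",
    "correspondence on coastal defences in kent",
]
print(tfidf_keywords(descriptions))
```

In practice, catalogue-scale work would use a library such as scikit-learn or gensim and would add stop-word removal and phrase detection, but the core idea is the same: weight each term by how frequent it is in one description and how rare it is across the collection.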
Beyond the experimentation that took place during the hackathon, we also had a non-technical team that explored the ethical implications of the use of machine learning in archival practice.
As a result of the hackathon, our digital teams not only developed new skills but also identified two areas for further investigation: automated recognition of coding languages, and topic modelling of catalogue descriptions.