Testing new ‘digitisation on demand’ tools
The National Archives was recently privileged enough to host ‘an international scanathon’, organised by Dr Louise Seaward, University College London.
The purpose of the scanathon was to test new archival tools for ‘digitisation on demand’ developed as part of the READ project (Recognition and Enrichment of Archival Documents), funded by the European Union’s Horizon 2020 research and innovation programme.
At the event an interdisciplinary group of archivists, historians, collections experts and data scientists learned about and experimented with the DocScan mobile app and the ScanTent device. The group also had the chance to discuss the capabilities of these new tools and gave their feedback on their suitability for archival work.
Since the event took place in the week leading up to International Archives Day, the scanathon also included a live link-up with parallel scanathons taking place in two other European archives. We connected with The National Archives of Finland and The State Archives of Zurich to share our experiences of the two new digitisation tools and also had the opportunity to see some of the records held in their collections.
How do the DocScan mobile app and ScanTent work?
The ‘scanathon’ was an occasion for users to test and give feedback on prototypes of the DocScan mobile app and the ScanTent device, two brand new tools that are designed to help archival users take high quality images of historical documents with their mobile phone.
The ScanTent consists of a tented piece of fabric, with internal LED lights and an anti-slip pad at the top. Users place documents inside the tent and then position their phone on top of the ScanTent, ready to take photos. The ScanTent creates a fixed distance between document and camera, provides a consistent source of light and leaves users with their hands free to turn pages or move documents around.
The advantages of the ScanTent become even greater when it is used together with the DocScan app (free of charge and on Android only). DocScan is ideal for those who wish to work hands free with the ScanTent because it has an auto-shoot feature that will take a photo every time a page is turned. DocScan also gives users the option to upload their images directly to the Transkribus platform, where they can be used as training data for Automated Text Recognition.
What is Transkribus?
Transkribus is a transcription platform which enables the automated recognition, transcription and searching of both printed and handwritten historical documents of any date, language or style. The software is at the centre of the READ project, an EU-funded initiative which aims to revolutionise access to archival material through the development and dissemination of Automated Text Recognition and other cutting-edge tools. DocScan and the ScanTent have been developed by the Computer Vision Lab at the Technical University of Vienna, as part of the READ project. By facilitating the digitisation of historical documents, they too aim to enhance the accessibility of global cultural heritage. These innovative contributions to the processes of digitisation, transcription and search are finding a wide audience among researchers, archivists and members of the public.
Images taken with DocScan and the ScanTent are of sufficient quality to be used as basis for training an Automated Text Recognition model to transcribe and search a collection of documents, with impressive results. Automated transcripts of handwritten material with a Character Error Rate (CER) of 5-10% are possible, meaning that the software is capable of accurately transcribing 90-95% of characters written in a given hand. Users can then continue to work with these automatically generated transcripts in Transkribus – searching, editing, correcting, enriching them with tags and exporting them in various formats.
Anyone interested in testing out a ScanTent or finding out more about the future of Transkribus, is welcome to join the project team at the second Transkribus User Conference this November in Vienna.
How is The National Archives using Transkribus?
At The National Archives, we have been experimenting with how Transkribus can open up access to our collections in new ways. We have been testing Handwritten Text Recognition (HTR) software on our collection of wills (catalogue reference: PROB 11), chosen due to their structured language and relatively uniform hand.
As we increased the number of words used in training each model, we found significant improvement in both Word Error Rate (WER) and Character Error Rate (CER). More recently, a third model has been trained based upon a ground truth of 100,000 words and we hope to get a CER of under 10%, which should be good enough to begin to work meaningfully with the outputs.
The pilot project has been a great success in showing that HTR can work on difficult handwritten texts. Going forward, the key questions for us at The National Archives are:
- How can we make the process less labour intensive?
- What are the best ways to exploit the output – utilising its strengths in terms of scale whilst acknowledging it will not be perfectly accurate?