Modelling our digital archival data
The National Archives’ Digital Strategy (2017-19) identifies the challenges we face as we become a second-generation digital archive. The strategy envisions an archive that is ‘Digital by Design’, able to preserve and provide access to a wide range of rich digital records which better reflect the workings of a digital government. In particular, the strategy highlights the need for a new way of providing access to digital archival records.
The digital records we’ve received so far are more diverse than physical records. They are not just documents but can include all sorts of other content, from threaded discussions using a web-based tool to video, websites, structured datasets or even computer code. These records can be complex and often consist of different components, potentially with different creators and owners. And we know that ever more specialised formats will need to be added to our collection over the coming years.
However, these new digital objects are public records in the same way as the paper files they have replaced, and we need to make them at least as accessible as analogue records are now. In doing this, we cannot simply take the standards, processes and tools we use in the paper world and apply them in the digital world – the two are significantly different.
We have started exploring how we can provide access in a new way which has been designed from the outset to fit the way our users want to work with digital records today and will support how they may want to work with them in the future. A key difference is that, while we will still offer a ‘view’ of individual records for our ‘readers’, we must also make records available for computational analysis, enabling our ‘data users’ to work with records at scale and ask very different types of research questions.
At the same time, we will actively process the collection ourselves, to enrich descriptions and contextualise the records. This activity will produce information of a different size and shape to our traditional catalogue descriptions. For example, we can imagine contextualising records through their links to other resources, often held by other institutions; enriching descriptions to make the records more discoverable; or applying probabilistic techniques that embrace the uncertainty typical of historical records.
As we’ve started work on this, one key building block we’ve identified is a model that describes our records as data. Our existing catalogue simply cannot hold the range of information we now need to manage, so we need to adapt it to accommodate a much richer understanding of our records.
In doing this, we want to learn from others with similar challenges, re-use work that has already been done and innovate to create new approaches where needed. Many data models already exist. Some of these were created with archival collections in mind (such as RiC and PREMIS) while many come from other domains with similar challenges (such as METS or FRBR). There are even models which try to bring together multiple approaches (such as EDM). Each of these offers a particular way of thinking about or describing digital collections.
We are currently looking at the kinds of information we want to hold. These range from metadata created by the originating government department or organisation, and values intrinsic to the digital objects (such as geocoding in images), to properties derived later from outside the objects themselves, whether for single objects or for aggregations. Our model will embrace uncertainty, temporal variation and user-contributed content.
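To make these kinds of information a little more concrete, here is a minimal sketch in Python of one way such a model might be shaped. It is purely illustrative and is not the model we are building: the names Source, Statement, Record and lookup, the confidence score and the validity dates are all assumptions made for this example, and the sample values are invented.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class Source(Enum):
    """Where a piece of information came from (hypothetical categories)."""
    SUPPLIED = "supplied"        # metadata from the originating department
    INTRINSIC = "intrinsic"      # values embedded in the digital object itself
    DERIVED = "derived"          # properties computed later, outside the object
    USER = "user_contributed"    # contributions from users


@dataclass(frozen=True)
class Statement:
    """One claim about a record, with its provenance and uncertainty."""
    prop: str
    value: str
    source: Source
    confidence: float = 1.0              # assumed scale: 0.0 (no trust) to 1.0
    valid_from: Optional[date] = None    # None means "no known start"
    valid_to: Optional[date] = None      # None means "no known end"

    def holds_on(self, day: date) -> bool:
        """True if the statement applies on the given day."""
        starts_ok = self.valid_from is None or self.valid_from <= day
        ends_ok = self.valid_to is None or day <= self.valid_to
        return starts_ok and ends_ok


@dataclass
class Record:
    """A record described as a set of statements rather than fixed fields."""
    identifier: str
    statements: List[Statement] = field(default_factory=list)

    def lookup(self, prop: str, on: Optional[date] = None,
               min_confidence: float = 0.0) -> List[Statement]:
        """Return matching statements, most confident first."""
        matches = [
            s for s in self.statements
            if s.prop == prop
            and s.confidence >= min_confidence
            and (on is None or s.holds_on(on))
        ]
        return sorted(matches, key=lambda s: s.confidence, reverse=True)


# Invented sample data, for illustration only.
record = Record("example-record-1")
record.statements += [
    Statement("place", "Kew", Source.DERIVED, confidence=0.7),
    Statement("place", "Richmond", Source.USER, confidence=0.4),
    Statement("custodian", "Department A", Source.SUPPLIED,
              valid_to=date(2010, 12, 31)),
    Statement("custodian", "Department B", Source.SUPPLIED,
              valid_from=date(2011, 1, 1)),
]

likely_places = record.lookup("place", min_confidence=0.5)
custodian_2015 = record.lookup("custodian", on=date(2015, 6, 1))
```

In this sketch, competing values for the same property can sit side by side, each carrying its source and a confidence score, and a statement can be limited to a period of time. That is one simple way of expressing uncertainty, temporal variation and user-contributed content in a single structure; a real model would need to go considerably further.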
If you are working on similar issues, or you have a particular interest in any of the themes outlined above, we would love to hear from you. Please either comment below or email email@example.com.