If you wanted to watch one of your old video tapes, would you be able to? Do you still watch your DVD collection, or do you prefer to stream movies now?
The technology we use at home is constantly modernising and many of our nostalgic favourites are now unusable – you would struggle to fit a video tape into your DVD player.
This problem is a familiar one when it comes to digital archiving, too. The types of files kept on a computer are evolving all the time and many ultimately become obsolete. This means that, if we are preserving a digital file from 1998, then we might need to migrate the content of that file to a new format to make sure that we can still access it in ten years’ time.
Changing a digital record in this way is an important step, but if we can change records for the better, then somebody else can probably change them for the worse too.
This is one of the biggest challenges in digital archiving: the records are intangible and so they could be vulnerable to changes that might be made without detection and on a huge scale. It means that the records you are looking at might not be the same as the records that were archived 20 years ago. In other words, history might be rewritten.
It is this challenge that ARCHANGEL, a research project funded by the Engineering and Physical Sciences Research Council (EPSRC), is investigating. The National Archives, the University of Surrey and the UK Open Data Institute are trying to answer several big questions:
- How can we demonstrate that the record you see today is the same record that was entrusted to the archive 20 years previously?
- How do we prove that the only changes made to it were legitimate and have not affected the content?
- How do we ensure that citizens continue to see archives as trusted custodians of the digital public record?
To address these questions, ARCHANGEL is exploring how we can know that a digital record has been modified and whether the change was legitimate so that ultimately it can still be trusted as the authentic record. Specifically, the project is investigating how blockchain might be used to achieve this.
Blockchain is a distributed ledger technology which is both tamper-resistant and decentralised. The ARCHANGEL project is creating a prototype using this technology which aims to enable archives to generate and register hashes of documents (similar to unique digital fingerprints) into a permissioned blockchain (in other words, one which can only be added to by authorised organisations). Where the record has been legitimately changed, hashes of the new content, alongside hashes of the code used to make the change, can also be registered on the blockchain. This would mean that whenever a digital record is modified, an audit trail is created and we are able to know exactly how a document has been edited.
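To make the idea of registering hashes and building an audit trail more concrete, here is a minimal sketch in Python. It is not the ARCHANGEL prototype: `ToyLedger`, its field names and the choice of SHA-256 are all assumptions made for illustration, and a single in-memory list stands in for a real permissioned, distributed blockchain. Each entry stores a hash of the record's content, optionally a hash of the code that produced it, and the hash of the previous entry, so altering an earlier entry breaks the chain.

```python
import hashlib
import json
from typing import Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


class ToyLedger:
    """Hypothetical append-only ledger; each entry commits to the one before it."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self) -> None:
        self.entries = []

    @staticmethod
    def _hash_body(body: dict) -> str:
        # Serialise with sorted keys so the same body always hashes the same way.
        return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))

    def register(self, record_id: str, content: bytes,
                 tool_code: Optional[bytes] = None) -> dict:
        """Append an entry for a record; tool_code is the code used for a migration."""
        prev = self.entries[-1]["entry_hash"] if self.entries else self.GENESIS
        body = {
            "record_id": record_id,
            "content_hash": sha256_hex(content),
            "tool_hash": sha256_hex(tool_code) if tool_code is not None else None,
            "prev_entry_hash": prev,
        }
        entry = dict(body, entry_hash=self._hash_body(body))
        self.entries.append(entry)
        return entry

    def history(self, record_id: str) -> list:
        """Return the audit trail (all entries, oldest first) for one record."""
        return [e for e in self.entries if e["record_id"] == record_id]

    def is_intact(self) -> bool:
        """Recompute every entry hash and check each link to the previous entry."""
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if body["prev_entry_hash"] != prev:
                return False
            if self._hash_body(body) != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True
```

In this sketch, registering the original file and then a migrated version under the same `record_id` gives a two-entry `history`, and editing any stored entry afterwards makes `is_intact()` return `False`. A real deployment would additionally rely on many organisations holding copies of the ledger, which a single Python object cannot show.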
Our approach will result in the creation of many copies of a persistent and unchangeable record of the state of a document. This record will remain verifiable many years into the future using the same cryptographic algorithms.
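Verification of this kind amounts to recomputing the hash of the record you hold today and comparing it with the value that was registered. The sketch below assumes SHA-256 and a hex-encoded registered hash; the function names are hypothetical and not taken from the project.

```python
import hashlib
import hmac
from typing import BinaryIO


def hash_stream(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hash a file-like object in chunks so large records need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_registered_hash(stream: BinaryIO, registered_hex: str) -> bool:
    """Return True only if the record's current hash equals the registered one."""
    return hmac.compare_digest(hash_stream(stream), registered_hex.lower())
```

Because the comparison depends only on the bytes of the record and the published algorithm, anyone holding a copy of the registered hash could run the same check independently.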
As this approach matures, we hope that the ledger would be maintained collaboratively by distributing it across many participating archives both in the UK and internationally, as a promise that no individual institution could attempt to rewrite history. This technology could transform the sustainability of digital public archives, enabling archives to share the stewardship of the records and, by sharing, guarantee the integrity of the records they hold.
Technology is consuming archives; we can only rely on improvements in data protection.
Thanks for sharing. I asked Jamie Fawcett this question too: should we treat the possibility of a hash collision as a substantial risk, or is it rarely seen in practice? (shattered.io might offer some perspective on this issue.)
Currently we are not too concerned about checksums clashing. This answer to the question of why clashes haven’t been found is reassuring: https://crypto.stackexchange.com/questions/47809/why-havent-any-sha-256-collisions-been-found-yet. But we are thinking about the situation where we would need to move to a different algorithm as part of the project, as we know this has happened in the past, for example with MD5.
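As a rough illustration of what moving to a different algorithm could involve, one option is to store the algorithm name alongside each digest, check the record against the old digest, and only then register a digest from the newer algorithm. This is a sketch under assumptions, not a description of what the project does: the `algorithm:hexdigest` format and the function names are made up for the example.

```python
import hashlib


def tagged_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return a digest prefixed with its algorithm name, e.g. 'sha256:ab12...'."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


def verify_tagged(data: bytes, tagged: str) -> bool:
    """Recompute the digest with whichever algorithm the tag names."""
    algorithm, _, expected = tagged.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == expected


def migrate_digest(data: bytes, old_tagged: str, new_algorithm: str) -> str:
    """Confirm the content still matches the old digest, then hash it anew."""
    if not verify_tagged(data, old_tagged):
        raise ValueError("content no longer matches the registered digest")
    return tagged_digest(data, new_algorithm)
```

For example, `migrate_digest(data, old, "sha3_256")` would yield a SHA3-256 digest only if `data` still matches the SHA-256 digest in `old`. Keeping both digests on record, rather than discarding the old one, preserves the link back to the original registration.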
Fossology (used to?) use sha1.md5 to store files to decrease the likelihood of collisions on disk. I can’t offer an opinion on whether it’s superior to using sha256 alone.