PRONOM: A database centenary

Recently The National Archives released the 100th version of PRONOM, but many people may never have heard of it. You may have used it unknowingly when using a file format identification tool such as DROID or Siegfried; it might also be built into your digital preservation system. In this blog post we want to explore PRONOM through the lens of the past, present and future.

What is PRONOM?

PRONOM is a database in which we put information about various file formats. For each file format we try to record some key descriptive information, the typical file extensions used and, most importantly, the information that can be used to identify it.

A screenshot containing some of the information we have collected about png files taken from the PRONOM website

It may be new to some people, but all files and systems on a computer are made up of binary (the 01100010 01101001 01101110 01100001 01110010 01111001 often seen in images associated with computers), which in turn can be written in a more compact notation called hex, short for hexadecimal (or 68 65 78).
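As a small illustration of the relationship between text, binary and hex, here is a minimal Python sketch. The function names to_binary and to_hex are made up for this example, and it assumes plain ASCII text.

```python
def to_binary(text: str) -> str:
    """Return each ASCII character of text as an 8-bit binary group."""
    return " ".join(f"{byte:08b}" for byte in text.encode("ascii"))


def to_hex(text: str) -> str:
    """Return each ASCII character of text as a two-digit hex value."""
    return " ".join(f"{byte:02x}" for byte in text.encode("ascii"))


print(to_binary("binary"))  # 01100010 01101001 01101110 01100001 01110010 01111001
print(to_hex("hex"))        # 68 65 78
```

The two printed lines match the binary and hex values quoted above: they are simply the words "binary" and "hex" written out byte by byte.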

Software is programmed to understand the structure of the file format and therefore show a file as an image, and not a stream of binary. A direct comparison can be seen with the human brain reading a letter. We can easily discern which text relates to which information, understanding that the top right will contain the sender’s address and that the end of the letter will be signed by the sender. So in the same way that software will render files, a human will read and understand a physical record.

A diagram showing the similarity of the human brain to software programming, featuring a letter from the HO13 collection

The PRONOM team analyses file formats using a hex editor, which opens file formats in such a way that you can only see the hex. The work involves looking for recurring patterns in the hex, which we call magic bytes. The work also involves looking at the file format specifications, finding file formats using GitHub or the Wayback Machine, and working with external institutions on unusual patterns or unidentified file formats within their collections.

How a file looks in a hex editor as opposed to being opened in image rendering software. As a side note it is worth following #dogsofdigitalpreservation on Twitter

Repeating patterns within the hex are a core part of PRONOM research. Once we have collected this information, software tools such as DROID can search through the hex of files and identify them.
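The idea of identifying a file from its magic bytes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the SIGNATURES table and the identify function are hypothetical, they only check the very start of the data, and they are not how DROID or PRONOM signatures are actually implemented. The byte values for PNG, GIF and PDF are the well-known opening bytes of those formats.

```python
from typing import Optional

# Hypothetical, heavily simplified signature table:
# format name -> bytes expected at the very start of the file.
SIGNATURES = {
    "PNG": bytes.fromhex("89504e470d0a1a0a"),
    "GIF": b"GIF8",
    "PDF": b"%PDF-",
}


def identify(data: bytes) -> Optional[str]:
    """Return the first format whose magic bytes open the data, else None."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


# A few bytes that begin like a PNG file, and some unrecognised bytes.
png_like = bytes.fromhex("89504e470d0a1a0a") + b"\x00\x00\x00\rIHDR"
print(identify(png_like))          # PNG
print(identify(b"unknown bytes"))  # None
```

Real signatures are richer than this sketch: they can describe patterns at other positions in the file and allow for variation, which is where the research effort goes.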

A screenshot of the file format identification tool DROID. DROID is free to download here and you could try running it over your files to see what you have

Why is this work important? For the same reason it is essential for a conservator to know if a physical record is vellum or paper, or if the materials used for writing are watercolour or iron gall ink. Digital preservation experts have to know what file formats are in their collection and what software can open them in order to effectively preserve them for future generations.

Past

Around the year 2000 we realised as an institution that we had a wealth of knowledge about the conservation and preservation of physical items – papers, parchments, photographs and much more – but we didn’t have the same level of knowledge about the preservation of digital items. We therefore decided to create a registry of file format information to build that knowledgebase to help us with understanding associated digital preservation risks, in order to aid digital preservation planning. PRONOM was born.

We made PRONOM publicly available in 2005, along with the first release of DROID, our file format identification tool. From the outset we realised we couldn’t build this knowledgebase on our own, so we invited experts from other institutions to contribute.

To date, we have received contributions from over 80 institutions, ranging from archives and libraries, through to museums and educational institutions, and even private companies who are interested in the long-term preservation of digital records. PRONOM is used all over the world, mostly in the fields of digital preservation, information management and digital forensics.

Present

Now we hear from our file format analysts currently conducting file format research.

Kathryn:

As well as being a vital part of digital preservation, file format research can actually be a pleasant way to spend an afternoon. Finding patterns which you can use to formulate a PRONOM entry – to some it might sound boring, but many people find the opposite.

Files range in complexity; they can be very simple (thank you to all the developers who include file type and version number information in their file headers!), but there is always a file format which will engage anyone from the novice to the expert.

The file type and version has been clearly added to the hex values right at the start of the file. Joy!

If you are someone who enjoys close work and looking for patterns, few things are as satisfying as getting a positive identification of your file samples using a signature you’ve just created, and that’s before a signature has gone out to the wider world to help with file preservation! I realised recently that I was spending my days looking at byte sequences, and my nights doing cross stitch. There are similarities between the two: looking for and following patterns meticulously, with a view to a completed picture, or signature. If you like cross stitch, you might like file format research!

Cross stitch: more similar to file format research than you’d think

We also receive queries from those who have files that are in the PRONOM database, but which aren’t being identified in DROID. This typically involves getting samples of the unidentified files from the enquirer and looking at them in a hex editor. Often there is a problem with the offset of the bytes in the sequence. For example, we may have said that the file byte sequence starts at the absolute beginning of the file, but ‘in the wild’, different software and operating systems have led to one extra byte at the beginning of the file, so the file is not identified. Or the byte sequence itself might need tweaking to allow for slight variations: for example, some files might say ‘Adobe Air 2’, but an example is found which says ‘Adobe Air 2.0’. We tweak the signature file to allow for these variants and make test signatures. These changes go live in the next PRONOM release. It is rewarding work, and very gratifying to be able to tell the enquirer that we have identified and fixed their file format problem.
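Both kinds of fix described above can be sketched in Python. This is a hedged illustration rather than PRONOM's signature syntax: the function names, the max_offset parameter and the trailing null byte in the pattern are all assumptions made for the example.

```python
import re


def matches_near_start(data: bytes, magic: bytes, max_offset: int = 0) -> bool:
    """True if magic begins at any offset from 0 to max_offset inclusive."""
    return any(data.startswith(magic, offset) for offset in range(max_offset + 1))


# Accept either 'Adobe Air 2' or 'Adobe Air 2.0'.
# The null byte after the text is a made-up terminator for this sketch.
VARIANT = re.compile(rb"Adobe Air 2(\.0)?\x00")


def matches_variant(data: bytes) -> bool:
    """True if the data opens with either accepted spelling of the text."""
    return VARIANT.match(data) is not None


# One stray byte in front of the expected sequence:
sample = b"\x00MAGIC rest of file"
print(matches_near_start(sample, b"MAGIC"))                # False
print(matches_near_start(sample, b"MAGIC", max_offset=1))  # True

print(matches_variant(b"Adobe Air 2\x00"))    # True
print(matches_variant(b"Adobe Air 2.0\x00"))  # True
```

Loosening a signature like this is a trade-off: the wider the allowed offset or the more variants accepted, the greater the chance of matching a file of a different format, which is why test signatures are checked against samples before release.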

Andrey:

Working to improve the PRONOM database by conducting file format research has been a hugely rewarding experience since I joined the team last September. Coming from a technical background with my undergraduate degree in electronic engineering, I relished the computer programming course modules, throughout which I was frequently challenged by the necessity to debug broken code. The ability to detect patterns through scrupulous analysis has proven an immensely useful transferable skill while working on creating new and improving existing file format signatures.

With some existing understanding of file format structure but no prior experience of using a hex editor, I have been surprised by the fascinating information that is contained in files, often hiding in plain sight. My experience with personal computers stretches back only to the days of Windows 3.1, while a multitude of submissions from external users have referenced formats for the likes of Atari ST, MSX and Commodore 64. Frequently, legacy graphic formats have been developed using well-known compression algorithms such as run-length encoding and Lempel-Ziv-Welch, but custom-made decompression tools are common too, offering us a useful reminder that in the recent past, storage was limited, and clever workarounds were especially needed to maximise the potential of machines and networks with limited capabilities.

On a personal level, it is tremendously satisfying to contribute to an ever-improving database which feeds into DROID, a ubiquitous tool developed by fellow National Archives colleagues, embedded in digital preservation systems such as Preservica and Archivematica. PRONOM’s ability to identify files through recurring signatures using DROID makes it an invaluable resource in file format identification, which is critical for the digital preservation community to retain a collective footprint of early records at risk of technological obsolescence. Characterisation and preservation planning are essential to making critical information available for future generations, and it is humbling to know that the work we do as a team can make the difference between important files being accessible or inaccessible in the long term.

Andrea:

File format research may seem like a daunting task at first, but researching a variety of file formats comes with many unexpected benefits. With little regulation on the number of file formats, especially pre-1980s, the extensive number of peculiar file formats has been an exciting part of my research.

With that in mind, the most valuable insight I’ve gained in my role as a File Format Analyst has been to see how technology has evolved even during its early modern developments. The progressions particularly linked to the gaming industry have had a valuable impact on the variety of formats which we use today. With my previous background in the history of art, I was intrigued to learn about the ASCII and ANSI art movements, which spread widely during the bulletin board systems era. Due to the limitations of computers, ASCII art was widely used in advertisements as well as among enthusiasts in online forums at the time. Today, an archive for ASCII art can be viewed online, which often contains fun and impressive imagery created using ASCII characters.

ASCII art could be interpreted as the predecessor of emoticons, which we now use on a daily basis. With emoticons being the fastest growing language today it’s intriguing to see how their origins developed. ASCII art shows us how visual language is vital in our culture and the importance of creating images even with limited resources. In its early days, ASCII art allowed fun images and expressions to be produced for a more enhanced user experience when playing computer games. Later, this evolved to express how one feels in text-speak, making communication clearer when only using limited technology.

File format research inspired me to create my own ASCII art. Go ahead and try it!

With PRONOM celebrating its 100th release, file format research shows us how PRONOM is an invaluable tool with which to tell the story of technology and preserve its history for future generations.

Future

We are excited for the future of PRONOM, and we are currently working on a project to transform the database from the old system to a linked graph database. We hope to release an updated and more accessible interface to the website on the front end and will have more flexibility with what we can do with the data on the backend. The graph database will allow us to link up with other file format registry systems across the world such as the Library of Congress’ Sustainability of Digital Formats and Wikidata. Our team is expanding, allowing for more releases and more file format research.

PRONOM already has a large following within the digital preservation community, but we want to increase awareness within other user groups. PRONOM is open source and accessible, and we rely on all the dedicated researchers who submit contributions, suggest adjustments and report errors, so we hope you too will join the conversation.
