Identifying digital file formats – a collaborative effort
Have you got a file format you can’t identify? Send it to PRONOM.
PRONOM is an online file format registry, created and managed by The National Archives, containing details of more than 1,300 different digital file formats. It enables a wide range of users from across the world to find out what digital files they have and in which formats.
Knowing what file formats you have is an important step in constructing a digital preservation strategy. PRONOM can help you determine if your ‘.doc’ file is a Microsoft Word Document 97-2003 or a WordPerfect for Windows Document v6.0. This can then help you decide the appropriate digital preservation actions to be taken for the file formats under your care.
PRONOM is embedded in the majority of digital preservation systems such as Archivematica, Libnova, Preservica, Roda and Rosetta.
Every format added to PRONOM is given a Persistent Unique Identifier (PUID). Each entry can contain information about the file format, including a description and any MIME/media types (file format classifications, assigned and listed by the Internet Assigned Numbers Authority, that web browsers use to decide how to open particular files).
Currently we are focused on developing signatures, which allow formats to be identified. PRONOM uses both external and internal signatures. External signatures are based on file extensions but, as the ‘.doc’ example above shows, these are not always reliable or specific. Internal signatures are based on byte sequences within the file itself; these offer the most accurate form of identification.
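To make the difference between external and internal signatures concrete, here is a minimal sketch in Python. It is not how PRONOM stores signatures or how any identification tool works internally: the tables, labels and function names are hypothetical, and the byte sequences are simply well-known leading bytes for a few formats. It shows why a ‘.doc’ extension alone is ambiguous, while the first bytes of the file can tell the candidates apart.

```python
from pathlib import PurePath

# Hypothetical "external signature" table: extension -> candidate formats.
# One extension can map to several formats, which is why it is ambiguous.
EXTERNAL_SIGNATURES = {
    ".doc": [
        "Microsoft Word Document 97-2003",
        "WordPerfect for Windows Document",
    ],
    ".pdf": ["PDF"],
}

# Hypothetical "internal signature" table: (offset, byte sequence, label).
# These are widely documented leading bytes, not PRONOM's own definitions.
INTERNAL_SIGNATURES = [
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
     "OLE2 Compound File (container used by Word 97-2003)"),
    (0, b"\xffWPC", "WordPerfect document"),
    (0, b"%PDF-", "PDF"),
]


def identify_by_extension(filename):
    """Return every candidate format for the file's extension."""
    suffix = PurePath(filename).suffix.lower()
    return EXTERNAL_SIGNATURES.get(suffix, [])


def identify_by_bytes(data):
    """Return the first label whose byte sequence matches, else None."""
    for offset, magic, label in INTERNAL_SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


if __name__ == "__main__":
    # The extension gives two candidates; the bytes narrow it to one.
    print(identify_by_extension("letter.doc"))
    print(identify_by_bytes(b"\xffWPC" + b"\x00" * 12))
```

Real internal signatures are richer than this (they can include variable offsets, wildcards and end-of-file sequences), so treat the fixed-offset comparison here as a simplification.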
Around 70% of PUIDs in the PRONOM database have internal signatures. The more signatures there are in the database, the more accurately DROID, The National Archives' file format identification tool, can identify files, and the more useful it is to the digital preservation community.
The National Archives is committed to adding at least 100 new formats (PUIDs) to PRONOM per financial year. These are released periodically; the graph below shows the steady growth of PRONOM since 2005.
There are thousands of file formats, so we have to target our resources carefully to ensure PRONOM covers the most important formats for The National Archives and the digital preservation community. Our first priority is being able to identify all file formats transferred to us by UK Government departments. The second driving force of our research is the digital preservation community, which plays a considerable part in allowing us to reach our target of 100 new PUIDs each year. We actively encourage organisations and individuals dealing with digital records to provide information on the formats they encounter. We prioritise submissions of potential signatures over other format information.
The map below shows the wide spread of regular contributors from around the world who have played a vital part in expanding the coverage of the signatures included in PRONOM.
Contributors include The British Library, National Library of New Zealand, The Church of Jesus Christ of Latter-day Saints and many more – see a full list of contributors.
If you would like to be a contributor or you have a query, we have a submission form as well as a mailbox. If you would like us to research a particular format, the more information you can provide the better. It is particularly useful if you can provide samples of the file format you are interested in. If you know of any specifications for the format, sending us a copy of, or a link to, these documents is also very useful. The more information we receive, the quicker we will be able to add a file format to PRONOM.
We are very happy to receive signatures which have been built by interested parties. If you would like to find out more about how to develop signatures of your own, you can see guidance on how to undertake file format research.
You can also read previous blogs which provide more information about the PRONOM community and how to look at the internal code of a file to identify a binary signature: