CAFED00Ds and CAFEBABEs

Wouldn’t it be cool if every digital file created could be identified with a signature or ‘magic number’ of some kind? This would make preservation, and the concept of knowing what you’ve got in order to be able to preserve it, that much easier.

By design or otherwise, for some file formats this is actually the case. The title of today’s blog post provides two such magic numbers used to identify Java class objects and Java pack200 files. These aren’t the first examples to use magic numbers but 0xCAFEBABE is the one I find the most striking as an introduction to the concept. You can read more on the origins of 0xCAFEBABE at Wikipedia.

0xCAFED00D and 0xCAFEBABE Hexadecimal Bytes

When we talk about digital signatures in the context of the work we do in Digital Preservation we are referring to these magic numbers: a unique pattern of bytes which we read from a digital file to help us identify its file format and version. The 32-bit hexadecimal numbers 0xCAFEBABE and 0xCAFED00D, stored in PRONOM and taken verbatim, would allow us to identify both file types above respectively.
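As a rough illustration of the idea, the following Python sketch reads the first four bytes of some data and looks them up in a small table. The table and the function name `identify_by_magic` are made up for this example; a real tool such as DROID works from the signature file downloaded from PRONOM, not from a hard-coded dictionary.

```python
# Hypothetical lookup table for illustration only; a real tool would load
# its signatures from PRONOM rather than hard-coding them.
MAGIC_NUMBERS = {
    bytes.fromhex("CAFEBABE"): "Java class file",
    bytes.fromhex("CAFED00D"): "Java pack200 file",
}


def identify_by_magic(data: bytes) -> str:
    """Return a format name based on the first four bytes, or 'unknown'."""
    return MAGIC_NUMBERS.get(data[:4], "unknown")
```

For a file on disk, something like `identify_by_magic(open(path, "rb").read(4))` would be enough, since only the first four bytes are inspected.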

The basic premise of the digital signatures used by DROID is to find these unique and constant patterns of bytes within a file. They are downloaded from PRONOM and stored in a ‘signature file’ which is referenced by the tool. Ideally these bytes also provide extra information which helps us to obtain more granular identifications.

The constant patterns we look for are linked together by a regular expression syntax which allows us to model the unpredictability of the linking bytes. For example, suppose one sequence at the beginning of a file tells us the primary type of file, and another further into the file stream, a snippet of XML, tells us something about the version. We would connect the two sequences using a wildcard (*) character.
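The wildcard idea can be sketched with Python's own regular expressions over bytes. This is an analogy, not DROID's signature syntax or implementation, and both byte sequences below (`HEADER` and `VERSION_TAG`) are invented for the example.

```python
import re

# Hypothetical two-part signature: a fixed sequence at the start of the file,
# then any number of unknown bytes, then an XML snippet carrying the version.
HEADER = b"EXMPL"                      # made-up primary-type marker
VERSION_TAG = b'<format version="2">'  # made-up version snippet

# re.DOTALL lets "." match every byte value, including newlines, so ".*?"
# plays the role of the wildcard (*) that links the two sequences.
PATTERN = re.compile(
    re.escape(HEADER) + b".*?" + re.escape(VERSION_TAG),
    re.DOTALL,
)


def matches_signature(data: bytes) -> bool:
    """True when data starts with HEADER and VERSION_TAG appears later on."""
    return PATTERN.match(data) is not None
```

Because `match` anchors at the start of the data, the first sequence must sit at the very beginning, while the second may appear anywhere after it.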

Theoretically this syntax allows us to create unique and robust signatures. It is important that we can identify a discrete format from all others without our signatures colliding or incorrectly identifying another format. In practice this is difficult when we first create a new signature, but the more signatures we create and the more testing we can do on larger sets of files, the better we can refine this work.

Signatures that work particularly well are those that relate directly to the programming structures output by the creating software. A struct is typically a structured type that aggregates different objects and data types of fixed and variable sizes (in bytes). These structures translate trivially to the syntax used by DROID.

An example where we can see such a structure appear in a file format is version two of the GRIdded Binary format from the World Meteorological Organisation, which describes its structure as follows:

Bytes  1-4   "GRIB" (ASCII)
       5-6   Reserved
       7     Discipline
       8     GRIB Edition (version) number (currently 2)
       9-16  Total length of GRIB message in octets

 

We ignore bytes 9-16 as they are variable and not linked to another static sequence, and we observe three bytes we can’t fix easily at positions five through seven. As such, this generates a signature which will be found at the very beginning of the file that looks like: 47524942{3}02
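To see which bytes are fixed and which vary, the sixteen-byte layout in the table above can be unpacked with Python's `struct` module. This is only a sketch: the big-endian byte order, the function name `parse_grib_header` and the dictionary keys are assumptions made for the example.

```python
import struct

# Layout taken from the table above: 4-byte marker, 2 reserved bytes,
# 1-byte discipline, 1-byte edition number, 8-byte total message length.
# Big-endian byte order (">") is an assumption of this sketch.
GRIB_HEADER = struct.Struct(">4s2sBBQ")


def parse_grib_header(data: bytes) -> dict:
    """Unpack the first 16 bytes of data into named header fields."""
    marker, reserved, discipline, edition, length = GRIB_HEADER.unpack(
        data[:GRIB_HEADER.size]
    )
    return {
        "marker": marker,          # constant: b"GRIB", i.e. 47524942
        "reserved": reserved,      # not fixed, part of the {3} gap
        "discipline": discipline,  # not fixed, part of the {3} gap
        "edition": edition,        # constant for this version: 2, i.e. 02
        "length": length,          # variable, ignored by the signature
    }
```

The two constant fields are exactly the parts that survive into the signature, and the three bytes between them become the {3} gap.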

In itself this wouldn’t be the strongest of signatures. To strengthen it the format also describes a structure that remains constant at the very end of the file:

Bytes  1-4   "7777" (ASCII)

 

This is simply coded as 37373737 using our signature syntax. Et voilà! We have a signature. It should distinguish any GRIdded Binary files from any other files, should they exist within any collection DROID is run across.
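Putting the two sequences together, a minimal Python sketch of the check might look like the following. It uses Python's `re` module as a stand-in for the DROID syntax and is not how DROID itself is implemented; the function name `looks_like_grib2` is hypothetical.

```python
import re

# Beginning-of-file sequence 47524942{3}02: "GRIB", any three bytes, then 02.
# re.DOTALL makes "." match every byte value, including newlines.
BOF_SEQUENCE = re.compile(rb"GRIB.{3}\x02", re.DOTALL)

# End-of-file sequence 37373737: the ASCII characters "7777".
EOF_SEQUENCE = bytes.fromhex("37373737")


def looks_like_grib2(data: bytes) -> bool:
    """True when data starts with the BOF sequence and ends with '7777'."""
    return BOF_SEQUENCE.match(data) is not None and data.endswith(EOF_SEQUENCE)
```

Requiring both ends to match is what makes the signature stronger than the opening bytes alone: a file that merely happens to start with "GRIB" is far less likely to also end with "7777".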

That pretty much describes what DROID is looking for when it scans digital files. When we delve into this research further we learn it isn’t quite that simple. Initially we must learn how to express structures as effectively as possible using the DROID regular expression syntax. Once we have a signature like that for GRIdded Binary, it is then extremely important to test it over as many different types of file as possible, looking for incorrect matches. We also seek feedback from external partners and users of DROID. This is an iterative process that results in the refinement of the signature over time, hopefully improving the quality of the PRONOM database as it grows to accommodate the number of new file formats created daily.

For more information, there are some useful links below. If you find yourself inspired maybe you can create some signatures for yourself using these links, a hex editor and this useful signature development utility!

’til next time!

0xCAFED00D

Additional information:

The National Archives: How to research and develop signatures for file format identification

Open Planets Foundation: Guide to writing new format signatures

Open Planets Foundation: Call for a test set of files 

International Journal of Digital Curation (IJDC): Towards the Development of a Test Corpus of Digital Objects for the Evaluation of File Format Identification Tools and Signatures

Unix File utility: http://en.wikipedia.org/wiki/File_(command)

Hex speak: http://en.wikipedia.org/wiki/Hexspeak

 

