Today marks the official release of a new digital preservation tool developed by The National Archives: CSV Validator version 1.0. It follows on from well-known tools such as DROID and the PRONOM database, used in file identification (and discussed in several previous blog posts). The release comprises the validator itself but, perhaps more importantly, also includes the formal specification of a CSV schema language for describing the allowable content of fields within CSV (Comma Separated Value) files, which gives something to validate against.
The primary purpose of the validator is to check that the metadata we receive with files for preservation meets our requirements. CSV can be produced very easily from standard text editors and spreadsheet programmes, tools already familiar to most computer users, whereas the technical knowledge required for XML has proved a stumbling block. On the other hand, this ease of editing means that errors can easily creep in. Historically there has been little standardisation of CSV files: although the name suggests that a comma is used as the separator between fields, TSV (tab separated values) is also quite common, and other separators are far from rare. The closest thing to a true standard is RFC 4180, so we have adopted it as our default for the files we receive. Even then, there is no agreed way to describe the content of fields: for example, to specify that one field should contain a date (within a particular range), another an address, another a catalogue reference, and so on.
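To give a flavour of the kinds of field-content checks described above, here is a short Python sketch. This is not the CSV Validator itself, nor its schema language; it simply illustrates checking RFC 4180-style CSV (as handled by Python's standard `csv` module) against some hypothetical column rules: a catalogue reference pattern, a non-empty description, and a date within a particular range.

```python
import csv
import io
import re
from datetime import date

# Hypothetical rule: catalogue references look like "WO 95/1"
# (a letter code, a space, then slash-separated numbers).
CATALOGUE_REF = re.compile(r"^[A-Z]+ \d+(/\d+)*$")

def check_row(row):
    """Return a list of error messages for one row (empty if valid)."""
    errors = []
    if not CATALOGUE_REF.match(row["catalogue_ref"]):
        errors.append(f"bad catalogue reference: {row['catalogue_ref']!r}")
    if not row["description"].strip():
        errors.append("description must not be empty")
    try:
        d = date.fromisoformat(row["date"])
        # Hypothetical rule: dates must fall in 1914-1920.
        if not (date(1914, 1, 1) <= d <= date(1920, 12, 31)):
            errors.append(f"date out of range: {row['date']}")
    except ValueError:
        errors.append(f"not a valid ISO date: {row['date']!r}")
    return errors

def validate(text):
    """Validate CSV text; return {line_number: [errors]} for bad rows."""
    # The csv module's default dialect follows RFC 4180 conventions
    # (comma separator, double-quote quoting).
    reader = csv.DictReader(io.StringIO(text))
    return {n: errs for n, row in enumerate(reader, start=2)
            if (errs := check_row(row))}

sample = (
    "catalogue_ref,description,date\n"
    "WO 95/1,War diary,1914-08-04\n"
    "bad ref,,1930-01-01\n"
)
print(validate(sample))
```

Running this reports nothing for the first data row but flags all three problems on the second (a malformed reference, an empty description, and an out-of-range date). A schema language lets depositors and archives agree such rules up front instead of encoding them ad hoc in scripts like this.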