Considering the word ‘digital’ makes up one third of my job title, you might consider it an oversight to have not used it once in my last blog entry. That may be an indication of variety in work – or perhaps forgetfulness – but I will make up for that today when I consider the union and mutually-beneficial relationship between open data and the archiving of datasets.
A colleague recently asked me what a dataset is. This is not necessarily as simple a question as it may appear: I side-stepped. I think the answer really lies in the term ‘structured data’; namely that the text of an email could not necessarily be termed a dataset, but a table in a PDF, a CSV (Comma-Separated Values) file, or an XML (Extensible Mark-up Language) file could. Also, a dataset can be analysed quantitatively, and is not a collection of different electronic files, like a database. However, the discussion rages and the terminology is so uncertain that the Government has even consulted on the word itself.
The UK Government has made clear its wish to be ‘the most open and transparent government in the world’ and, led by a highly-motivated and ambitious team in the Cabinet Office, have produced a consultation on open data and are considering how to implement a public ‘right to data’, all while continuing to register datasets on data.gov.uk at a rate of knots. 1
But what has this got to do with The National Archives? Aside from our own contributions to the Transparency Agenda and the related, and sometimes over-lapping, information management standards excellently explained last week by Robert Johnson 2, it is part of our remit to ensure access to government datasets for future generations.
The traditional route of accessioning records is that a government department created a (usually paper) file – for whatever administrative purpose – stored it for 30 years and, having considered it’s suitability regarding personal or public sensitivity, released the record to The National Archives for preservation and presentation to the public. The prevalence of the web has slightly altered the process because government interaction with the public is widespread and public records are more accessible than ever, a point discussed by my colleague Claire Newing earlier this week.
Allied with that is the acknowledgement of the benefit of publishing public data, initially in the late stages of the last government, and encouraged and developed under the auspices of the Transparency Agenda by the current Coalition. By publishing datasets – anything from a basic table on a PDF file to an RDF file, linked to other data 3 – government departments not only make public data more accessible to the public but also web crawlers, such as the one which provides for our very own UK Government Web Archive.
From our perspective using a web crawler to archive websites is a far simpler and safer process to capture records, especially datasets, than accessioning electronic records through other, offline, infrastructures. The quality assurance undertaken by the Web Continuity team also ensures that this process is as accurate as possible and continually improves. Another benefit is that web archiving removes the strain of preserving data from departments: they can continue to update their websites with new data, allowing us to capture and present previous data instances in the web archive, allowing the departments to reference archived versions of their own websites if necessary.
Evidently, there are still challenges to overcome, including the question of how people can and hope to interact with the vast quantities of data available on the UK Government Web Archive. However, with the necessity that the linked data community – and I can think of no better way of introducing you to it than linking to Jeni Tennison’s blog – places on provenance and context, the web archive will only increase in relevance.
So, if the data is suitable to be published online, and deemed to be public data, then the bonuses of publication are not just restricted to accountability, choice, and economic benefits, but also preservation and the possibility of comparing data over time.
Also, we are really keen to hear how people use archived government websites and data – have you used archived data in visualisations or research? Please let us know, or leave any other comments or observations below the line.
See, I used the word once this time.
- 1. Here I must declare my interests: I was part of the team for some months last year and was able to see what an innovative and impressive bunch they are. ^
- 2. Don’t get too distracted by the Star Wars comments. ^
- 3. See the Public Data Principles for more, especially the five stars of linked/open data. There are a few sites which explain these, but this is my favourite because it comes in the form of an informative mug. ^