Open data and archiving datasets

Considering the word ‘digital’ makes up one third of my job title, you might consider it an oversight to have not used it once in my last blog entry. That may be an indication of variety in work – or perhaps forgetfulness – but I will make up for that today when I consider the union and mutually-beneficial relationship between open data and the archiving of datasets. dataTube

A colleague recently asked me what a dataset is. This is not necessarily as simple a question as it may appear: I side-stepped. I think the answer really lies in the term ‘structured data’; namely that the text of an email could not necessarily be termed a dataset, but a table in a PDF, a CSV (Comma-Separated Values) file, or an XML (Extensible Mark-up Language) file could. Also, a dataset can be analysed quantitatively, and is not a collection of different electronic files, like a database. However, the discussion rages and the terminology is so uncertain that the Government has even consulted on the word itself.

The UK Government has made clear its wish to be ‘the most open and transparent government in the world’ and, led by a highly-motivated and ambitious team in the Cabinet Office, have produced a consultation on open data and are considering how to implement a public ‘right to data’, all while continuing to register datasets on data.gov.uk at a rate of knots.[ref] 1. Here I must declare my interests: I was part of the team for some months last year and was able to see what an innovative and impressive bunch they are. [/ref]

But what has this got to do with The National Archives? Aside from our own contributions to the Transparency Agenda and the related, and sometimes over-lapping, information management standards excellently explained last week by Robert Johnson[ref] 2. Don’t get too distracted by the Star Wars comments. [/ref], it is part of our remit to ensure access to government datasets for future generations.

The traditional route of accessioning records is that a government department created a (usually paper) file – for whatever administrative purpose – stored it for 30 years and, having considered it’s suitability regarding personal or public sensitivity, released the record to The National Archives for preservation and presentation to the public. The prevalence of the web has slightly altered the process because government interaction with the public is widespread and public records are more accessible than ever, a point discussed by my colleague Claire Newing earlier this week.

Allied with that is the acknowledgement of the benefit of publishing public data, initially in the late stages of the last government, and encouraged and developed under the auspices of the Transparency Agenda by the current Coalition. By publishing datasets – anything from a basic table on a PDF file to an RDF file, linked to other data[ref] 3. See the Public Data Principles for more, especially the five stars of linked/open data. There are a few sites which explain these, but this is my favourite because it comes in the form of an informative mug. [/ref] – government departments not only make public data more accessible to the public but also web crawlers, such as the one which provides for our very own UK Government Web Archive.

An archived dataset - can you tell the difference?

From our perspective using a web crawler to archive websites is a far simpler and safer process to capture records, especially datasets, than accessioning electronic records through other, offline, infrastructures. The quality assurance undertaken by the Web Continuity team also ensures that this process is as accurate as possible and continually improves. Another benefit is that web archiving removes the strain of preserving data from departments: they can continue to update their websites with new data, allowing us to capture and present previous data instances in the web archive, allowing the departments to reference archived versions of their own websites if necessary.

Evidently, there are still challenges to overcome, including the question of how people can and hope to interact with the vast quantities of data available on the UK Government Web Archive. However, with the necessity that the linked data community – and I can think of no better way of introducing you to it than linking to Jeni Tennison’s blog – places on provenance and context, the web archive will only increase in relevance.

So, if the data is suitable to be published online, and deemed to be public data, then the bonuses of publication are not just restricted to accountability, choice, and economic benefits, but also preservation and the possibility of comparing data over time.

Also, we are really keen to hear how people use archived government websites and data – have you used archived data in visualisations or research? Please let us know, or leave any other comments or observations below the line.

See, I used the word once this time.

9 comments

Joan says:

Sun 18 Mar 2012 at 5:18 pm

Hi just wanted to say that I like your article very much. Please keep up the good posts Thanks a ton! and Have a good day

1. Simon Demissie says:
  
  Mon 19 Mar 2012 at 12:52 pm
  
  Thanks very much for your kind comment, Joan.
  
Jeni Tennison says:

Wed 21 Mar 2012 at 12:09 pm

Nice post, and thanks for the pointer to my blog, Simon 🙂 Just thought I’d point to a couple of specific posts that might be relevant to readers:

Why Linked Data for data.gov.uk http://www.jenitennison.com/blog/node/140

explains the advantages of the linked data approach, while

Using Freebase Gridworks to Create Linked Data
http://www.jenitennison.com/blog/node/145

illustrates some of the practicalities, including the recording of provenance information. (Note that Freebase Gridworks is now Google Refine, and there plugins with additional support for generating RDF and so on.)

I hope Ross Spencer will be able to blog about machine-readable provenance trails at some point 🙂

1. Simon Demissie says:
  
  Wed 21 Mar 2012 at 1:26 pm
  
  Thanks Jeni – those links are most useful. I’ll leave that to Ross, but I’m sure he’ll be able to oblige at some point. Hope you’re well. Simon
  
Ross Spencer says:

Wed 21 Mar 2012 at 1:33 pm

Hey Jeni! – Seems I have a future topic for an article! 😀

Evelyn Hartung says:

Tue 27 Mar 2012 at 6:03 pm

My grandfather is native american & i am looking for some info. Wanted to know how to look for digital film of the archives.

Evelyn Hartung says:

Tue 27 Mar 2012 at 6:05 pm

How am i suppose to be looking for what i need to find in the archives.

1. Ruth Ford (admin) says:
  
  Wed 28 Mar 2012 at 8:00 am
  
  Hi Evelyn, I’m sorry but we will not be able to respond to research questions on the blog. Please see our moderation policy for further information. However, we have a number of research guides on our website that may help you with your next step.
  
Melinda says:

Thu 26 Apr 2012 at 12:13 am

Good job

Tags

9 comments

Leave a comment Cancel reply