In my last blog post I wrote about how the UK Government Web Archive (UKGWA) can be a useful resource for contemporary historians. It documents the changing forms of communication used by the government, and it offers an opportunity to compare an official record published at the time with a secret official record released 20 years later.
In this post I would like to highlight how data analysis of the UKGWA can be used to draw broader conclusions from the official record, particularly about language and trends. The UKGWA comprises over 3 billion URLs from around 3,000 websites, captured between 1996 and 2014. This adds up to around 90 terabytes of data and around 2.5 billion webpages. Even the most enthusiastic researcher would struggle to read that much text.
However, there are of course tools for extracting meaning from such large datasets, and thanks to the work of web science PhD student Ian Brown we were recently able to present UKGWA data at the University of Southampton’s Web Science Research Week. This gave us the opportunity to work with a group of highly talented web science PhD students, along with Justin Murphy, Lecturer in Politics and International Relations, to look into the possibility of building a tool to make better use of the UKGWA data.