In my last blog post I wrote about how the UK Government Web Archive (UKGWA) can be a useful resource for contemporary historians. It represents the changing forms of communication used by the government, and offers an opportunity to compare the official record provided at the time with the secret official record released 20 years later.
In this post I would like to highlight how the UKGWA can be used to draw broader conclusions from the official record, particularly in terms of language and trends, via data analysis. The UKGWA comprises over 3 billion URLs from around 3,000 websites captured between 1996 and 2014. This adds up to around 90 terabytes of data, and around 2.5 billion webpages. Even the most enthusiastic researcher would struggle to read that much text.
However, there are of course tools to extrapolate meaning from such large datasets and thanks to the work of web science PhD student Ian Brown we were recently able to present UKGWA data to the University of Southampton’s Web Science Research Week. This provided us with the opportunity to work with a group of highly talented web science PhD students – and Justin Murphy, Lecturer in Politics and International Relations – to look into the possibility of building a tool to make better use of the UKGWA data.
Web science itself is a relatively young discipline and, as often mentioned by the eminent speakers over the research week – including Professor Dame Wendy Hall, Professor Sir Nigel Shadbolt, and Professor Susan Halford – it is an intrinsically inter-disciplinary study. The basic premise of the subject, as suggested by Halford, seems to be the requirement to understand that the web is neither technical nor social, but both; it shapes and is shaped by society. Judging by the wide variety of subjects studied by the groups during research week – from tracing sales and mentions of ‘legal highs’ on the web, to querying how more diverse user groups can be encouraged to participate in crowdsourcing projects – this cross-disciplinary approach is well understood by the current cohort of students.
The group working on UKGWA data quickly set about understanding the options available to them. As they shared an interest in the language of political discourse, they decided to explore whether it would be possible to study how frequently certain keywords related to the economy (unemployment, wage, profit, etc.) were used in the UKGWA either side of the financial crisis of 2008. To do this they used the UKGWA full-text search API, as that would in essence provide a bulk result of all the text in the web archive.
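For readers who like to see the idea in code, the comparison can be illustrated with a minimal Python sketch. This is not the group's code and it does not call the UKGWA full-text search API: it assumes the page text has already been retrieved as (year, text) pairs, and the two sample pages, the function names, and the choice to count per 1,000 words are all made up for illustration.

```python
import re
from collections import Counter

# Keywords taken from the post; the crisis year marks the split point.
KEYWORDS = ("unemployment", "wage", "profit")
CRISIS_YEAR = 2008


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def count_keywords(text, keywords=KEYWORDS):
    """Count exact, case-insensitive matches of each keyword.

    Note: this is whole-word matching, so "wages" is not counted as "wage".
    """
    counts = Counter(tokenize(text))
    return {kw: counts[kw] for kw in keywords}


def frequencies_by_period(pages, keywords=KEYWORDS, crisis_year=CRISIS_YEAR):
    """Return keyword occurrences per 1,000 words before and after the crisis.

    `pages` is an iterable of (capture_year, page_text) pairs. Pages captured
    in the crisis year itself are placed in the "after" group (an assumption).
    """
    hits = {"before": Counter(), "after": Counter()}
    words = {"before": 0, "after": 0}
    for year, text in pages:
        period = "before" if year < crisis_year else "after"
        hits[period].update(count_keywords(text, keywords))
        words[period] += len(tokenize(text))
    return {
        period: {
            kw: (1000 * hits[period][kw] / words[period]) if words[period] else 0.0
            for kw in keywords
        }
        for period in hits
    }


# Hypothetical placeholder pages, not real archive content.
sample_pages = [
    (2006, "Profit rose and the wage bill grew"),
    (2009, "Unemployment rose; unemployment benefits and wage cuts"),
]

result = frequencies_by_period(sample_pages)
for period, rates in result.items():
    print(period, rates)
```

Counting per 1,000 words rather than in raw totals is one way to allow for the archive holding far more text from some years than others; a real analysis would also need to decide how to treat repeated captures of the same page.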
As ever with experimental research, things did not always proceed as seamlessly as desired; for a full ‘warts-and-all’ account of the project, please see our Hackpad for the week. However, many positives came out of the research. Firstly, some graphical depictions of the use of keywords were generated. Dr Murphy interrogated the XML data using the programming language R and generated a series of graphs, which were cross-referenced with real rates of unemployment, wages, and so forth from the Office for National Statistics. Chris Phethean created a visualisation for this data in the Shiny application and Phil Waddell wrote a theoretical piece on the impact of the research.
Secondly, a process for generating this data – and the possibility of further querying of the UKGWA in this way – has been established, and with some further work could be extended to include other subjects and specific timeframes. We found working with the representatives at Southampton enlightening, and from just one week of experimentation and interrogation we have been able to make improvements to the full-text web search and the full-text search API, and are now in a better position to understand what researchers in this field would hope for and expect.
Thanks to the work of the students and staff involved with the Web Science Research Week we have learnt more about how wider conclusions can be drawn from the UKGWA: that it can tell us more than just the words written in it. Similar work is taking place elsewhere, such as the BUDDAH project using UKGWA data, led by the Institute of Historical Research, the British Library, and the Oxford Internet Institute. These projects are creating further possibilities for contemporary historians – alongside students of politics, web science, sociology, and so forth – to better understand the web, how people use it, and the workings of government and its interaction with society.
If you are interested in conducting any research using UK Government Web Archive data please get in touch via the website or the comments section of this blog post.
Many thanks to Ian Brown, Chris Phethean, Phil Waddell, Justin Murphy (University of Southampton), Tom Storrar, Suzy Espley (The National Archives), and the Internet Memory Foundation.