Spiders on the web: Archiving UK Government websites

The UK Government Web Archive is one of the world’s largest and most used. Not only do we take regular snapshots of websites to guard against contemporary government information being permanently lost, we also use the collection to help our colleagues across government in their efforts to keep historical information truly accessible through the principle of Web Continuity.

Central to this is having an effective and consistent method for capturing this content.

There are different approaches to web archiving, each with its own advantages. At The National Archives, we use a method known as ‘remote harvesting’, performed for us by our contractors, the not-for-profit Internet Memory Foundation. Using software rather lovingly known as a ‘crawler’ or ‘spider’, we capture content as it appears on the public web, so that future users can see not only the information the government puts on the web, but also the context in which it was presented. That context can have important implications for the message a website is trying to convey, as illustrated above by an example from a themed collection, taken during the height of 2010’s volcanic ash cloud crisis.

A crawler is a little like a human web user interacting with a series of webpages. Much as you may have reached this page by clicking on a few links, the crawler identifies the links on a page and follows them. Crucially, though, it also captures all the data it finds along the way, ‘freezing’ each page as it appears at that moment. Each page or file on a website is likely to contain a series of links, each link being unique at that point in time, pointing to yet more pages and files, and so on.

Of course, this process works a lot faster than any human could – many crawls we perform are done at a rate of two of these interactions per second! The crawl of a given website stops when it runs out of new links to follow and, finally, a navigable, static version of the website is added to the web archive.
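For readers who like to see the idea in code, here is a minimal sketch of the follow-the-links loop described above. It is an illustration only, not the crawler our contractors actually run: the names `crawl`, `LinkExtractor` and `SITE` are made up for this example, and the ‘website’ is a small in-memory dictionary standing in for real HTTP requests. The default `delay` of half a second mirrors the two-interactions-per-second rate mentioned above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
import time


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, delay=0.5):
    """Follow links breadth-first from `seed` and return {url: html}.

    `fetch` is a hypothetical callable that returns the HTML for a URL,
    or None if the page cannot be retrieved.
    """
    snapshot = {}          # the 'frozen' copy of every page captured
    seen = {seed}          # every URL already queued, to avoid loops
    queue = deque([seed])
    while queue:           # the crawl stops when no new links remain
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        snapshot[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target, _ = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
        time.sleep(delay)  # pause between requests
    return snapshot


# A made-up three-page site; /missing is linked to but does not exist.
SITE = {
    "http://example.test/": '<a href="/about">About</a> <a href="/news">News</a>',
    "http://example.test/about": '<a href="/">Home</a>',
    "http://example.test/news": '<a href="/about">About</a> <a href="/missing">Gone</a>',
}

archive = crawl("http://example.test/", SITE.get, delay=0)
```

In this sketch the crawl stays within the site only because the stand-in `fetch` returns nothing for unknown addresses. A real crawler would need an explicit rule about which domains are in scope, and would store far more than the page text.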

This method gives us a way of capturing large amounts of information quickly and cost-effectively. In fact, through this method, we have accumulated more than 1 billion web pages and documents from our part of the web.

Being the permanent custodians of government information, we must ensure that we do everything possible to capture all the information the government presents on the web. We do a thorough check over the resulting archived website to make sure that everything is as it should be, using both automatic tools and manual techniques. When the crawler doesn’t identify a link correctly (for example, because it’s contained within a part of the page which is impenetrable to it), we often have a chance to fix the problem.

This is necessary in no small part because, while crawlers have made great advances in capturing increasingly complex web content, the web’s furious pace of change means that the tools are necessarily reactive and have to be honed through continuous development.

But why don’t we just take back-up copies of a website from a file system, rather than going through this seemingly more complicated process? In fact, reconstructing a website from its component parts, piece by piece, is far more complex and labour-intensive. Add to that the scale of the task we undertake (about 750 websites captured on a regular basis) and the challenge becomes enormous.

Our method makes sure that we can capture content quickly so we are able to archive web content before it risks being lost forever.
