How to move a 120 TB web archive to the cloud in two weeks

Building a resource as big as the UK Government Web Archive takes a lot of storage space. Comprising over 5,000 websites from 1996 to the present, as well as tweets and videos from government social media accounts, the data footprint of the archive in 2018 is over 120TB – far bigger than the average consumer hard-drive size of between 500GB and 1TB.

This creates challenges for the engineers who need to manage and manipulate that data – something MirrorWeb discovered in 2017, when we were awarded the contract to relaunch the archive as a more modern, cloud-based service.

To address these challenges and implement the version of the archive you see today, we had to think outside the box in a couple of places, and even design our own tools and workflows, to make the process as fast and effective as possible.

Why archive in the cloud?

First, though: why would MirrorWeb and The National Archives go to the trouble of moving the UK Government Web Archive to the cloud in the first place?

One of the reasons is the sheer volume of data itself. As an archive grows, it becomes less and less sustainable for an organisation like The National Archives – or its archiving partner – to keep investing in new infrastructure to accommodate this growth.

However, when you source virtual infrastructure from the likes of Amazon Web Services (AWS) – the cloud platform used by massive web brands such as Netflix and Airbnb – it becomes trivial to add new storage whenever you need it. (That’s not to say cloud providers don’t rely on physical hardware – they simply have the economies of scale to offer almost unlimited capacity.)

Cloud can also make web-based services faster and more reliable. Put in the simplest terms, physical hardware – like servers and hard drives – can be overloaded or fail. Cloud infrastructure, on the other hand, tends to have a higher level of redundancy built in – so if there’s ever a problem with a hard drive, server or even data centre, your services can simply be resumed elsewhere with minimal disruption.

Our first challenge: moving the data

However, much as we knew the UK Government Web Archive would benefit from moving into the cloud, we still needed to solve the problem of getting it there in the first place. For one thing, transferring 120TB of data from one place to another across a network can take a long time. What’s more, a transfer on this scale can be expensive (due to network costs) and even a potential security risk without the right security controls in place.
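To give a rough sense of why a network transfer on this scale is slow, here is a minimal back-of-the-envelope sketch. The link speeds and the 80% efficiency factor are illustrative assumptions, not figures from the actual project, and `transfer_days` is a hypothetical helper name.

```python
def transfer_days(size_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the days needed to move `size_tb` terabytes over a network link.

    `link_mbps` is the nominal link speed in megabits per second.
    `efficiency` is an assumed fraction of that speed actually achieved
    once protocol overhead and contention are taken into account.
    """
    size_bits = size_tb * 1e12 * 8                # decimal terabytes -> bits
    effective_bps = link_mbps * 1e6 * efficiency  # usable bits per second
    seconds = size_bits / effective_bps
    return seconds / 86_400                       # seconds in a day


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only for illustration.
    for mbps in (100, 1_000, 10_000):
        print(f"{mbps:>6} Mbps: {transfer_days(120, mbps):.1f} days")
```

Under these assumptions, even a dedicated 1 Gbps link would be busy for roughly two weeks, before any allowance for retries, verification or bandwidth charges.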

At that time, the archive was stored across 72 USB-3 hard drives in a data centre in Paris (so it wasn’t difficult for us to get our hands on the data itself). We decided our best option for the transfer was to use devices called AWS Snowballs, which connect to your local network, copy and encrypt your data to internal hard drives, and can then be shipped to an AWS data centre for transfer into the cloud.

To speed things up, we also brought along two of our own custom-built PCs that would allow us to move data from up to sixteen of the USB-3 hard drives at a time.

With this ensemble of 72 hard drives, two custom-built PCs and two AWS Snowballs, the entire process took just two weeks – not bad for more than two decades of internet history!

Our second challenge: indexing the data

Of course, moving a national archive to the cloud is a waste of time if the data isn’t accessible to the researchers, students and members of the public who need to use it. This meant our next job was to build a public-facing website where visitors could view websites and social media content in the archive in their original form, as well as search for content on specific topics.

The latter was a particular challenge for us. Search may seem like one of the most basic web technologies, but it can be complex to implement. This is because search engines don’t scan an entire set of documents one by one – they use indexes, which help them return useful results much faster. MirrorWeb was tasked with writing a complete replacement for the UK Government Web Archive’s previous search functionality, which meant we needed to index 1.4 billion documents from scratch.
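The idea of an index can be illustrated with a toy inverted index: a lookup table from each word to the documents that contain it. This is only a minimal sketch of the general technique, not how WarpPipe or the archive's search is implemented; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every term to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the first term's documents, then narrow down term by term.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    docs = {
        "page-1": "Budget statement 2010",
        "page-2": "Transport budget consultation",
        "page-3": "Health guidance for schools",
    }
    index = build_index(docs)
    print(sorted(search(index, "budget")))
    print(sorted(search(index, "budget 2010")))
```

The work of reading every document is done once, when the index is built. Each query then only touches the entries for its own terms, which is why an indexed search stays fast as the collection grows.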

Once again, we ended up using a combination of our own and off-the-shelf technologies to do this.

We struggled to find an existing tool that would meet our specific need for indexing a large number of small files, so we built something called WarpPipe. This allowed us to index all The National Archives’ documents in just ten hours – far below the timeframe of six to eight weeks we were told it would take with one of the most popular big data processing tools.

The search functionality itself is provided by Elasticsearch, which substantially improves on The National Archives’ previous search engine in terms of speed, flexibility and reliability. We now update the index once a month rather than once a quarter, for example, so it’s much faster for new archive content to show up in search results.

The outcome

So what are the implications of this work for the archive’s users?

One benefit is that in moving to the cloud, we cut down on a lot of the overhead that comes with looking after your own infrastructure. We don’t need to work hard just to keep the lights on, which means The National Archives and MirrorWeb can focus on adding to and improving the features of the archive that our users care about. These range from simple user interface improvements to more advanced capabilities, like transferring large amounts of data for large-scale research projects.

It also opens doors to a lot of new opportunities for how we develop the UK Government Web Archive in the future. With all our content now stored in a modern, flexible cloud environment, it’s become much simpler to process this data – as we did with WarpPipe – and present it in a way that helps our users unearth and understand more about our shared digital heritage.
