Today is the very first International Digital Preservation Day, organised by our colleagues at the Digital Preservation Coalition. The day is a celebration of all things related to Digital Preservation so this post celebrates something enormous – literally – that we achieved earlier this year.
Since 2003 The National Archives has been taking regular ‘snapshots’ of UK central government websites and making them available through the UK Government Web Archive (UKGWA). The web traffic generated by the UKGWA is several times greater than that of all of The National Archives’ other web services put together, so we have always worked with external partners who have the expertise to manage web harvesting and replay on this scale.
Until July 2017 that partner was Internet Memory Research (IMR) (formerly known as the Internet Memory Foundation), one of the first organisations to specialise in web archiving. IMR built a dedicated data centre to be able to manage the traffic produced by us and their other clients, and to hold the volume of data harvested from the UK Government web estate. By 2015 this was approaching 100TB of data. What IMR had never been tasked with was long-term preservation of this data, which is the responsibility of The National Archives. Some way had to be found to copy this mammoth amount of data from IMR’s data centre in Paris to our site in Kew.
The data itself is stored in ARC files (then the standard format for web archive data), which are saved in 100MB chunks as they are generated during the harvesting of sites to be archived. We initially attempted to encrypt the files and transfer them directly over the Internet, but we soon realised that this was an unreliable process: slow, and prone to errors requiring frequent restarts. A dedicated line would have been faster and more reliable, but considerably more expensive. We decided that the fastest and most cost-effective method was to transfer the data on physical media using a courier. The National Archives’ standard medium for physical data transfer is currently 2TB USB 3 hard drives, so this is what we used.
Between 2015 and 2017 approximately 120TB was transferred in this way, using around 70 drives. Once it arrived at our Kew site the data was copied to tape for long-term bit-level preservation. Tape, while convenient for preservation, is not suitable for rapid access to large amounts of data, so the original hard drives used for the transfer were kept for day-to-day use.
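To give a sense of scale, the figures above can be turned into a quick sanity check. The Python sketch below is illustrative only: it assumes decimal units (1TB = 10^12 bytes), that every drive is filled completely, and that every ARC file is exactly 100MB, none of which will hold precisely in practice. The function names are hypothetical.

```python
TB = 10**12  # decimal terabyte (assumption: vendor-style units)
MB = 10**6   # decimal megabyte


def ceil_div(numerator: int, denominator: int) -> int:
    """Integer division rounded up, avoiding floating point."""
    return -(-numerator // denominator)


def min_drives(total_bytes: int, drive_bytes: int) -> int:
    """Fewest drives that could hold total_bytes if each were filled completely."""
    return ceil_div(total_bytes, drive_bytes)


def approx_file_count(total_bytes: int, chunk_bytes: int) -> int:
    """Rough number of files if every file were exactly chunk_bytes long."""
    return ceil_div(total_bytes, chunk_bytes)


total = 120 * TB
print(min_drives(total, 2 * TB))           # lower bound on 2TB drives
print(approx_file_count(total, 100 * MB))  # rough count of 100MB ARC files
```

Under these assumptions the lower bound is 60 drives and roughly 1.2 million files. The figure of around 70 drives actually used would be consistent with drives not being filled to capacity.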
This was the state of affairs when the UKGWA contract was retendered in 2016. In line with The National Archives’ (and UK Civil Service) policy of ‘cloud first’, our procurement requirements stated that the web archive must be hosted in the cloud. Now we had to transfer that 120TB of data from Kew to the cloud; specifically, Amazon Web Services.
Our new partners MirrorWeb are experts at managing web archives in the cloud. They decided that the best way to do the transfer was using an Amazon Snowball. This is a completely self-contained computer with a huge amount of storage in a rugged case, so it can be transported by courier without packaging; even the delivery note is internal and displayed on the inbuilt screen. Any data transferred to the device is encrypted in transfer.
We had a goal of completely loading the data in a fortnight. Some rough calculations had shown MirrorWeb that, with eight USB drives feeding the Snowball in parallel, and the expected USB 3 transfer speeds, we would need two Snowballs. Our team had some previous experience using Snowballs, and knew that the fans could be too loud for an open plan office, so we arranged a lab just for the Snowballs. As it turned out the noise levels vary considerably by machine, and these two weren’t as loud as some.
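The kind of rough calculation described here might look something like the sketch below. It is not MirrorWeb's actual working: the per-device capacity of 80TB and the duty-cycle parameter are assumptions made for illustration, and the function names are hypothetical. Only the 120TB total and the fortnight target come from the account above.

```python
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte
SECONDS_PER_DAY = 86_400


def required_rate_mb_s(total_bytes: int, days: float, duty_cycle: float = 1.0) -> float:
    """Average aggregate rate (MB/s) needed to move total_bytes within `days`,
    when copying is only running for `duty_cycle` (0-1] of the time."""
    copying_seconds = days * SECONDS_PER_DAY * duty_cycle
    return total_bytes / copying_seconds / MB


def devices_needed(total_bytes: int, usable_bytes_per_device: int) -> int:
    """Fewest transfer devices needed on capacity grounds alone."""
    return -(-total_bytes // usable_bytes_per_device)


total = 120 * TB
# Round-the-clock copying for a fortnight.
print(round(required_rate_mb_s(total, days=14), 1))
# Copying for only half of each day doubles the rate required.
print(round(required_rate_mb_s(total, days=14, duty_cycle=0.5), 1))
# Hypothetical 80TB of usable space per device.
print(devices_needed(total, 80 * TB))
```

With these inputs the average rate needed is a little under 100MB/s if copying never stops, and an assumed 80TB device capacity gives two devices. Real planning would also need to allow for drive swaps, restarts and the gap between nominal and sustained USB 3 speeds.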
The Snowball comes with a single 10Gb network connection, so MirrorWeb purchased and configured two PCs customised for high speed input/output (I/O). Each would, in theory, take eight USB hard drives in parallel and pass the data straight through to the Snowball. In practice, they seemed to struggle with more than four or five drives at a time, eventually slowing to a crawl or killing the copy processes. It was hard to debug such a low-level problem: eventually (and rather experimentally) we decided to cut our losses and try again with a different version of Linux. We swapped from Ubuntu to CentOS, and from then on everything just worked: it was simply a question of feeding the system with new hard drives on a regular basis. We were initially quite cautious about running at the full input rate of eight drives at a time, but eventually did reach it, and the transfer completed within the allotted time. What had taken us two years to transfer in was transferred out in two weeks.
From there the two Snowballs were sent off to Amazon, and within a matter of weeks the UKGWA was being served from the cloud.
We relaunched the UKGWA on 1 July 2017 without any break in service to the main web archive access. Alongside moving to the cloud we made some other improvements which we discussed in our earlier post. We are committed to continuous improvement throughout our contract with MirrorWeb.
Are you a UKGWA user? We would be very grateful if you could complete our user survey to give us your feedback. It will take around ten minutes to complete.
Thank you to Graham Seaman, Web Archive Collections Manager – who managed the data transfer process – for writing the bulk of this post.