How to automate web archiving quality assurance without a programmer

There are times when you wish you had an assistant to do the boring stuff for you. Or you wish there was an app that could carry out a very specific and repetitive task that you sometimes have to perform at work. For many standard repetitive tasks you can use Microsoft Excel – but what if you have to manually do the following?

  1. Visit several thousand web pages
  2. Ensure that each page has been captured for web archiving
  3. Check that all the links on a single page work
  4. Check that all the buttons and menus work
  5. Make sure all the archived pages look identical to their live copies

This was exactly what we did in the Web Archiving team as one of the strategies for checking the fidelity of our archived copy of the EUR-Lex website. While our other strategies focused on checking that we have captured content, this was concerned with access to and renderability of the archived content.

EUR-Lex is the official website of European Union legislation and case law, and other public EU documents, including the Official Journal of the European Union. It is published in the 24 official languages of the EU. In order to ensure that UK citizens continue to have access to EU laws as they applied in the UK at the point of exit, the Queen’s Printer has been given new duties to publish relevant EU law under the European Union (Withdrawal) Act 2018.

We are meeting these obligations through publishing maintained versions of EU legislation on legislation.gov.uk, while also retaining a ‘snapshot’ of all relevant documents as they appeared on EUR-Lex on exit day, on the UK Government Web Archive. The archived version is important in terms of ensuring that the full body of EU law is captured, in multiple languages, and also provides an important evidential basis for our publication. Our supplier MirrorWeb uses a range of techniques to capture this vast volume of multilingual content at very high fidelity.

In a situation like this you can either test a small sample and hope for the best, or spend six months on the task and test all the thousands of web pages. What you don’t want to do is to write a programme that does this for you, because it’s best if this task is done manually with human judgement. In any case, you might not even have a programmer in your team!

Fortunately, there is a third solution: break down tasks into smaller micro-tasks and then use browser extensions, keyboard shortcuts, or any other basic tool (such as Notepad) to automate smaller tasks. Before I delve deeper into this, please remember two things. First, almost always there’s an easier way to do something. Second, almost always there’s a free app or browser extension that does exactly what you want.

So, here is my approach to the task above. Instead of carrying out five checks on one web page and then moving on to the next, all the way to the 2,000th web page, you carry out one test at a time on all of the 2,000 web pages. Then you carry out the second test on all of the 2,000 web pages, and so on up to the fifth test.

Breaking down one job into five tasks is the first step. The second step is to figure out which tools can help you with tasks one to five. After a bit of research, I found the following:

  1. You can open thousands of web pages using a Chrome extension called Open Multiple URLs. With this you can open many URLs at a time with a couple of clicks.
  2. Use a Chrome extension called Revolver to check if the pages have been captured. I click once and Revolver switches from the first tab to the last as slowly or as fast as I want.
  3. To check if all the links on every single page work, use a Chrome extension called Check My Links. While Revolver switches between tabs, Check My Links looks for broken links on the page. Broken links are then highlighted in red.
  4. Checking if all the buttons and menus work was a tricky one! What I needed was to click on a specific menu item while Revolver switched between tabs. Unfortunately, no browser extension could achieve that, but after a bit of searching I found a little free automation programme called xDoTool that can click for you while you sit back and watch the cursor.
  5. Do all the archived pages look identical to their live copies? To check this, I put the live and archived URLs in two columns in Excel, pasted them in the right order into the Open Multiple URLs Chrome extension, and then used Revolver to switch between the live page and the archived page for hundreds of pages.
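
For readers who do have a little scripting to hand, the "right order" in step 5 can be sketched in a few lines of Python. This is only an illustration of the idea, not the method used for this project (that was Excel). The URLs are hypothetical placeholders, and the helper names `interleave` and `batches` are my own. The sketch pairs each live URL with its archived copy so that the two open in neighbouring tabs, then splits the list into small batches so that you never open thousands of tabs at once.

```python
from itertools import chain


def interleave(live_urls, archived_urls):
    """Return [live1, archived1, live2, archived2, ...]."""
    if len(live_urls) != len(archived_urls):
        raise ValueError("each live URL needs exactly one archived URL")
    return list(chain.from_iterable(zip(live_urls, archived_urls)))


def batches(urls, pairs_per_batch):
    """Split an interleaved list into chunks without separating a pair."""
    size = pairs_per_batch * 2
    return [urls[i:i + size] for i in range(0, len(urls), size)]


# Hypothetical placeholder URLs, standing in for the two spreadsheet columns.
live = [
    "https://example.org/page-1",
    "https://example.org/page-2",
    "https://example.org/page-3",
]
archived = [
    "https://archive.example.org/page-1",
    "https://archive.example.org/page-2",
    "https://archive.example.org/page-3",
]

ordered = interleave(live, archived)

# Print each batch as plain text, one URL per line, ready to paste.
for number, batch in enumerate(batches(ordered, pairs_per_batch=2), start=1):
    print(f"--- batch {number} ---")
    print("\n".join(batch))
```

Each printed batch can be pasted into a tool that opens one URL per line, so a tab switcher alternates between the live page and its archived copy.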

Although setting up this process took several days and a lot of trial and error, once it was up and running I was able to make manual quality assurance significantly faster and more satisfying.

3 comments

  1. […] Here’s my blog post on The National Archives’ website on automation without coding using… […]

  2. Chris Doyle says:

    This is a great article. Semi-automated QA with the human touch. Brilliant :)

    1. Kourosh says:

      “Semi-automated QA with the human touch” would make a great title!
