Quantifying digital preservation risks using statistics

Following on from the blog post last summer, The National Archives has been working with statisticians from the University of Warwick and partners from five other UK archives to develop a model that quantifies the risks to digital preservation. This has been done using a Bayesian network – an advanced statistical technique that considers the dependencies between risk events, e.g. the risk of hardware obsolescence depends on the storage medium used.

We now have a complete model and prototype web-based tool for digital archivists to use.

Creating the model

To build the model, we first had to identify the key risks to digital preservation and how they related to one another. This began as a brainstorming exercise with lots of sticky notes on walls and ended up as a mathematical network. Through discussions on scope, debates about semantics and careful logical thought, this network evolved and finally settled into what we have today:

A network diagram for the Bayesian network behind the DiAGRAM network. The nodes are linked to show how different risks influence each other, leading down to nodes for Renderability and Intellectual_Control — Caption: This image shows the full Bayesian Network of digital preservation risks modelled for this project. Bayesian Networks are statistical models which accommodate the complex relationships between variables of a system (nodes). The arrows show direction of influence. This image is copyright the University of Warwick.

Each of these 21 variables (or nodes) has a very precise definition in the model. For instance, ‘Tools_to_Render’ is defined as ‘Availability of tools and software to render the digital material and the expertise to use them’ and can take two states: ‘yes’ and ‘no’. In reality, most institutions will not be perfect but will be doing something. Every node is therefore expressed as a probability, to reflect that there is a scale between the extremes of being very vulnerable to a threat and being fully mitigated against it.
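To make the idea of a node expressed as a probability more concrete, here is a minimal sketch of a two-node fragment. The link from a hypothetical ‘Technical_Skills’ node to ‘Tools_to_Render’, and every number used, are assumptions made purely for illustration; they are not taken from the model itself.

```python
# Toy two-node fragment: Technical_Skills -> Tools_to_Render.
# The edge and all probabilities below are invented for illustration only.

def marginal_yes(p_parent_good: float,
                 p_yes_given_good: float,
                 p_yes_given_poor: float) -> float:
    """P(child = 'yes'), summing over the parent's two states."""
    return (p_parent_good * p_yes_given_good
            + (1.0 - p_parent_good) * p_yes_given_poor)

# Hypothetical conditional probabilities for Tools_to_Render = 'yes'.
P_YES_GIVEN_GOOD_SKILLS = 0.9
P_YES_GIVEN_POOR_SKILLS = 0.3

# Compare a baseline archive with a policy that raises technical skills.
baseline = marginal_yes(0.50, P_YES_GIVEN_GOOD_SKILLS, P_YES_GIVEN_POOR_SKILLS)
policy = marginal_yes(0.89, P_YES_GIVEN_GOOD_SKILLS, P_YES_GIVEN_POOR_SKILLS)

print(f"Baseline P(Tools_to_Render = yes): {baseline:.3f}")
print(f"Policy   P(Tools_to_Render = yes): {policy:.3f}")
```

In a full Bayesian network the same calculation is chained through every linked node, which is why a change to one input can shift the probabilities further down the network.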

Most of the probabilities used in the foundations of the model came from a structured expert judgement workshop. In this workshop, experts quantified their uncertainty about variables relating to digital preservation risks, and the results were then carefully combined to maximise the statistical accuracy and informativeness of the final data. A blog post explaining the protocol that was followed can be found here: Who to Trust? The IDEA Protocol for Structured Expert Elicitation.

A range graph showing the values given by 19 experts in answer to the question given in the caption. There is considerable variation in the ranges used and the confidence expressed by each expert indicated by the breadth of the range. — Caption: A range graph displaying the experts’ initial estimates to question 36: ‘Out of 1,000 born-digital files, for how many would you expect an archive to know their conditions of use?’. Each horizontal line represents the results from a different individual. For each of these horizontal lines, the leftmost point represents the individual’s estimate for the fifth percentile and the rightmost point the 95th percentile. The short vertical line in between these end points represents the individual’s estimates for the 50th percentile i.e. the median. This image is copyright the University of Warwick.
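As a rough illustration of how several experts’ ranges can be combined into one, the sketch below averages each percentile across experts with equal weights. This is the simplest possible pooling rule and is an assumption made for illustration; the method actually used in the workshop may weight experts differently. The three sets of estimates are invented, not taken from the workshop data.

```python
from statistics import mean

# Hypothetical (5th, 50th, 95th) percentile estimates, in files out of 1,000.
expert_estimates = [
    (100, 400, 700),
    (300, 500, 900),
    (200, 600, 800),
]

def pool_quantiles(estimates):
    """Equal-weight pooling: average each percentile across all experts."""
    p5s, p50s, p95s = zip(*estimates)
    return mean(p5s), mean(p50s), mean(p95s)

pooled = pool_quantiles(expert_estimates)
print(f"Pooled 5th / 50th / 95th percentiles: {pooled}")
```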

Making it usable

From the start, we had to consider how to produce something which archivists would find usable and would want to use – all the hard work that went into building the complex model would be pointless if it only ever existed in statistical code. We therefore decided to build an application which takes the user through a series of steps to help them quantify the risk for their archive, and test the impact that different policies may have.

Fortunately our colleagues at Warwick had two placement students from Monash University, Australia, who were able to design and build a Graphical User Interface for the tool (and luckily managed to return to the other side of the world just before COVID-19 hit the UK!). Since then, we have made a few changes based on feedback, and we are hoping to give the tool a final makeover at the end of the summer before we officially launch the first version.

A screenshot from a webpage. The image is made up of three panels. Top left is a slider for entering a percentage value for how good the archivist’s technical skills are, set to 89%. Lower left is a text box labelled "Enter policy name", in which the string "Policy 3" has been entered, with a button below labelled "Add policy". On the right is a bar chart with four bars. Each bar has an orange part (representing the score for intellectual control) and a purple part (representing the score for renderability).

The bars are shown in descending order of overall score from left to right. Leftmost is "Policy 1" with intellectual control of 0.33 and renderability 0.65, then "Policy 3" scoring 0.33 and 0.48, "My model" scoring 0.33 and 0.39 and finally "Policy 2" scoring 0.47 and 0.21. — Caption: A snapshot of DiAGRAM’s user interface. The left of this image shows the user setting a policy where the level of technical skills is 89%. The column graph on the right shows the risk scores for the user’s model and three different policies. The purple section of the column represents the probability of renderability, and the orange is the probability of having intellectual control. The higher the total, the lower the risk. This image is copyright the University of Warwick.

The main output from the model is a digital preservation risk score, which is currently calculated from the probability of having renderability and intellectual control, the two digital preservation outcomes identified by the project team. As with all terms used in the model, renderability and intellectual control have very particular definitions. Every model and scenario will produce a risk score, which archivists can use to compare quantitatively the impact that changing policies will have on their risk.

Three boxes, two at the top each with an arrow pointing to the lower box below. The box top left is labelled "Probability of the material being renderable", the box top right is labelled "Probability of the archivist having full intellectual control". The bottom box is labelled "Risk Score" — Caption: A graphic showing that the risk score is a function of the probability of having the two digital preservation outcomes – renderability and intellectual control.
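The exact formula for the risk score is not given here. As a sketch only, the code below assumes the score is the simple sum of the two probabilities, which is consistent with the stacked columns in the screenshot and with the order in which they are displayed. The values are the ones read from that screenshot; the function name and the summing rule are assumptions.

```python
# Values from the screenshot: (intellectual control, renderability).
scenarios = {
    "My model": (0.33, 0.39),
    "Policy 1": (0.33, 0.65),
    "Policy 2": (0.47, 0.21),
    "Policy 3": (0.33, 0.48),
}

def risk_score(intellectual_control: float, renderability: float) -> float:
    """Assumed scoring rule: add the two probabilities (higher = lower risk)."""
    return intellectual_control + renderability

# Rank scenarios from highest score (lowest risk) to lowest score.
ranked = sorted(scenarios,
                key=lambda name: risk_score(*scenarios[name]),
                reverse=True)

for name in ranked:
    print(f"{name}: {risk_score(*scenarios[name]):.2f}")
```

Under this assumption the ranking comes out as Policy 1, Policy 3, My model, Policy 2, matching the left-to-right order of the columns in the screenshot.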

For the next few months we are going to be working on our own metadata for the project, to ensure we have comprehensive guidance on how to use the tool and on the methodologies and data behind it. We are also holding virtual workshops with the Digital Preservation Coalition (the material from the first workshop can be found on the event page) and will be holding more online presentations and webinars once the final model is complete.

This project is supported by the National Lottery Heritage Fund and the Engineering and Physical Sciences Research Council.

To find out more, see the project’s web page, Safeguarding the nation’s digital memory.
