‘Baffled by archives’ is a three-part blog series addressing the issues and challenges faced as we progress with Project Alpha, which was introduced in Building an archive for everyone. Part one considered why visitors to The National Archives on the web are confused. This part begins to look at ways of helping them out.
How do we reduce bafflement?
The feeling at The National Archives is that bafflement would be reduced if the user’s mental model were aligned with how archives work. Users would better understand what they are navigating and why they can’t see what they expected to see, and could use that model to inform how they explore, search and interpret results.
But are we after a user experience that conveys that model, that gently trains the user in the ways of archives? Or are we looking for ways to mitigate the fact that many users don’t have the model when they arrive in the middle of a confusing information landscape? Are we looking to bring in additional or alternative means of exploration and discovery to make up for the lack of that mental model? Can we do both of these things: make it work for the model-savvy researcher, and the user who brings a different set of assumptions? We can start by exploring what tools and materials we have available to us and how to put them to better use.
The National Archives publishes a lot of material that attempts to bridge this model gap. Besides editorial content on how archives work in general, there are more than 300 published research guides that explain the records of (e.g.) the High Court of Admiralty or even UFOs. The research guides provide links to common starting points in the hierarchy, suggest search terms and query strategies, and offer pre-canned, partially-filled advanced searches. These guides can launch the user on their voyage – if the user reads them.
Research shows that many people aren’t aware of them, choose to ignore them, don’t use them, or don’t apply them once they have set off exploring. This is understandable: keeping an instruction manual open while you navigate a website is not a normal thing for web users to do in the 21st century, even if they are aware of the purpose and existence of a research guide relevant to their interest. There is also live chat, and telephone support. Making use of all these things requires the user to somehow recognise that they are baffled, that help is available, and that they can seek guidance in one or more of these forms.
The audience for archives is more committed and persistent than most, and is prepared to accept that their search work will be different and more open-ended. Often that exploration and unearthing against the odds is part of the user’s motivation in the first place. Discovery and exploration are inherent in the domain – it’s not a service answering a clearly defined goal.
In other domains the need for such comprehensive support would be an admission of design failure. That’s certainly not the case for archives.
However, there is still an assumption that search and browse are systems that should stick to existing conventions: an advanced search form in an archive should look like an advanced search form everywhere else. On that assumption, if many users need a lot more educational material to realise the potential of the user interface in the archive context, then the right thing to do is to invest in producing that educational material and in real-time human support services.
Is there an alternative? Instead of supporting the design of search and browse with copious educational material and services, can we design it differently, so that it doesn’t rely on this external support but provides the user what they need without it: different interactions that support and educate during search and browse interactions?
To help think about this, we’ll take a detour into the digital. But first a little terminology: the two lowest levels in The National Archives hierarchy are piece and item.
A piece is usually something you could order, or perhaps hold in your hand: ‘A piece can be a folder, file, volume or box of documents.’
An item is below this level (and many things are not described at the item level) and could be an individual page within a bound volume, or a file within a folder – generally something that lives alongside other items within a piece, and often could not be physically separated from the piece – you can’t order just one page of a ledger.
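As a rough illustration of these two levels, the sketch below models a piece that may or may not have item-level descriptions. The class names, fields and the references such as "EX 1/1" are hypothetical and chosen for this example; they are not The National Archives' actual catalogue schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    # Something within a piece, e.g. one page of a bound volume.
    reference: str
    description: str


@dataclass
class Piece:
    # The orderable unit: a folder, file, volume or box of documents.
    reference: str
    title: str
    # Many pieces have no item-level description, so this is often empty.
    items: List[Item] = field(default_factory=list)

    def orderable_reference(self, item: Item) -> str:
        # An item cannot be ordered on its own; you order the piece it lives in.
        if item not in self.items:
            raise ValueError("item does not belong to this piece")
        return self.reference


# Hypothetical example data.
ledger = Piece("EX 1/1", "Ledger")
page = Item("EX 1/1/1", "Page 1 of the ledger")
ledger.items.append(page)
box = Piece("EX 1/2", "Correspondence 1948-1952")  # described at piece level only
```

Asking for the page resolves to the ledger's reference, which mirrors the point that you can't order just one page of a ledger.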
Digital and digitised
Digital can be used somewhat loosely in this context. It could mean archival descriptions that are available and navigable on the web (for example, available in The National Archives Discovery interface, an electronic catalogue). Or it could mean material that has also been digitised – you can actually see the content online, photographed, as well as its archival description. The former aids a prospective in-person visitor to the archives in understanding what is available, and in ordering material for viewing when they visit. The latter – digitised content – might mean that the user can do much more online, and explore more and varied content than they could in a planned visit. Many users do this, and the Discovery interface lets you filter results to digitised material only. User research makes clear that many visitors expect to see everything, assuming that it is all digitised.
The affordances of digitised content are different from the affordances of paper and the physical presence of the material: seeing the true sizes of pieces and series and whole departments, and having a suspicion that this next box looks promising from a peek under the lid. It is not a given that digitisation helps users better understand the nature and extent of an archive and make sense of it. It may offer a false perspective – everything becomes a digital object in a viewer, pixels on a screen, levelled in some way.
An archive could be neither digital nor digitised – it could have material that is shelved but yet to be thoroughly described in an electronic system (even if there are finding aids they may be in paper form only). A visit to that archive could be very productive, and it may be the basis for years of research activity, but searching or browsing a web interface isn’t going to reveal much. This scenario is out of scope, so everything we talk about is digital in the first sense (i.e. has an electronic navigable description), but only sometimes digitised (i.e. you can see the content in your browser).
Even in the ideal case where everything that is described is also digitised, the user may be no less baffled than when trying to make sense of description on its own; the need for a mental model of organisation doesn’t go away just because you can now see the leaves the rest of the tree is leading to.
The piece Correspondence 1948–1952 now goes a level deeper, into a hundred-image sequence of the contents of that box; it doesn’t follow that individual intellectual objects are more clearly delineated or given their own findable identity as web resources that could feature in search results. However, it gives us more ways in which they could be delineated and analysed by human or machine processes, to enrich the archive and make things more findable.
Many National Archives records do go down to piece level, some to the item level, so when digitised there is a distinct intellectual creation (which might be an official form, for example) that is now visible, as well as described in summary.
Regardless of the possibilities afforded by some of the archive being digitised, it’s clear that any attempt to reduce bafflement cannot rely on techniques that are only available for digitised parts of collections. One reason is that only a small proportion of archives is digitised, and we need to help everyone without creating radically different and non-transferable discovery experiences across collections. Another is that a mixture of digitised and non-digitised collections can skew perceptions of the archive and introduce significant bias.
This operates on several levels. Something digitised at piece level might mean just that those hundred images of letters in Correspondence 1948–1952 are now visible online. That’s good, if the user can find them, but adding pictures alone does not make them any easier to find. The user experience at the piece level can be very different, but no new information has yet been created that would help a user arrive at that piece.
But now they are digitised, they could be described at a deeper level. Distinct parts could be identified, labelled, their extents within documents marked so they can be viewed individually and become findable web resources in their own right, at the item level and even below, into the digital object. Where individual letters were not described, there is now the means to do so. The letters could be transcribed. If they are typescript, it’s probably quite easy to transcribe them with machines. If they are handwritten, it’s rapidly becoming easier to do that with machines too, or they could be transcribed by humans.
Once the text is available, natural language processing techniques can be applied. Named entities can be recognised to generate tags automatically, perhaps from controlled vocabularies, aligning with authority files. Sentiment or meaning can be mined from the text. With recognised entities we get an explosion of discoverability; tagged items feed their tags into pages that aggregate by tag; the number of connections between objects explodes, the ability to follow trails and concepts in the content increases dramatically. This is the principle behind the generous interfaces approach.
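The step from recognised entities to pages that aggregate by tag can be sketched in a few lines. This is a minimal illustration, not a description of any production system: the vocabulary, the letter identifiers and the substring matcher standing in for a real named-entity recogniser are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical controlled vocabulary: lower-case surface form -> authority label.
VOCABULARY = {
    "high court of admiralty": "High Court of Admiralty",
    "london": "London",
}


def recognise_entities(text: str) -> Set[str]:
    # Stand-in for a real named-entity recogniser: plain substring lookup.
    lowered = text.lower()
    return {label for term, label in VOCABULARY.items() if term in lowered}


def build_tag_pages(transcripts: Dict[str, str]) -> Dict[str, List[str]]:
    # Invert item -> tags into tag -> items: one aggregation page per tag.
    pages = defaultdict(list)
    for item_id, text in sorted(transcripts.items()):
        for tag in recognise_entities(text):
            pages[tag].append(item_id)
    return dict(pages)


# Hypothetical transcribed letters.
transcripts = {
    "letter-1": "Sent from London to the High Court of Admiralty.",
    "letter-2": "Arrived in London on Tuesday.",
    "letter-3": "No recognised names appear here.",
}
tag_pages = build_tag_pages(transcripts)
```

Here the "London" page gathers two letters that had no connection in the hierarchy, while the third letter, which yields no tags, gains no new routes in. That asymmetry is worth keeping in mind when weighing the benefits of this kind of processing.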
From being invisible below the piece level of archival description, content enriched in this way is now just as potentially findable as if it were a regular web page; its text is available, it is summarised, its subject matter is marked by tags or keywords – this is now playing the game that search engines play.
Dangers of selective processing
There is a danger here. We are introducing a new form of selection pressure. There has already been selection pressure at work for the material to form an archive in the first place – deciding what is significant, that it should be kept and not discarded, deciding how much effort should be spent on describing it etc. This bias is inherent in the very nature of an archive. But now we introduce a new selection pressure, on top of that – the material’s suitability for the various human and automated processes we want to throw at it.
Printed material offers up more to machines than manuscript material; modern English material is better suited to entity extraction than older content in different languages. The way machine learning models have been trained influences the types of generous interface we can eventually produce from those processes. User interfaces driven by topics, people, geography, dates and so on may be unconsciously skewed.
Easily found material acquires a greater importance and influence than material that machines find harder to unlock. This problem is present when we use machine learning techniques on non-digitised content too, when we are processing the descriptive text and other metadata.
Perhaps this is unavoidable and the benefits of these approaches are so great that we acknowledge that they also introduce new bias and move on, but we must be mindful of it. We (Digirati) have applied precisely these techniques on the Indigenous Digital Archive project, generating an explorable online archive from a mass of thinly described sources. In our Science in the Making pilot for the Royal Society, human taggers and transcribers help generate more routes to the content; this archival content dispenses with its existing hierarchical organisation almost entirely, and hopes that the actions of enthusiastic taggers applying Library of Congress Subject Headings will provide organisation that assists everyone else.
The answer to bias issues might lie in the interpretative content, the research guides or their successors, and in conveying in the user interface some notion of why some things are more findable than others. Human cataloguing, machine learning and crowdsourcing may all play a part in why some items float to the surface more than others, so we can look towards presenting some evidence of the processes driving findability.
Applying digital techniques across the whole archive
We can take some of the ideas explored in making generous interfaces from digitised archives (i.e., from their actual content) and apply them to the textual content of finding aids (descriptive text already in electronic form, in the catalogue), and to any structured metadata and linking properties attached to those records. The National Archives acknowledges this goal – ‘the ability to link, group or retrieve objects of whatever kind (documents, web pages, people, places, events) in any number of meaningful contexts: by topic, by provenance, by named entity, by popularity, by similarity etc.’
This technique is applicable across the whole archive, not just the digitised material. Natural language processing can recognise entities mentioned in archival description; known authorities (people, places, etc) can be augmented by additional recognised authorities in the archival description text to generate new aggregations and forge new links from one item to another, and to external resources, offering more routes for the explorer of the archives to follow hunches and trails in the data.
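One way to picture the new links is as pairs of records that mention the same entity. The sketch below assumes the entities for each record have already been gathered, whether from catalogued authorities or from recognition in the description text; the record identifiers and entity names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def link_by_shared_entities(
    entities_by_record: Dict[str, Set[str]],
) -> List[Tuple[str, str, Set[str]]]:
    # Propose a link between every pair of records with at least one entity in common.
    links = []
    for a, b in combinations(sorted(entities_by_record), 2):
        shared = entities_by_record[a] & entities_by_record[b]
        if shared:
            links.append((a, b, shared))
    return links


# Hypothetical records: each set merges known authorities with entities
# recognised in the archival description text.
entities_by_record = {
    "record-A": {"London", "Royal Navy"},
    "record-B": {"Royal Navy"},
    "record-C": {"Bristol"},
}
links = link_by_shared_entities(entities_by_record)
```

Comparing every pair is fine for a handful of records but grows quadratically; at catalogue scale an index from entity to records would be the more practical route to the same links.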
We have the intellectual structure of the tree itself. We have the text of archival description. We have (sometimes) additional metadata. Where the content itself can be mined because it is digitised, we have more sources of text to apply these techniques to, working towards the hyperlinked archive. Cloud services offering face recognition, comprehension of semi-structured tabular data, and other services add more tools for increasing the knowledge that systems have about the content of the archive, and provide raw material for new kinds of user interface.
Can we make these work to support and grow the user’s mental model of the archives, rather than simply provide an alternative to it?
Both Science in the Making (SITM) and the Indigenous Digital Archive (IDA) are alternatives to the hierarchy: IDA because no such hierarchy exists and we’re trying to generate organisation, navigation and discovery experiences through a combination of natural language processing and crowdsourcing; SITM because it’s an experiment in alternative discovery interfaces for items that are already individually identified elsewhere in a traditional ISAD(G) description, and are amenable to crowdsourced tagging and transcription. These approaches only offer clues and suggestions; they are only beginnings. They sidestep the potential bafflement and the lack of a mental model of archives, offering alternative ways of exploring instead. This can’t be the whole story in a re-imagining of the archive; sidestepping is not the answer.
One approach may be to start adding these generous interface techniques to augment the hierarchical description, offering alternative schemes alongside the existing knowledge organisation. The Discovery application has some elements of this: for example, an item tagged with ‘atomic bomb’ by a user, and the use of facets (although they only appear when the search is constrained to The National Archives’ own material). These are additions to the ways of exploring and following ideas: entity- or concept-driven cross navigation augmenting the structure provided by the hierarchy. These are devices that are familiar to users from many different domains, especially e-commerce.
Returning to the issue of bias, these devices are available in differing degrees for different material. Is there a danger that providing these seemingly familiar mechanisms prevents the mental model of archives, which might be essential to successful exploration, from being established, or hides it? Are we seeking user experience solutions to convey and reinforce the mental model of archives, or user experience solutions that make up for the user’s possible lack of a model but then fail to take advantage of the existing intellectual structure? It’s not necessarily a bad thing to have multiple independent mechanisms for exploration, but can we make something better where these things are not independent, but reinforce each other? A possible answer is for the user interface to take a more active role in assisting the user. What might that mean?
In the next post we will explore some of these ideas.