SD – Ahead of Hack on the Record held at The National Archives back in March – the results of which you can see on our Labs website – I discussed with colleagues in the Advice and Records Knowledge department the possibility of pitching interesting and appropriate documents or record series to the developers attending the event. One suggestion regarded the catalogue data for BT 31, a series which contains the files of dissolved companies.
The above graph is one product of the work done by Matthew Pearce, from The National Archives’ Standards team, who attended the Hack Day as a developer and chose to work with the BT 31 data. The information – following a recent cataloguing project – is fairly substantial, including companies’ names and years of incorporation, which gave Matthew the opportunity to create various representations of the data. Matthew will now take up the story.
MP – As you can see from the company records chart above, the bulk of the records in BT 31 are from the period 1860-1930, after which the data will be found elsewhere. In total the dataset contains information on over 180,000 companies, with in excess of 600,000 different words used in their names.
That might be small beer on the internet, but it is far too much to scan for interesting trends by eye. The question then is what techniques to use to find unusual events. Clearly we can count up the number of times a keyword appears. This, however, captures volume of use alone, not events.
A second method is to look at the behaviour of a keyword over time, working on the premise that noteworthy historical events will cause peaks and troughs in the use of keywords. To give the reader some intuition about what this means: the following chart shows the number of companies with ‘Electric’ in their name divided by the total number of keywords in that year.
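For readers who want to try this themselves, the series behind a chart like this could be computed roughly as follows. This is a minimal sketch rather than the code used at the Hack Day: it assumes the catalogue data has already been reduced to (company name, year of incorporation) pairs, treats every whitespace-separated word in a name as a keyword, and the function name `keyword_share_by_year` is invented for illustration.

```python
from collections import Counter


def keyword_share_by_year(records, keyword):
    """Return {year: share} for one keyword.

    `records` is assumed to be an iterable of (company_name, year) pairs;
    that layout is an assumption, not the real BT 31 catalogue format.
    The share is the number of companies whose name contains the keyword,
    divided by the total number of keywords (name words) seen in that year.
    """
    hits = Counter()    # companies per year whose name contains the keyword
    totals = Counter()  # all name words per year, used to normalise
    target = keyword.lower()
    for name, year in records:
        words = name.lower().split()
        totals[year] += len(words)
        if target in words:
            hits[year] += 1
    # Skip years with no words at all to avoid dividing by zero.
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}
```

Dividing by the yearly total matters because the number of registrations itself changes a great deal over the period, so a raw count would mostly track the overall volume of records rather than interest in the keyword.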
Thomas Edison invented an improved filament for the electric light bulb in 1880. What’s interesting, looking at the timeline of other developments in the history of the light bulb, is that the companies data reveals which stage was commercially decisive in opening up the market.
Now we can identify other such unusual behaviour through use of information entropy: a measure of how unpredictable a keyword’s use is over time. Essentially, more pronounced peaks make for a lower information entropy, as it is easier to guess where the mass of keyword use is. A flat graph makes for a high entropy, because the mass is evenly spread out. Using a metric based on this idea, we can generate a list of keywords that might be interesting. Here’s the top ten from one such metric:
0 – Cycle
1 – Picture
2 – Syndicate
3 – Exploration
4 – Coffee
5 – Cinema
6 – Transport
7 – Rubber
8 – Motors
9 – African
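The exact metric behind this list is not spelled out here, so the following is only a sketch of the general idea, using plain Shannon entropy over a keyword's yearly counts. The function names and the `min_total` cut-off are assumptions made for illustration.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a keyword's yearly counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total  # share of the keyword's use in this year
            entropy -= p * math.log2(p)
    return entropy


def normalised_entropy(counts):
    """Entropy scaled to [0, 1]: 1.0 is perfectly flat, 0.0 is a single spike."""
    if len(counts) < 2:
        return 0.0
    return shannon_entropy(counts) / math.log2(len(counts))


def rank_by_peakedness(series_by_keyword, min_total=1):
    """Order keywords from most peaked (lowest entropy) to flattest.

    `series_by_keyword` maps each keyword to a list of yearly counts, all
    covering the same years. `min_total` is a hypothetical cut-off that
    drops keywords used fewer times than that overall.
    """
    scored = [(normalised_entropy(counts), keyword)
              for keyword, counts in series_by_keyword.items()
              if sum(counts) >= min_total]
    return [keyword for _, keyword in sorted(scored)]
```

A keyword spread evenly across every year scores 1.0, while one whose use falls entirely in a single year scores 0.0. A cut-off such as `min_total` is probably needed in practice, since a word that appears only two or three times will look sharply peaked by chance.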
Taking ‘Cycle’, we can see that there was something of a boom in the industry between 1890 and 1900.
Again innovation was a contributing factor: ‘In 1888, Scotsman John Boyd Dunlop introduced the first practical pneumatic tire, which soon became universal’, along with some other developments you can read about on Wikipedia. Again this is interesting not just for the development of the technology, but for the point at which it became ready for a mass market, with companies stepping up to create and supply a bicycle craze in the 1890s.
One blog post is far too small a space for all the stories waiting in this dataset. ‘Gold’ yields a tale of booms in Australia, America and Africa; ‘picture’, the birth of an entertainment industry; ‘motor’ and ‘rubber’ together track the development of modern transportation and the industries that supply it. I just hoped to outline here one algorithmic approach to history using the official record as data.
SD – What I find interesting about Matthew’s work on the subject is that it demonstrates one of the main outcomes one hopes for from an event like the Hack Day, showing that new stories and insights into history can be uncovered when technical or mathematical skills are used. Furthermore, by using catalogue data to show the records in a new light he is in a sense mirroring what many of our other researchers do here at The National Archives: using a set of information that was created for one – usually administrative – reason for a different research purpose.
Also, as Matthew suggests, one blog post is not enough space to do justice to his work, so we hope to update you on developments in the future. We would also like to hear from any readers who have conducted similar research: if you have, please tell us of your experiences below.