Part 1 of this blog introduced Alan Turing’s paper on ‘The Applications of Probability to Cryptography’, explained the Vigenère cipher, and ended with some intuition about how to crack the simpler Caeser cipher using letter frequencies. This blog will combine these ideas with basic probability to demonstrate how we can crack what was once considered an indecipherable encryption method with a pencil, paper and scissors.
I’ll first introduce Reverend Thomas Bayes (c.1701-1761). In addition to being a Presbyterian minister, Bayes was an amateur mathematician, elected to the Royal Society in 1742. (You can see his letter of invitation and other items in this online exhibition curated by the Royal Society). Bayes is most famous for his paper, ‘An Essay towards solving a Problem in the Doctrine of Chances’, which was published posthumously after his death (you can find his will in our collection).
Despite the fact that his proposed theorem was very controversial among mathematicians for more than 250 years, Bayes’ Theorem is now a powerful tool used in all manner of areas. Its power is in being able to be applied to diverse problems such as identifying the prevalence of waterworlds in the universe, quantifying the effectiveness of a vaccine, and, of course, code cracking.
The general principle of Bayesian probability is that you start with a prior probability of some hypothesis (this is the controversial part because it can be subjective and doesn’t involve the data being analysed), and then you update it with data, or evidence, to get a posterior probability of the hypothesis.
Before publishing his book ‘A Brief History of Time’, Stephen Hawking was told that each formula he included would reduce his sale by half. As a Bayesian, I would therefore start with a prior probability that half of my readers will stop when they reach the first formula. I could then survey readers (I won’t!) to see if they got past about half way down this blog. I would then combine my prior probability with the data, and calculate my updated posterior probability. Hopefully this will show that readers of The National Archives’ blog are made of sterner stuff.
In Part 1 we looked at the Caeser cipher, which worked by shifting all of the letters of a message by a set amount, and the Vigenère cipher is made of several of these.
Figure 1 shows a 9 x 10 grid of letters. These are a reshuffling of a 90-character message which begins ‘D K Q H S H Z N M P R C V X U H T E A Q X H P U E P P S B K T W U J A G D Y O J’.
Turing created the columns of this table by putting letters 1, 11, 21, 31, 41 etc. into column 1; letters 2, 12, 22, 32, etc. into column 2, and so on.
Now we have 10 Caesar ciphers to solve, one for each column. This is an extraordinary 141,167,095,653,376 (2610) possible key combinations to try in a brute force attack.
To solve the Caeser cipher we look at frequent letters and infer that they are Es, As, Ss, or Ts. Then use a linguistic approach, like a crossword, to make other words, until we have enough to know the key value. The same approach can work here. For each column we will find the most likely shift value using probability as our guide.
Turing defines probability as follows: ‘The probability of an event on certain evidence is the proportion of cases in which that event may be expected to happen given that evidence.’ According to Turing’s statistical analysis of English language from Part 1, if you picked a letter at random from a piece of text there is a 116 in 1000 chance of picking an E, but only a 2 in 1000 chance of a Q or a J. This will be obvious to anyone who has played Scrabble. The rules of probability mean that if I were to pick two letters the probabilities are multiplied. So the chances of picking two Qs is just 4 in 1,000,000.
With this in mind, which of these sets of letters seems most likely to come out?
a) ‘D R X T T P B W T’
b) ‘C Q W S S O A U S’
c) ‘O C I E E A M H E’
If you answered c), you are right. There are six vowels, including three Es, and the other letters are relatively common. Answer a) has no vowels and an X, while b) includes a C, W, and Q, good scorers in Scrabble, as well as three Ss.
The sets of letters are the result of shifting the letters in column 1 backwards by 0, 1 and 15 positions, respectively. The best (and correct) answer equates to a cipher key of ‘P’, and while the other two are not necessarily wrong they are less likely. By starting with ‘P’ in column 1 we’ve already reduced the number of combinations to try to 269, a saving of approximately 136,000,000,000,000!
Turing uses a formulation of Bayes’ theorem based on odds. The odds of rolling a six (on a standard die) are ‘5 to 1 against’ because there is one of way of getting a six, and five ways of not getting a six. We start with a theory that the encryption key for a column is, say, ‘B’. Without any knowledge of how encryption keys are chosen, the prior odds are always the same: 1/25 (one way to choose ‘B’, 25 ways to choose another letter).
We then gather some evidence, in this case the encrypted letters in the column, and calculate the probability (shortened to ‘P’ in the formula below) of seeing that evidence in the cases of the theory being true or false. This gives use odds in favour of the theory which when multiplied by the prior odds gives us the posterior odds of our theory.
The first letter of column 1 is ‘D’ and if the key is ‘B’ this was originally a ‘C’. By Turing’s analysis a ‘C’ has a probability of 0.021 (21 in a 1000). If the key is not ‘B’ then the original letter was anything but a ‘C’. By the laws of probability the probability of all events must sum to 1, which means the probability of ‘not C’ is 1-0.021. There are 25 alternative key values so we divide this number by 25. Since this is dividing the bottom term, the 25 moves above the line in the ratio. This makes the odds of the evidence equal to:
This calculation can be performed once per letter, based on the values in the frequency table, and stored in a table. Then for each column we shift the letters one value at a time, and for each resulting set of letters multiply the odds values of each letter together. The larger the result, the more likely the key.
Multiplication without a calculator is slow when working at great speed and scale. There is one more tool in the box to simplify things. The logarithm, or log, is the reverse of ‘power’, i.e. the log (base 10, to be precise) of 10 to the power of 3 (10 cubed) is 3.
These days the log function can be found on a calculator, but in Turing’s day the log of a number was found by looking it up in tables of numbers. Anyone who went to school earlier than the 1990s may be having nervous flashbacks right now!
There are two behaviours that make logarithms a useful tool for this problem. Firstly, they shrink large numbers – notice how as powers of 10 grow (10, 100, 1000, 10000, …) their logs grow much more slowly (1, 2, 3, 4, …). Secondly, the log of two numbers which are multiplied together is the same as adding the log of those numbers. For example, log(10 x 100) = log(1000) = 3, while log(10) + log(100) = 1 + 2 = 3.
Turing actually came up with his own unit, the deciban, which he describes as a more convenient unit than the ban. He doesn’t tell us what a ban is but does say it was named in honour of the town of Banbury. He does explain the deciban though: ‘A deciban is a unit of evidence; a piece of evidence is worth a deciban if it increases the odds of the theory in the ratio 101/10 : 1’ (roughly 1.25 : 1). He actually uses ‘half-decibans’ as they apparently halved the effort needed in practice. A bit of mathematical manipulation gives us a simple conversion formula, so a number in half-decibans is just 20 times the log of the number.
So the odds of ‘C’ in half-decibans: 20 x log(0.53) = -5.41.
The decimal is insignificant for our purposes and dropping it saves effort, so ‘C’ is worth -5 half-decibans.
Now instead of multiplying odds we can add half-decibans stored in a table.
Turing went further and calculated multiples (he calculates up to ‘5 times’), so if, for example our text contained three Cs, we would look in the ‘3 times’ column.
The rounding happens after the multiplication, so the answer is -16 (3 x -5.41 = -16.12), not -15. Part of this table is shown in figure 2.
For column 1, key ‘B’ gives the following scores which add up to -12:
Similarly the scores if the key is ‘P’ (the right answer) add up to 44 (or 43 if you use the 3x column value, 29, for ‘E’):
Finally, Turing reveals a highly practical code cracking tool. To use the grid of scores efficiently he suggests manufacturing an apparatus (out of card) which consists of a similar grid layout (26 rows, 5 columns). You will need one per column of cipher text. For column 1 (‘D R X T T P B W T’) he cuts holes in the first column from the right of rows D, R, X, P, B, W, and another whole in column 3 (from the right) for row T.
We then methodically slide this card down the grid of numbers one row at a time. For each row, which represents the cipher key, we add up the numbers in the windows. It is really important to remember that as the card slides down we are moving backwards through the alphabet, because we’re decrypting the key. After 11 moves we are at, 26-11, letter 15 of the alphabet, which is ‘P’. It takes 25 moves to get to ‘B’. The highest total gives us the most likely key.
In practice there is no need to add them all up. When the key is ‘B’ there are four negative numbers including a -26, and nothing positive above 7. Experience shows that is unlikely to be a winner. For letter ‘P’ there is a 29 and only two small negatives, so we know without doing the sums that this is high scoring.
To recreate the original text we now reverse each letter in the column by the appropriate number of step, in this case 15 (P is the 16th letter of the alphabet). In some cases there will not be a single clear winner, but we will have two or three candidates to choose from. Then we just use trial and error to see which creates the most likely message. It is still a lot better than trying all 26 letters!
Try it yourself
If you would like to try out the code cracking tool we have included two work sheets for you to print out: Letter scoring worksheet and Letter window worksheet. To find the key for a column, simply cut holes in the sheet without numbers for each letter in that column (remember to use the correct column if a letter appears multiple times), then move it down the sheet with numbers and tot up the values that show through the holes. To save paper, a sheet of transparent acetate would be even better!
Hopefully this blog has given you an insight into how the relatively simple mathematical tools were used to help crack codes in the Second World War. Enigma was far more complex than Vigenère, but these tools helped reduce the size of the task and give the people of Bletchley Park a fighting chance of cracking the codes.
Simpler style to introduce to children is the military ‘Five Unit Murray Code’. Their eyes light up and this leads them on to more complex variations.
This Turing paper (https://arxiv.org/pdf/1505.04714.pdf) is very interesting in an historical perspective, but it lacks references.
When it says “Nearly all applications of probability to cryptography depend on the factor
principle (or Bayes’ Theorem).” is it common knowledge at that time or is it a finding of Turing that is demonstrated by the remaining part of the paper ?