Over at Graham’s, talk has turned to randomness. I got remarkably long-winded, but I get so excited when I get to show off my rapidly fading math chops, and it was a chance to talk about Claude Shannon, a man who in his own way did as much to revolutionize twentieth-century science as Einstein. Shannon worked at MIT with hypertext pioneer Vannevar Bush. His master’s thesis was a landmark piece of engineering theory (on circuit switching); his doctoral thesis was on theoretical genetics. During World War II, he worked on anti-aircraft detection; after the war, his publication of "Communication Theory of Secrecy Systems" marked the beginning of modern cryptography (in the civilian world, at least). His house was filled with toys: chess-playing computers, a Roman numeral calculator, juggling machines (while he was at it, Shannon had formulated a science of juggling), a machine whose sole purpose was to reach out with a skeletal arm and turn itself off.

And as a Bell Labs researcher working on the question of noisy telephone lines, he had two insights that made him the father of information theory: communication is transmitting a message from one place to another, and information is how much needs to be transmitted. First, take any communication — a typed note, a telegraph, a speech over the radio — and reduce it to a string of binary digits or bits. In "A Mathematical Theory of Communication", Shannon noted that any message could be communicated if the source rate, the number of bits per second that needed to be transmitted, was less than the communication medium’s capacity for transmitting bits. Then he borrowed a concept from physics: entropy, the amount of disorder in a system. Very loosely, the entropy of a message or language corresponds inversely with how easy it is to guess the next bit.

Take the simplest case, in which one is transmitting a string of digits and those digits aren’t interrelated (this is known as a Bernoulli process). The entropy for the system (whose possible values are indexed by `i`) is:

-Σ_{i} p(i) log_{2} p(i)

That is: for each possible state `i`, take the probability of that state times the log-base-two of that probability, sum over all states, and negate the result. (Recall that the log of a number less than one is negative; the log of 1 is 0, meaning that a stream of a single digit, “11111…” and so on into the night, has zero entropy.) This is less difficult than it might seem — if there are four possible digits I’m sending out, and each one occurs with a chance of 1 in 4, the calculations are quite simple (and if you don’t think so, just trust me):

Σ_{i=1…4} -1 * (1/4 * log_{2}(1/4)) = 4 * -1 * (1/4 * -2) = -4 * -1/2 = 2.

So that message system has “two bits of entropy” — I’m transmitting two bits of information every time I spit another digit out. That makes sense, as two binary digits are required to express four numbers: 00, 01, 10, 11.
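That arithmetic is easy to check mechanically. Here is a quick sketch in Python (the `entropy` helper is my own naming, not anything from Shannon) that applies the formula above to any list of probabilities:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum of p(i) * log2(p(i)), skipping zero-probability states."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Four equally likely digits: two bits per symbol, matching the calculation above.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# A stream of a single repeated digit carries no information at all.
print(entropy([1.0]))  # 0.0
```

Skewing the distribution drops the entropy below the uniform case, which is the bridge to the English-letter-frequency point below.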

Amazingly enough, the English language reduces to something like 1.5 bits of information per letter (meaning that English sentences are, from an information theory point of view, about 75% redundant), or less informative than my example of purely random choices from the four digits. For one thing, English is not evenly distributed; an "E" is more likely to come up than a "J". This property enables English text to compress hugely, as you can observe by downloading a ZIP file of Tristram Shandy. A simple compression algorithm, Huffman encoding (see it animated!), relies on the fact that short binary strings can be used to encode common letters.
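A workable Huffman coder is only a handful of lines. Here is a sketch (my own helper names, not any particular library’s API) that builds the code table with Python’s `heapq`, repeatedly merging the two lightest subtrees:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table: common symbols get short bit strings."""
    freq = Counter(text)
    # Heap entries are (weight, tiebreak, tree); a tree is a symbol or a (left, right) pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # degenerate case: one-symbol alphabet
    walk(heap[0][2], "")
    return codes
```

Run on "abracadabra", the frequent "a" ends up with a shorter bit string than the rare "c" or "d" — which is the whole trick.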

Further, English, unlike coin flipping or random digits, is not a Bernoulli process; if a word begins with "R", its second letter will almost invariably be "H" or a vowel, and the odds of getting a "Q" or "Z" are minimal. The simplest non-Bernoulli process is a Markov chain, in which the previous character determines the probability of the next character. English is more complicated than that; as the Combinatorial Engine or the Shannonizer demonstrate, Markov chains aren’t a very sophisticated way of modelling English, but by taking in more and more information about context, you can narrow the possibilities for the upcoming letter with greater and greater accuracy. My entry to the 2001 5k competition demonstrated some highly unsophisticated compression relying on word frequency (look at the source code), although my brutal editing errors render the information on the page suspect.
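A character-level Markov chain of this sort fits in a few lines of Python. This sketch (my own toy, nothing from the Shannonizer itself) records which characters follow each context in a sample text, then babbles by sampling successors:

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each context of `order` characters to the characters seen following it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return chain

def babble(chain, seed, length=60, rng=None):
    """Generate text by repeatedly sampling a successor of the current context."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = seed
    while len(out) < length:
        successors = chain.get(out[-len(seed):])
        if not successors:
            break  # dead end: this context never appeared in the training text
        out += rng.choice(successors)
    return out
```

With `order=1` the output is barely English-flavored gibberish; raising the order narrows the successor lists and the babble gets eerily more plausible — exactly the point about context above.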

In fact, error correction is one reason that all natural languages are so low-entropy. Redundancy is the ratio of the actual size of the message to the bits of information it transmits. We could compress down to near-zero redundancy, but at a cost; if in my coded message "QXIRE" and "QXRE" are both words that would fit in some context, a typo could be disastrous. Redundancy means that we’re not thrown for a loop about what a "refridgerator" is or who "Thms Edisn" was. In the Shannon-Weaver model of communication, every message has an encoder and a decoder, and to ensure transmission gets through over a noisy channel, the encoder can insert redundancy and let the decoder filter the message back out. That 75% isn’t just wasted noise; it’s what allows us to understand words over a noisy phone line or as we zip down the highway and see something out of the corner of our eye. Humans are pattern-recognizing animals; how else cn y rd ths?
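The encoder/decoder picture can be made concrete with the crudest scheme imaginable, a triple-repetition code. In this sketch (the "noise" pattern is my own contrivance, chosen so no triple ever loses two bits), the encoder inserts redundancy and the decoder votes it back out:

```python
def encode(bits):
    """Encoder inserts redundancy: transmit each bit three times."""
    return [b for b in bits for _ in range(3)]

def decode(received):
    """Decoder filters the redundancy back out: majority vote on each triple."""
    return [1 if sum(received[i:i + 3]) >= 2 else 0
            for i in range(0, len(received), 3)]

def noisy_channel(bits):
    """Contrived noise: flip every fifth bit on the wire (at most one flip per triple)."""
    return [b ^ 1 if i % 5 == 0 else b for i, b in enumerate(bits)]

message = [0, 1, 1, 0, 1, 0, 0, 1]
print(decode(noisy_channel(encode(message))) == message)  # True
```

The channel mangles five of the twenty-four transmitted bits, yet the message survives intact — that is the 75% doing its job, at the price of sending three bits for every one of information.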