An introduction to a few big ideas in information theory, including Shannon's information entropy, its connection to the physics notion of entropy, and codes for compressing data.
Jim Mahoney | Dec 2019
Claude Shannon defined "information entropy" as
$$ S = \sum_i p_i \log_2 \frac{1}{p_i} $$
in units of "bits of information per bit".
What the heck is all that? Well, I'm glad you asked.
Suppose I send you a sequence of symbols, and you try to guess the next one in the sequence.
Here are some examples.
Shannon's idea was that the easier it is to guess a symbol, the less information that symbol conveys.
In other words, how surprised you are by a symbol is how much information it has.
(Why are we talking about probability all of a sudden?)
Suppose the sequence is made up of pairs of bits, (0,0) or (1,1), each pair with a 50/50 chance.
(Would you ever want to send a sequence like this?)
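Here's a small sanity check I'm adding (not one of the original examples) that quantifies how much information such a paired-up sequence carries: each pair is one 50/50 choice, so it delivers about 1 bit of information for every 2 bits of data.

from math import log2
from collections import Counter
import random

# Build a sequence of bit pairs: each pair is "00" or "11" with a 50/50 chance.
random.seed(0)                                   # arbitrary seed, just for repeatability
pairs = [random.choice(['00', '11']) for _ in range(10_000)]

# Estimate the entropy per pair from the observed pair frequencies.
counts = Counter(pairs)
probabilities = [c / len(pairs) for c in counts.values()]
entropy_per_pair = sum(p * log2(1/p) for p in probabilities)

print('bits of information per pair of bits :', entropy_per_pair)       # close to 1
print('bits of information per bit of data  :', entropy_per_pair / 2)   # about 0.5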
Entropy also explains why physical systems move toward equilibrium, from less likely configurations to more likely ones.
BEFORE : less likely (as we'll see in a minute)
AFTER : more likely
If the stuff on the left stays on the left, and the stuff on the right stays on the right, then ...
If all the configurations of circles are equally likely, then the circles are more likely to be spread evenly between the two sides, simply because there are many more configurations like that.
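Here's a quick way to count, using a made-up concrete case of 10 distinguishable circles that can each sit on the left or the right: the number of configurations with a given left/right split is a binomial coefficient, and the even split has by far the most.

from math import comb

n = 10   # assumption for this illustration: 10 circles, each on the left or the right

# Number of configurations with k circles on the left (and n - k on the right).
for k in range(n + 1):
    print(f'{k:2d} on the left : {comb(n, k):4d} configurations')

# The 5/5 split has 252 configurations while "all 10 on the left" has only 1,
# so an even spread is overwhelmingly more likely when every configuration is equally likely.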
But if we want to think of $\Omega$, the number of configurations, as a physical property, we have a problem.
For real physical quantities, the combined total for two pieces comes from adding the amounts for each piece, never from multiplying them; yet the number of configurations for two pieces together is the product $\Omega_1 \Omega_2$.
So ... use a logarithm to turn multiplication into addition. 🤨
Entropy is not $\Omega$ but $ \log(\Omega) $ .
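A quick numeric illustration, with made-up state counts: put a system with $\Omega_1 = 4$ states next to one with $\Omega_2 = 8$ states and the combined system has $\Omega_1 \Omega_2 = 32$ states, but the logs just add.

from math import log2

omega_1, omega_2 = 4, 8              # made-up state counts for two subsystems
omega_combined = omega_1 * omega_2   # states multiply: 32

# ... but log(omega) adds, the way a physical property should.
print(log2(omega_1) + log2(omega_2))   # 2 + 3 = 5
print(log2(omega_combined))            # also 5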
The guy who invented the statistical version of entropy is Ludwig Boltzmann, whose definition is even carved on his tombstone.
$$ S = k_B \ln \Omega $$
Let's look a little closer at this log stuff.
So ...
For each of those conceptual marbles, we turn the physics notion of "which state is this in" into the information theory idea of "which symbol are we getting now".
So we turn the number of states into a probability.
Consider $\Omega = 4$ for the picture with one of those balls in one of four states.
The probability of finding it in any one of those four states is $p = 1/4$.
$$ \log \Omega = \log \frac{1}{p} $$
... which is almost Shannon's information entropy formula.
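To make that concrete with the numbers above, measuring in base 2:

$$ \log_2 \Omega = \log_2 4 = \log_2 \frac{1}{1/4} = \log_2 \frac{1}{p} = 2 \text{ bits} $$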
The last step in connecting physics entropy to information entropy is to see that the physics entropy is defined for the whole system, while Shannon's information entropy is defined per symbol in the sequence.
So if we have many symbols $i$ with different probabilities $p_i$, each with an entropy $\log \frac{1}{p_i}$, the last step is to just average them to get
$$ S = \sum_i p_i \log \frac{1}{p_i} $$
Is that really an average? Yes: for any $f(x)$, the average of $f(x)$ is $\sum_x p(x) f(x)$.
To see this, let's take the average of these numbers:
$$ (1, 1, 1, 5, 5, 10) $$
The average is often written as
$$ \frac{ 1 + 1 + 1 + 5 + 5 + 10 }{6} $$
But that can also be written in terms of how many of each we have:
$$ \frac{3 \cdot 1 + 2 \cdot 5 + 1 \cdot 10}{6} = \frac{3}{6} \cdot 1 + \frac{2}{6} \cdot 5 + \frac{1}{6} \cdot 10 $$
Or in other words,
$$ \text{average}(x) = \sum_x p(x) \, x $$
where $p(x)$ is the probability of $x$.
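Here's a quick numeric check of that identity, an added cell using those same six numbers.

from collections import Counter

numbers = [1, 1, 1, 5, 5, 10]

# Plain average: add them all up and divide by how many there are.
plain_average = sum(numbers) / len(numbers)

# Probability-weighted average: sum of p(x) * x over the distinct values.
counts = Counter(numbers)                        # {1: 3, 5: 2, 10: 1}
weighted_average = sum((c / len(numbers)) * x for x, c in counts.items())

print(plain_average, weighted_average)           # both are 23/6, about 3.83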
For binary data of 0's and 1's, if each bit arriving doesn't depend on the previous ones, then our model of the data is just the probability of each. If we let $P_0$ = (probability of 0), then
$$ P_0 + P_1 = 1 $$
Our formula for information entropy is then just
$$ S = P_0 \log_2 \frac{1}{P_0} + P_1 \log_2 \frac{1}{P_1} $$
or
$$ S = P_0 \log_2 \frac{1}{P_0} + (1 - P_0) \log_2 \frac{1}{1 - P_0} $$
And this is something we can plot.
The entropy is highest at "1 bit of information per 1 bit of data" at $P_0 = P_1 = 0.5$ . Anything else has less information per bit.
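Here's a minimal sketch of what a plot_entropy helper could look like; this is my reconstruction (assuming numpy and matplotlib), not necessarily the notebook's original version.

import numpy as np
import matplotlib.pyplot as plt

def plot_entropy():
    """Plot S(P0) = P0*log2(1/P0) + (1-P0)*log2(1/(1-P0))."""
    p0 = np.linspace(0.001, 0.999, 500)            # avoid log(0) at the endpoints
    entropy = p0 * np.log2(1/p0) + (1 - p0) * np.log2(1/(1 - p0))
    with plt.xkcd():                               # hand-drawn xkcd style
        plt.plot(p0, entropy)
        plt.xlabel('$P_0$')
        plt.ylabel('S  (bits of information per bit)')
        plt.title('entropy of a binary sequence')
        plt.show()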
plot_entropy() # .. and who doesn't like plots in xkcd style?
OK, time for a somewhat more realistic example.
This week I've been reading "The Wandering Inn" (wanderinginn.com). Here's the first paragraph.
The inn was dark and empty. It stood, silent,
on the grassy hilltop, the ruins of other structures
around it. Rot and age had brought low other buildings;
the weather and wildlife had reduced stone foundations
to rubble and stout wooden walls to a few rotten pieces
of timber mixed with the ground. But the inn still stood.
Can we find the information entropy of that? And what will that mean?
# Using just the probabilities of individual characters,
# without looking for longer patterns, the analysis looks like this.
from math import log2
text = 'The inn was dark and empty. It stood, silent, ' + \
'on the grassy hilltop, the ruins of other structures ' + \
'around it. Rot and age had brought low other buildings; ' + \
'the weather and wildlife had reduced stone foundations ' + \
'to rubble and stout wooden walls to a few rotten pieces ' + \
'of timber mixed with the ground. But the inn still stood.'
count = {} # {character: count} i.e. {'T':34, ...}
for character in text:
count[character] = 1 + count.get(character, 0)
n_characters = sum(list(count.values()))
probability = {} # {character: probability} i.e. {'T': 0.04, ...}
for character in count:
probability[character] = count[character] / n_characters
entropy = 0
for p in probability.values():
entropy += p * log2(1/p) # information entropy formula !
print('entropy is ', entropy)
entropy is 4.166746762972305
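For comparison, continuing with `text` and `entropy` from the cell above, here's a small added check against a fixed-width code, which spends the same number of bits on every character regardless of how common it is.

from math import log2, ceil

n_distinct = len(set(text))              # distinct characters in the paragraph
fixed_bits = ceil(log2(n_distinct))      # bits per character for a fixed-width code

print('distinct characters        :', n_distinct)
print('fixed-width code would use :', fixed_bits, 'bits per character')
print('entropy estimate above     :', round(entropy, 2), 'bits per character')

The entropy comes out lower, which suggests that a scheme giving shorter codes to more common characters could do better.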
Turns out there's a clever way to construct such a scheme: an encoding that gives shorter codes to more likely symbols.
This scheme gives us a code with the property that no character's code is the start of any other character's code, which means we don't need anything extra to tell the characters apart.
Suppose we have four symbols (00, 01, 10, 11) with probabilities (0.6, 0.2, 0.1, 0.1).
symbol   probability   code
  00        0.6         1
  01        0.2         01
  11        0.1         001
  10        0.1         0000
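To see how well this particular code does, here's a small added check comparing the entropy of those probabilities with the average code length, using the assignments from the table above.

from math import log2

# Probabilities and codes from the four-symbol example above.
probability = {'00': 0.6, '01': 0.2, '10': 0.1, '11': 0.1}
code = {'00': '1', '01': '01', '11': '001', '10': '0000'}

entropy = sum(p * log2(1/p) for p in probability.values())
average_length = sum(probability[s] * len(code[s]) for s in probability)

print('entropy             :', round(entropy, 3), 'bits per symbol')
print('average code length :', round(average_length, 3), 'bits per symbol')

The entropy works out to about 1.57 bits per symbol while this code averages 1.7, so the code comes close to, but cannot beat, the entropy.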
This is the start of information theory, but of course there's still lots more good stuff: