Jan 24 : probability & start entropy
books
Discuss the course texts, all listed on the
resources page:
- Biggs "Codes: ..."
- Shannon's "Mathematical Theory of Communication"
- MacKay's "Information Theory ..."
- wikipedia: information entropy ... and other topics & references there
First is very math-ish; 2nd is great but of limited scope;
3rd is very good at times but wordy and mostly aimed at another topic.
This subject can get very technical (e.g. Cover & Thomas).
overview
- compression
- error correction
- crypto
We'll start with entropy (defining it, calculating it)
and then move to compression (Huffman, LZW, etc).
python comments
I'll be using Jupyter Python notebooks for some of the numerical work and plotting.
I would encourage you to explore this platform if you're not familiar with it -
quite nice for this sort of stuff.
And for some fancier data libraries, check out
this week - entropy & preliminary concepts
(The text doesn't really finish entropy until after it discusses Huffman coding ...
I'll be doing things the other way around.)
Discuss these terms and ideas :
alphabet, string, message, word
code
uniquely decodable (UD)
prefix-free (PF)
optimal code & "average word length"
information entropy = sum( - p[i] log(p[i]) ), where p[i] = probability of symbol i
Huffman code
source, probability, conditional probability
In particular, go over conditional probability.
Here's a tiny example :
Say you have 3 marbles:
(big red)
(big blue)
(small red)
Then what is
P(red) = ?
P(blue) = ?
P(red & big) = ?
P(red | big) = ?
P(big | red) = ?
How is P(big & red) related to P(big | red) ?
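One way to check your answers is to count marbles directly. This is a minimal sketch, assuming each of the 3 marbles is equally likely to be drawn; the helper names (prob, is_red, is_big, is_blue) are made up for illustration.

```python
from fractions import Fraction

# The three marbles from the example, as (size, color) pairs.
marbles = [("big", "red"), ("big", "blue"), ("small", "red")]

def prob(event, given=lambda m: True):
    """P(event | given), assuming every marble is equally likely."""
    pool = [m for m in marbles if given(m)]      # restrict to the condition
    hits = [m for m in pool if event(m)]         # count matches inside it
    return Fraction(len(hits), len(pool))

def is_red(m):
    return m[1] == "red"

def is_blue(m):
    return m[1] == "blue"

def is_big(m):
    return m[0] == "big"

def is_red_and_big(m):
    return is_red(m) and is_big(m)

print("P(red)       =", prob(is_red))
print("P(blue)      =", prob(is_blue))
print("P(red & big) =", prob(is_red_and_big))
print("P(red | big) =", prob(is_red, given=is_big))
print("P(big | red) =", prob(is_big, given=is_red))

# The joint probability is the conditional times the probability of the condition.
print(prob(is_red_and_big) == prob(is_big, given=is_red) * prob(is_red))
```

The last line checks the product rule P(big & red) = P(big | red) * P(red) on this example.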
You'll need to understand this to do the homework due Thursday ...
so let's talk about what I'm expecting you to do.
intuition
In basic physics, entropy of a system is ln[number of states].
- why logarithm? Answer: we want it to add, for 2 systems, but number of states multiplies.
  - what is it, really? It measures how likely a set of states with the same macroscopic properties is.
  - temperature as a measure of how entropy changes with energy (1/T = dS/dE)
How does this connect with Shannon's entropy?
- First idea : if something can happen N ways, then
ln(N) = - ln(1/N) = - ln(p)
  - Second idea: not *total* entropy, but entropy *per symbol*. So we need to average.
- Definition of average: (make sure this is clear)
mean(x) = sum p(x) * x
H = - sum p_i * ln(p_i)
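The two ideas above can be tried out numerically. This is a small sketch, not a definitive implementation: the function name entropy and its base argument are my own choices, and the probabilities are made-up examples.

```python
import math

def entropy(probs, base=math.e):
    """H = - sum p_i * log(p_i), skipping zero-probability symbols."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# First idea: N equally likely outcomes give H = ln(N).
N = 8
uniform = [1 / N] * N
print(entropy(uniform), math.log(N))

# Why a logarithm: the number of states multiplies, but the logs add.
print(math.log(6 * 4), math.log(6) + math.log(4))

# Second idea: H is the average of -log(p), weighted by p.
fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(entropy(fair, base=2))     # entropy in bits per symbol
print(entropy(biased, base=2))   # smaller: the outcome is more predictable
```

Using base 2 gives the answer in bits, while the natural log matches the physics convention.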
An example
Suppose you have this string of bits :
0 1 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 1
And suppose that you would like to calculate its entropy.
First issue: one specific string doesn't really have any entropy, any more than a given number can be random or not.
Aside:
What does it mean for a number to be random?
Answer:
The question misses the point. A process which chooses numbers can be random.
And we often then say that the numbers it produces are random. But numbers
themselves don't have the luxury of being random or not. So when someone
says "Pick a random number between 1 and 10", what they should be saying
is "Randomly pick a number between 1 and 10".
So now imagine that that string of bits is one example of a much longer stream of bits,
with one generated after another, left to right.
We can then use that example to try to find a model of a process that could generate those bits.
And it's that model that we use to generate a number for the entropy.
By "process" here I mean a probabilistic model of what bits (or words, if the
bits are chunked together) are generated with what probabilities. We'll use
the knowledge of the bits seen previously to hopefully get the best model
that we can.
In the formal languages course last term, we saw another definition of "information"
of a string or set of strings, connected to the minimal size of a Turing machine needed to generate it.
That definition was fundamentally satisfying but impossible to calculate.
Shannon's entropy is based on a probabilistic model of predicting the next bit in
the stream. If the bits were digits of pi, that would be a really bad model,
since they look pretty random but in fact aren't. The notion of entropy
we're developing here would not be able to tell that digits of pi can
be calculated and are therefore predictable and therefore carry little information.
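To see the point about pi concretely, here is a rough sketch that treats the first 50 decimal digits of pi as if they came from a random source and estimates the entropy from the digit frequencies alone. The hardcoded digit string and the name digit_entropy are just for illustration, and 50 digits is far too few for a serious estimate.

```python
import math
from collections import Counter

# First 50 decimal digits of pi (after the "3."), hardcoded for illustration.
PI_DIGITS = "14159265358979323846264338327950288419716939937510"

def digit_entropy(digits):
    """Entropy in bits per digit, using observed frequencies as probabilities."""
    counts = Counter(digits)
    total = len(digits)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

h = digit_entropy(PI_DIGITS)
h_max = math.log2(10)   # entropy of ten equally likely digits

print(f"estimated entropy: {h:.3f} bits per digit")
print(f"maximum possible:  {h_max:.3f} bits per digit")
```

The frequency-based estimate comes out close to the maximum, even though a short program can generate every one of these digits.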
With all that in mind, let's do the best we can with this example to get a
model of the probabilities that it implies.
... coming in class ...
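As one possible starting point (an assumption on my part, and only the simplest of many models), treat each bit as independent of the ones before it and estimate P(0) and P(1) from the observed counts. The variable names here are arbitrary.

```python
import math
from collections import Counter

# The example string of bits from above.
bits = "0 1 0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 1".split()

# Simplest model: independent bits, probabilities taken from the frequencies.
counts = Counter(bits)
total = len(bits)
probs = {symbol: n / total for symbol, n in counts.items()}

# Entropy of that model, in bits per symbol.
H = -sum(p * math.log2(p) for p in probs.values())

print("counts:", dict(counts))
print("probabilities:", probs)
print(f"entropy: {H:.3f} bits per symbol")
```

A better model might look at pairs of bits, or at the probability of a bit given the previous one, which is where conditional probability comes back in.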