Feb 10
First finish discussion of homework and coding lossless compression.
This site has some nice C code for this stuff :
One issue that Alex and I discussed for run length encoding : what to use as an escape char? (One answer: any repeated char; next byte is 2-255 run length or 0xx for >255 repetitions.)
noise
Read chap 8 and 9 in
MacKay's book
and chap 5 in Biggs' textbook. (There are also some notes of mine attached,
from last time I did this course, which may be helpful.)
Also good (in no particular order) are these summary articles :
an example
Consider transmission of data through a "channel" :
transmit X --> (add noise) ---> receive Y
where X and Y are the sets of symbols,
X = {x1, x2, x3, ....}
Y = {y1, y2, y3, ....}
To be specific, I'll use a 3 symbol example :
X = {1, 2, 3}
Y = {a, b, c}
The idea is that the sent symbols [1,2,3] are meant to arrive as [a,b,c] respectively.
If there were no noise, that would be the end of the story.
But in the presence of noise, there is a probability
that even though we send "1", what arrives isn't "a".
We can describe the probability of an error by the
conditional probabilities. For example,
  P(a|1) = 1 - f    P(a|2) = f/2      P(a|3) = f/3
  P(b|1) = f        P(b|2) = 1 - f    P(b|3) = 2f/3
  P(c|1) = 0        P(c|2) = f/2      P(c|3) = 1 - f
where f is some small constant, e.g. f=0.02.
Quick quiz: what is P(1|a)?
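One way to check an answer to the quiz numerically is Bayes' rule, P(1|a) = P(a|1) P(1) / (sum over x of P(a|x) P(x)). The answer depends on the source distribution P(X), which the example does not specify, so the sketch below assumes a uniform source. The names channel, prior, and posterior are only illustrative.

```python
# Conditional probabilities P(y|x) from the 3-symbol example, with f = 0.02.
f = 0.02
channel = {
    1: {"a": 1 - f, "b": f, "c": 0.0},
    2: {"a": f / 2, "b": 1 - f, "c": f / 2},
    3: {"a": f / 3, "b": 2 * f / 3, "c": 1 - f},
}

# ASSUMPTION: a uniform source P(x) = 1/3; the notes do not fix P(X).
prior = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}


def posterior(x, y):
    """Bayes' rule: P(x|y) = P(y|x) P(x) / P(y), with P(y) = sum_x P(y|x) P(x)."""
    p_y = sum(channel[xi][y] * prior[xi] for xi in prior)
    return channel[x][y] * prior[x] / p_y


print(posterior(1, "a"))  # about 0.983 under the uniform-source assumption
```

With a different P(X) the number changes, which is the point of the quiz: P(1|a) is not determined by the channel alone.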
Now the questions are things like:
- How much information (bits of info per bits of data) does this noise cost us?
- What sorts of coding schemes should we use to maximize the information throughput?
- How can the "entropy" concept be extended to explain "information flow" in this case?
This sort of stuff has applications in all sorts of situations: losses for a given signal to noise ratio, image recognition with noisy sensors, data transmission from planetary space probes, ...
And the concepts are exceptionally cool.
definitions and claims
First make sure that joint and conditional entropy
of two variables is understood :
Given two random variables X and Y,
with joint probability distribution P(x,y),
i.e. P(x,y) = probability of x and y.
The usual probability properties apply :
P(x) = sum over y P(x,y)
P(x|y) = P(x,y)/P(y) <=> P(x,y) = P(y)*P(x|y)
Then
H(X,Y) = joint entropy
= - sum P(x,y) log[ P(x,y) ]
with properties
H(X) + H(Y) >= H(X,Y) >= max(H(X), H(Y))
the joint entropy is equal to the sum iff X,Y are independent.
The conditional entropy is defined as
H(X|Y) = - sum P(x,y) log[ P(x|y) ]
= H(X,Y) - H(Y)
= entropy of X given Y
and the mutual information is
I(X;Y) = H(X) + H(Y) - H(X,Y)
much of which is summed up in a nice Venn diagram in the
Wikipedia articles and MacKay's book :
--------------------------------------------------- H(X,Y)
--------------------------- H(X)
------------------ H(X|Y)
                  --------- I(X;Y)
                           ------------------------ H(Y|X)
                  --------------------------------- H(Y)
Summary :
H(X,Y) is info (uncertainty) in both
H(X|Y) is info (uncertainty) remaining in X after Y is known
I(X;Y) is what is known about X given Y (or vice versa)
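As a rough numerical check of these definitions, here is a minimal sketch that computes each quantity from a joint distribution. The joint table is an arbitrary illustrative assumption (not from the notes), and the helper names entropy and marginal are made up for this example.

```python
from math import log2

# ASSUMPTION: an arbitrary illustrative joint distribution P(x,y); any table
# of non-negative numbers summing to 1 works here.
joint = {
    ("x1", "y1"): 0.4, ("x1", "y2"): 0.1,
    ("x2", "y1"): 0.1, ("x2", "y2"): 0.4,
}


def entropy(dist):
    """H = - sum p log2(p), skipping zero-probability entries."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, index):
    """Sum P(x,y) over the other variable; index 0 gives P(x), 1 gives P(y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[index]] = out.get(pair[index], 0.0) + p
    return out


h_xy = entropy(joint)
h_x = entropy(marginal(joint, 0))
h_y = entropy(marginal(joint, 1))
h_x_given_y = h_xy - h_y          # H(X|Y) = H(X,Y) - H(Y)
mutual_info = h_x + h_y - h_xy    # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(f"H(X)={h_x:.3f} H(Y)={h_y:.3f} H(X,Y)={h_xy:.3f}")
print(f"H(X|Y)={h_x_given_y:.3f} I(X;Y)={mutual_info:.3f}")
```

Changing the table so that P(x,y) = P(x)P(y) should drive I(X;Y) to zero, matching the independence claim above.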
In this noisy channel situation, we
take the variables to be
transmit X --> (add noise) ---> receive Y
The probability of noise is given
by the conditional probabilities P(Y|X), which
is where we started above.
The source distribution is given by P(X)
(if it's a Markov-1 source, which we will
assume for now just for simplicity). Given that, we
can calculate the mutual information I(X;Y),
which lets us know how much we can infer about
X given that we see Y.
For a given P(Y|X) which describes the noise,
the maximum value of I(X;Y) over P(X) is
called the "channel capacity" :
C = max of I(X;Y) as P(X) varies
It gives (in bits of info per transmitted symbol, or
per bit of data for a binary channel, assuming
the logs are all base 2) the maximum amount of
information that can be transmitted
successfully through a channel with that amount of noise.
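For the 3-symbol example, the capacity can be estimated by brute force: evaluate I(X;Y) on a grid of source distributions P(X) and keep the largest value. This is only a sketch under assumptions: the grid search is a stand-in for a proper maximization, the function names and the steps parameter are made up here, and I(X;Y) is written in the equivalent form sum P(x) P(y|x) log2[ P(y|x) / P(y) ].

```python
from math import log2

f = 0.02
# Rows are P(y|x) for x = 1, 2, 3; columns are y = a, b, c (from the example).
channel = [
    [1 - f, f, 0.0],
    [f / 2, 1 - f, f / 2],
    [f / 3, 2 * f / 3, 1 - f],
]


def mutual_information(p_x, channel):
    """I(X;Y) = sum_{x,y} P(x) P(y|x) log2( P(y|x) / P(y) ), in bits."""
    n_y = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(n_y)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_y):
            pyx = channel[x][y]
            if px > 0 and pyx > 0:  # zero-probability terms contribute nothing
                total += px * pyx * log2(pyx / p_y[y])
    return total


def capacity_grid(channel, steps=60):
    """Brute-force max of I(X;Y) over P(X) on a grid of the 3-symbol simplex."""
    best_i, best_p = -1.0, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            p_x = [i / steps, j / steps, (steps - i - j) / steps]
            value = mutual_information(p_x, channel)
            if value > best_i:
                best_i, best_p = value, p_x
    return best_i, best_p


c_est, p_best = capacity_grid(channel)
print(f"C is roughly {c_est:.4f} bits per symbol at P(X) = {p_best}")
```

The estimate can never exceed log2(3), the value for a noiseless 3-symbol channel; a finer grid (larger steps) tightens it from below.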
Formally, "successfully" is given by the
"noisy channel coding theorem" :
Consider the data as chunked into coded
blocks of length K. The noise in the channel
means that when we decode a block, there
is a probability epsilon of having that block
interpreted incorrectly.
The "noisy channel coding theorem" states that
as long as we keep the information transmission
rate less than the channel capacity C, then
for any arbitrarily tiny epsilon, there exists
a code of some (possibly large) block length K that keeps the probability
of a block decoding error less than epsilon.
This is proved in both MacKay and Biggs.
Now finish my 3-symbol example from the top of the page.
For more examples see David MacKay's
Information Theory, Inference, and Learning Algorithms, chap 9