Information
Theory

Spring 2012

Feb 10

First finish discussion of homework and coding lossless compression.
This site has some nice C code for this stuff :
One issue that Alex and I discussed for run length encoding : what to use as an escape char? (One answer: any repeated char; next byte is 2-255 run length or 0xx for >255 repetitions.)
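A minimal Python sketch of the "any repeated char is the escape" idea. This is a simplified variant under stated assumptions, not the exact scheme we discussed: a run of 2 to 255 identical bytes is written as the byte twice followed by a one-byte count, and longer runs are simply split into several chunks instead of using an extended count. The names rle_encode and rle_decode are made up for illustration.

```python
def rle_encode(data: bytes) -> bytes:
    """Run-length encode: a doubled byte acts as the escape, followed by a count."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        run = 1
        # Measure the run, capped at 255 so the count fits in one byte.
        while i + run < len(data) and data[i + run] == b and run < 255:
            run += 1
        if run >= 2:
            out += bytes([b, b, run])   # byte, byte, count
        else:
            out.append(b)               # a lone byte is stored as-is
        i += run
    return bytes(out)


def rle_decode(coded: bytes) -> bytes:
    """Inverse of rle_encode (assumes well-formed input)."""
    out = bytearray()
    i = 0
    while i < len(coded):
        b = coded[i]
        if i + 2 < len(coded) and coded[i + 1] == b:
            out += bytes([b]) * coded[i + 2]   # doubled byte: next byte is the count
            i += 3
        else:
            out.append(b)
            i += 1
    return bytes(out)


sample = b"aaaabcc"
assert rle_decode(rle_encode(sample)) == sample
```

Note the cost of this choice of escape: a lone pair such as "cc" grows from 2 bytes to 3.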

noise

Read chap 8 and 9 in MacKay's book and chap 5 in Biggs' textbook. (There are also some notes of mine attached, from the last time I did this course, which may be helpful.)
Also good (in no particular order) are these summary articles :

an example

Consider transmission of data through a "channel" :
transmit X --> (add noise) ---> receive Y
where X and Y are the sets of symbols,
X = {x1, x2, x3, ....} Y = {y1, y2, y3, ....}
To be specific, I'll use a 3 symbol example :
X = {1, 2, 3} Y = {a, b, c}
The idea is that we're going to code [a,b,c] as [1,2,3]. If there were no noise, that would be the end of the story. But in the presence of noise, there is a probability that even though we send "1", what arrives isn't "a".
We can describe the probability of an error by the conditional probabilities. For example,
P(a|1) = 1 - f    P(a|2) = f/2      P(a|3) = f/3
P(b|1) = f        P(b|2) = 1 - f    P(b|3) = 2f/3
P(c|1) = 0        P(c|2) = f/2      P(c|3) = 1 - f
where f is some small constant, e.g. f=0.02.
Quick quiz: what is P(1|a)?
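One way to check an answer to the quiz is Bayes' rule, P(x|y) = P(x) P(y|x) / P(y). The catch is that the answer depends on the source distribution P(X), which hasn't been specified yet; the sketch below assumes all three symbols are equally likely. The dictionary layout and the function name posterior are just illustrative choices.

```python
f = 0.02

# channel[x][y] = P(y|x), copied from the example above
channel = {
    1: {"a": 1 - f, "b": f,         "c": 0.0},
    2: {"a": f / 2, "b": 1 - f,     "c": f / 2},
    3: {"a": f / 3, "b": 2 * f / 3, "c": 1 - f},
}

# ASSUMPTION: a uniform source; the notes do not fix P(X) here.
prior = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}


def posterior(x, y, channel, prior):
    """P(x|y) from Bayes' rule."""
    p_y = sum(prior[k] * channel[k][y] for k in prior)   # P(y) = sum_x P(x) P(y|x)
    return prior[x] * channel[x][y] / p_y


print(posterior(1, "a", channel, prior))
```

With a uniform source this works out to (1-f)/(1-f/6), which is about 0.983 for f=0.02. A different P(X) gives a different answer.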
Now the questions are things like:
This sort of stuff has applications in all sorts of situations: losses for a given signal to noise ratio, image recognition with noisy sensors, data transmission from planetary space probes, ...
And the concepts are exceptionally cool.

definitions and claims

First make sure that the joint and conditional entropies of two variables are understood :
Take two random variables X and Y with joint probability distribution P(x,y), i.e. P(x,y) = probability of x and y.
The usual probability properties apply :
P(x) = sum over y of P(x,y)
P(x|y) = P(x,y)/P(y)  <=>  P(x,y) = P(y)*P(x|y)
Then
H(X,Y) = joint entropy = - sum P(x,y) log[ P(x,y) ]
with properties
H(X) + H(Y) >= H(X,Y) >= max(H(X), H(Y))
The joint entropy equals the sum H(X) + H(Y) iff X and Y are independent.
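These bounds are easy to check numerically on the 3-symbol channel from the example. A small sketch, again assuming a uniform source P(X) (my assumption for illustration) and f=0.02; the helper name entropy is made up here.

```python
from math import log2

f = 0.02
p_x = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}        # ASSUMPTION: uniform source
p_y_given_x = {
    1: {"a": 1 - f, "b": f,         "c": 0.0},
    2: {"a": f / 2, "b": 1 - f,     "c": f / 2},
    3: {"a": f / 3, "b": 2 * f / 3, "c": 1 - f},
}

# Joint distribution P(x,y) = P(x) * P(y|x), and the marginal P(y).
joint = {(x, y): p_x[x] * p_y_given_x[x][y] for x in p_x for y in "abc"}
p_y = {y: sum(joint[x, y] for x in p_x) for y in "abc"}


def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint)
assert h_x + h_y >= h_xy >= max(h_x, h_y)

# For an independent pair with the same marginals the joint entropy is the sum.
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
assert abs(entropy(independent) - (h_x + h_y)) < 1e-12
```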
The conditional entropy is defined as
H(X|Y) = - sum P(x,y) log[ P(x|y) ] = H(X,Y) - H(Y) = entropy of X given Y
and the mutual information is
I(X;Y) = H(X) + H(Y) - H(X,Y)
much of which is summed up in a nice Venn diagram in the wikipedia articles and MacKay's book :
|------------------- H(X,Y) -------------------|
|----------- H(X) -----------|
|---- H(X|Y) -----|- I(X;Y) -|----- H(Y|X) ----|
                  |----------- H(Y) -----------|
Summary :
H(X,Y) is info (uncertainty) in both
H(X|Y) is info (uncertainty) remaining in X after Y is known
I(X;Y) is what is known about X given Y (or vice versa)
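A quick numerical check of these relations on the 3-symbol example. As before, this is only a sketch: the uniform source P(X) and f=0.02 are assumptions for illustration, and the variable names are made up.

```python
from math import log2

f = 0.02
p_x = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}        # ASSUMPTION: uniform source
p_y_given_x = {
    1: {"a": 1 - f, "b": f,         "c": 0.0},
    2: {"a": f / 2, "b": 1 - f,     "c": f / 2},
    3: {"a": f / 3, "b": 2 * f / 3, "c": 1 - f},
}
joint = {(x, y): p_x[x] * p_y_given_x[x][y] for x in p_x for y in "abc"}
p_y = {y: sum(joint[x, y] for x in p_x) for y in "abc"}


def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(joint)

# H(X|Y) straight from the definition, using P(x|y) = P(x,y)/P(y).
h_x_given_y = -sum(p * log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

# Mutual information from the three entropies.
mutual_info = h_x + h_y - h_xy

assert abs(h_x_given_y - (h_xy - h_y)) < 1e-12       # H(X|Y) = H(X,Y) - H(Y)
assert abs(mutual_info - (h_x - h_x_given_y)) < 1e-12  # I(X;Y) = H(X) - H(X|Y)
print(h_x_given_y, mutual_info)
```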
In this noisy channel situation, we take the variables to be
transmit X --> (add noise) ---> receive Y
The probability of noise is given by the conditional probabilities P(Y|X), which is where we started above.
The source distribution is given by P(X) (if it's a Markov-1 source, which we will assume for now just for simplicity). Given that, we can calculate the mutual information I(X;Y), which lets us know how much we can infer about X given that we see Y.
For a given P(Y|X) which describes the noise, the maximum value of I(X;Y) over P(X) is called the "channel capacity" :
C = max of I(X;Y) as P(X) varies
and gives (in bits of info per transmitted symbol, i.e. per bit of data for a binary channel, assuming the logs are all base 2) the maximum amount of information that can practically be transmitted successfully through that amount of noise.
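For a small channel the maximization over P(X) can be done numerically. The sketch below uses the Blahut-Arimoto iteration, which is a standard algorithm for this but not something covered in these notes, applied to the 3-symbol example with f=0.02. The function names (example_channel, output_dist, mutual_information, blahut_arimoto) and the tolerance are my own choices for illustration.

```python
from math import log2


def example_channel(f):
    """Rows are inputs 1, 2, 3; columns are outputs a, b, c; entries are P(y|x)."""
    return [
        [1 - f, f,         0.0],
        [f / 2, 1 - f,     f / 2],
        [f / 3, 2 * f / 3, 1 - f],
    ]


def output_dist(p_x, channel):
    """P(y) = sum over x of P(x) P(y|x)."""
    return [sum(p_x[i] * channel[i][j] for i in range(len(p_x)))
            for j in range(len(channel[0]))]


def mutual_information(p_x, channel):
    """I(X;Y) in bits for source p_x sent through the channel."""
    p_y = output_dist(p_x, channel)
    return sum(p_x[i] * channel[i][j] * log2(channel[i][j] / p_y[j])
               for i in range(len(p_x)) for j in range(len(p_y))
               if p_x[i] > 0 and channel[i][j] > 0)


def blahut_arimoto(channel, tol=1e-9, max_iter=10000):
    """Approximate C = max over P(X) of I(X;Y); returns (capacity, best P(X))."""
    n = len(channel)
    p_x = [1 / n] * n                      # start from the uniform source
    for _ in range(max_iter):
        p_y = output_dist(p_x, channel)
        # c[i] = 2 ** (relative entropy between P(Y|x_i) and the current P(Y))
        c = [2 ** sum(row[j] * log2(row[j] / p_y[j])
                      for j in range(len(row)) if row[j] > 0)
             for row in channel]
        total = sum(p * ci for p, ci in zip(p_x, c))
        gap = log2(max(c)) - log2(total)   # upper bound minus lower bound on C
        p_x = [p * ci / total for p, ci in zip(p_x, c)]   # reweight the source
        if gap < tol:
            break
    return mutual_information(p_x, channel), p_x


capacity, best_p_x = blahut_arimoto(example_channel(0.02))
print(capacity, best_p_x)
```

The result can never exceed log2(3) bits per symbol, which is what the same channel carries when f=0.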
Formally, "successfully" is given by the "noisy channel coding theorem" :
Consider the data as chunked into coded blocks of length K. The noise in the channel means that when we decode a block, there is a probability epsilon of having that block interpreted incorrectly. The "noisy channel coding theorem" states that as long as we keep the information transmission rate less than the channel capacity C, then for any arbitrarily tiny epsilon there exists a code of (large enough) length K that keeps the probability of a block decoding error less than epsilon.
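Written a bit more formally, the statement reads roughly as follows. This is the usual textbook form; the symbols R (the rate), K_0 (a threshold block length) and p_B (the block error probability) are notation introduced here, not used elsewhere in these notes.

```latex
\forall\, \epsilon > 0,\ \ \forall\, R < C,\ \ \exists\, K_0 \ \text{such that}\ \forall\, K \ge K_0 :
\quad \text{there exists a code of block length } K \text{ and rate} \ge R
\ \text{with block error probability } p_B < \epsilon .
```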
This is proved in both MacKay and Biggs.
Now finish my 3-symbol example from the top of the page.
For more examples see David MacKay's Information Theory, Inference, and Learning Algorithms, chap 9
http://cs.marlboro.edu/courses/spring2012/information/notes/Feb_10
last modified Wednesday February 8 2012 9:25 pm EST

attachments

   name        last modified         size
   noise.nb    Feb 8 2012 9:24 pm    57.3kB
   noise.pdf   Feb 8 2012 9:24 pm    153kB