Information Theory

Spring 2017

Feb 23

Go over the lossy homework. My answer to the toy DCT is in four by four answer.

noise

Read chapter 5 in our Biggs "Codes" textbook.
Similar material is in chapters 8 and 9 of MacKay's book, which may be helpful if Biggs isn't clear to you.
Also check out my own noise notes, which may be of use and which use the same notation as Biggs.
Also good (in no particular order) are these summary articles :
For this class, I'm more interested in the concepts and algorithmic calculations than the proofs.

Too many symbols?

Be aware that there are oodles of math-ish definitions and symbols in this chapter. I suggest working a specific problem, and creating a summary for yourself of what all the variables are and what they mean for that problem.
The assignment for next week is posted - basically working through several such problems.

the basic idea

The basic question is this: if errors occur during transmission of a message, how much information is left?
The extreme cases are pretty obvious. (a) If there's no noise, then the information (i.e. entropy) arriving after transmission is whatever was put in. (b) If we're sending bits, and it's so noisy that the probability of a 0 turning to 1 is 0.5 and vice versa, then any information about the source has been lost : every starting signal turns into pure noise.
But realistic cases, where there is a small probability of error, are more complicated. The details involve conditional probability, a new "conditional entropy" concept, and the notion of "channel capacity" : given some noise probabilities, what's the most information you can push through?

an example

Consider transmission of data through a "channel" :
transmit X --> (add noise) ---> receive Y
where X and Y are the sets of symbols,
X = {x1, x2, x3, ....} Y = {y1, y2, y3, ....}
To be specific, I'll use a 3 symbol example :
X = {1, 2, 3} Y = {a, b, c}
The idea is that we're going to code [a,b,c] as [1,2,3]. If there were no noise, that would be the end of the story. But in the presence of noise, there is a probability that even though we send "1", what arrives isn't "a".
We can describe the probability of an error by the conditional probabilities. For example,
P(a|1) = 1 - f    P(a|2) = f/2      P(a|3) = f/3
P(b|1) = f        P(b|2) = 1 - f    P(b|3) = 2f/3
P(c|1) = 0        P(c|2) = f/2      P(c|3) = 1 - f
where f is some small constant, e.g. f=0.02.
(BE CAREFUL: the book uses the transpose of this matrix, in which the rows add to one. Either is a way to describe the data, depending on how you use it - it's just which convention you have.)
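To make the two conventions concrete, here is a minimal Python sketch (my own illustration, not from Biggs) that stores the matrix above with exact fractions, checks that each column adds to one, and builds the book-style transpose. The names channel, column_sums, and transpose are just made up for this example.

```python
from fractions import Fraction

f = Fraction(2, 100)  # the example's f = 0.02, kept exact

# channel[y][x] = P(y|x): rows are received letters, columns are sent numbers.
channel = {
    "a": {1: 1 - f,       2: f / 2, 3: f / 3},
    "b": {1: f,           2: 1 - f, 3: 2 * f / 3},
    "c": {1: Fraction(0), 2: f / 2, 3: 1 - f},
}

def column_sums(channel):
    """Sum P(y|x) over y for each sent symbol x; each sum should be 1."""
    return {x: sum(channel[y][x] for y in channel) for x in (1, 2, 3)}

def transpose(channel):
    """Re-index as book_style[x][y] = P(y|x), so that the rows add to one."""
    return {x: {y: channel[y][x] for y in channel} for x in (1, 2, 3)}

sums = column_sums(channel)
book_style = transpose(channel)
print(sums)
print(book_style[3])
```

Whatever something is sent, something must arrive, which is why the sum over received letters is one for each sent number.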
Quick quiz: what is P(1|a)?
ANSWER: This is a trick question - not enough information has been given. If the JOINT probability distribution P(letter,number) = probability of getting that letter and that number were given, then everything could be calculated. As it is, we have not defined the source probabilities P(1), P(2), P(3), so we don't know P(a) and therefore can't compute P(1|a) = P(a|1) P(1) / P(a).
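Once a source distribution is chosen, the quiz does have an answer. The sketch below ASSUMES all three numbers are sent equally often (that's my choice for illustration, not something given above) and applies Bayes' rule with f = 0.02. The function name posterior is made up for this example.

```python
f = 0.02

# P(y|x) from the example, indexed as p_y_given_x[y][x].
p_y_given_x = {
    "a": {1: 1 - f, 2: f / 2, 3: f / 3},
    "b": {1: f,     2: 1 - f, 3: 2 * f / 3},
    "c": {1: 0.0,   2: f / 2, 3: 1 - f},
}

# ASSUMPTION (not in the notes): a uniform source.
p_x = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}

def posterior(y, p_y_given_x, p_x):
    """Return P(x|y) for every x via Bayes: P(x|y) = P(y|x) P(x) / P(y)."""
    joint = {x: p_y_given_x[y][x] * p_x[x] for x in p_x}  # P(x,y)
    p_y = sum(joint.values())                              # P(y)
    return {x: joint[x] / p_y for x in p_x}

post_a = posterior("a", p_y_given_x, p_x)
print(post_a[1])  # about 0.983 under the uniform-source assumption
```

A different P(X) gives a different P(1|a), which is the point of the trick question.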
Now the questions are things like:
This sort of stuff has applications in all sorts of situations: losses for a given signal to noise ratio, image recognition with noisy sensors, data transmission from planetary space probes, ...
And the concepts are exceptionally cool.

definitions and claims

First make sure that the joint and conditional entropies of two variables are understood :
Given two random variables X and Y, with joint probability distribution P(x,y), i.e. P(x,y) = probability of x and y.
The usual probability properties apply :
P(x) = sum over y of P(x,y)
P(x|y) = P(x,y)/P(y)  <=>  P(x,y) = P(y)*P(x|y)
Then
H(X,Y) = joint entropy = - sum P(x,y) log[ P(x,y) ]
with properties
H(X) + H(Y) >= H(X,Y) >= max(H(X), H(Y))
the joint entropy is equal to the sum iff X,Y are independent.
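A quick numerical check of those properties, as a sketch. The joint distribution here is a hypothetical one I picked just to have numbers; it isn't from the textbook.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution P(x,y) with x and y each in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals: P(x) = sum over y of P(x,y), and likewise for P(y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_X = entropy(p_x.values())
H_Y = entropy(p_y.values())
H_XY = entropy(joint.values())

# Same marginals, but with X and Y made independent: P(x,y) = P(x) P(y).
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
H_indep = entropy(independent.values())

print(H_X + H_Y, H_XY, max(H_X, H_Y))  # the three should be in decreasing order
print(H_indep)                         # should equal H_X + H_Y
```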
The conditional entropy is defined as
H(X|Y) = - sum P(x,y) log[ P(x|y) ] = H(X,Y) - H(Y) = entropy of X given Y
and the mutual information is
I(X;Y) = H(X) + H(Y) - H(X,Y)
much of which is summed up in a nice Venn diagram in the wikipedia articles and MacKay's book :
|--------------------- H(X,Y) ---------------------|
|------------ H(X) -----------|
|---- H(X|Y) ----|-- I(X;Y) --|------ H(Y|X) ------|
                 |-------------- H(Y) -------------|
Summary :
H(X,Y) is info (uncertainty) in both
H(X|Y) is info (uncertainty) remaining in X after Y is known
I(X;Y) is what is known about X given Y (or vice versa)
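Here is a small sketch that computes all of these for one hypothetical joint distribution (my own made-up numbers), getting H(X|Y) straight from its definition with P(x|y) = P(x,y)/P(y) and then comparing with the identities. The helper names marginal and entropy are just for this example.

```python
from math import log2

# Hypothetical joint distribution P(x,y), chosen only for illustration.
joint = {("x1", "y1"): 0.5, ("x1", "y2"): 0.25, ("x2", "y2"): 0.25}

def marginal(joint, index):
    """Sum P(x,y) over the other variable; index 0 gives P(x), 1 gives P(y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[index]] = out.get(pair[index], 0.0) + p
    return out

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

H_X = entropy(p_x.values())
H_Y = entropy(p_y.values())
H_XY = entropy(joint.values())

# Conditional entropy from the definition: - sum P(x,y) log P(x|y).
H_X_given_Y = -sum(p * log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

# Mutual information from the definition.
I_XY = H_X + H_Y - H_XY

print(H_X_given_Y, H_XY - H_Y)  # these two should agree
print(I_XY, H_X - H_X_given_Y)  # and so should these
```

In this made-up case seeing y1 pins down x1 completely, while seeing y2 leaves a coin flip, so half a bit of uncertainty about X remains on average.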
In this noisy channel situation, we take the variables to be
transmit X --> (add noise) ---> receive Y
And the probabilities of getting noise are given by the conditional probabilities P(Y|X), which is where we started above.
The source distribution is given by P(X) (if it's a Markov-1 source, which we will assume for now just for simplicity). Given that, we can calculate the mutual information I(X;Y), which tells us how much we can infer about X given that we see Y.
For a given P(Y|X) which describes the noise, the maximum value of I(X;Y) over P(X) is called the "channel capacity" :
C = max of I(X;Y) as P(X) varies
and gives (in bits of info per bit of data, assuming the logs are all base 2) the maximum amount of information that can practically be transmitted successfully through a channel with that amount of noise.
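As a sketch of what "max of I(X;Y) as P(X) varies" means in practice, here is a brute-force search for the simplest case I can think of: a binary symmetric channel, where each bit flips with probability f. That channel is my choice for illustration (it's the bit-flipping case from "the basic idea" above, not the 3-symbol example), and the function names are made up. For this channel H(Y|X) is just the binary entropy of f, so I(X;Y) = H(Y) - H(Y|X) is easy to write down.

```python
from math import log2

def h2(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def mutual_information_bsc(q, f):
    """I(X;Y) for a binary symmetric channel with P(X=1) = q, flip prob f.

    Uses I(X;Y) = H(Y) - H(Y|X), where H(Y|X) = h2(f) for this channel.
    """
    p_y1 = q * (1 - f) + (1 - q) * f  # P(Y=1)
    return h2(p_y1) - h2(f)

def capacity_bsc(f, steps=1000):
    """Crude grid search over q = P(X=1); returns (max I(X;Y), a best q)."""
    return max((mutual_information_bsc(k / steps, f), k / steps)
               for k in range(steps + 1))

for f in (0.0, 0.02, 0.5):
    C, q = capacity_bsc(f)
    print(f, round(C, 4), 1 - h2(f))
```

The search should land on 1 - h2(f), with the two extremes behaving as described earlier: a full bit per bit when f = 0, and essentially nothing when f = 0.5. A grid search like this only works for tiny alphabets; it's here to show the definition, not as a practical method.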
Formally, "successfully" is given by the "noisy channel coding theorem" :
Consider the data as chunked into coded blocks of length K. The noise in the channel means that when we decode a block, there is a probability epsilon of having that block interpreted incorrectly. The "noisy channel coding theorem" states that as long as we keep the information transmission rate less than the channel capacity C, then for any arbitrarily tiny epsilon, there exists a code of some length K that keeps the probability of a block decoding error less than epsilon.
This is proved in both MacKay and Biggs.
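To get a feel for the quantities in the theorem (rate and block error probability), here is a sketch using the crudest possible code: send each bit n times through a binary symmetric channel with flip probability f, and decode by majority vote. This repetition code is my own illustration and is NOT the kind of code the theorem promises. The function name majority_vote_error is made up.

```python
from math import comb

def majority_vote_error(n, f):
    """P(decoding error) when one bit is sent n times (n odd) through a
    binary symmetric channel with flip probability f and decoded by
    majority vote: the decoder is wrong when more than half the copies flip."""
    return sum(comb(n, k) * f**k * (1 - f)**(n - k)
               for k in range((n + 1) // 2, n + 1))

f = 0.02
for n in (1, 3, 5, 7):
    rate = 1 / n  # information bits per transmitted bit
    print(n, round(rate, 3), majority_vote_error(n, f))
```

The error probability does shrink as n grows, but only because the rate 1/n shrinks toward zero with it. The surprising content of the theorem is that good codes can push the error below any epsilon while the rate stays at any fixed value below C.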
Now finish my 3-symbol example from the top of the page.
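One way to work it through numerically, as a sketch: take f = 0.02 and ASSUME a uniform source P(1) = P(2) = P(3) = 1/3 (the notes above don't fix P(X), so that part is my assumption). The code builds the joint distribution P(x,y) = P(x) P(y|x) and then reads off the entropies and the mutual information.

```python
from math import log2

f = 0.02

# P(y|x) from the example, indexed here as p_y_given_x[x][y].
p_y_given_x = {
    1: {"a": 1 - f, "b": f,         "c": 0.0},
    2: {"a": f / 2, "b": 1 - f,     "c": f / 2},
    3: {"a": f / 3, "b": 2 * f / 3, "c": 1 - f},
}

# ASSUMPTION (not in the notes): a uniform source.
p_x = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Joint distribution P(x,y) = P(x) * P(y|x), then the marginal P(y).
joint = {(x, y): p_x[x] * p
         for x, row in p_y_given_x.items() for y, p in row.items()}
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

H_X = entropy(p_x.values())
H_Y = entropy(p_y.values())
H_XY = entropy(joint.values())
H_X_given_Y = H_XY - H_Y  # uncertainty left in X after seeing Y
H_Y_given_X = sum(p_x[x] * entropy(row.values())
                  for x, row in p_y_given_x.items())
I_XY = H_X + H_Y - H_XY

print(H_X, H_X_given_Y, I_XY)  # I_XY comes out around 1.43 bits
```

So with this much noise, of the log2(3) (about 1.58) bits per symbol put in, roughly 1.43 bits survive for a uniform source. The channel capacity would be the maximum of I_XY over all choices of p_x, which need not be the uniform one.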
For more examples see David MacKay's Information Theory, Inference, and Learning Algorithms, chap 9
http://cs.marlboro.edu/courses/spring2017/info/notes/Feb_23
last modified Thursday February 23 2017 2:00 am EST