Feb 23
noise
Read chapter 5 in our Biggs "Codes" textbook.
Similar material is in chap 8 and 9 from
MacKay's book
which may be helpful if Biggs isn't clear to you.
Also good (in no particular order) are these summary articles :
For this class, I'm more interested in the concepts
and algorithmic calculations than the proofs.
Too many symbols?
Be aware that there are oodles of math-ish definitions and
symbols in this chapter. I suggest working a specific
problem, and creating a summary for yourself of what
all the variables are and what they mean for that problem.
The assignment for next week is posted - basically
working through several such problems.
the basic idea
The basic question is this: if errors occur during transmission
of a message, how much information is left?
The extreme cases are pretty obvious. (a) If there's no noise,
then the information (i.e. entropy) arriving after transmission
is whatever was put in. (b) If we're sending bits, and it's
so noisy that the probability of a 0 turning to 1 is 0.5
and vice versa, then any information about the source has
been lost : every starting signal turns into pure noise.
But realistic cases, where there is a small probability of error,
are more complicated. The details involve conditional probability,
a new "conditional entropy" concept, and the notion of
"channel capacity" : given some noise probabilities, what's
the most information you can push through?
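Here is a quick numeric check of those two extreme cases, as a minimal sketch. It assumes a binary channel that flips each bit with probability f and a source that sends 0 and 1 equally often; for that case the information surviving per transmitted bit works out to 1 - H2(f), where H2 is the binary entropy. The function names are mine, not the book's.

```python
from math import log2

def binary_entropy(p):
    """H2(p) in bits; the 0*log(0) terms are taken to be 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bits_surviving(f):
    """Information per transmitted bit that reaches the receiver.

    ASSUMPTIONS: each bit is flipped with probability f (same for 0 and 1),
    and the source sends 0 and 1 with probability 1/2 each.
    """
    return 1.0 - binary_entropy(f)

for f in (0.0, 0.02, 0.5):
    print("f =", f, " bits surviving per bit sent =", round(bits_surviving(f), 4))
```

With f = 0 the full bit gets through, and with f = 0.5 nothing does, matching cases (a) and (b) above.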
an example
Consider transmission of data through a "channel" :
transmit X --> (add noise) ---> receive Y
where X and Y are the sets of symbols,
X = {x1, x2, x3, ....}
Y = {y1, y2, y3, ....}
To be specific, I'll use a 3 symbol example :
X = {1, 2, 3}
Y = {a, b, c}
The idea is that we're going to code [a,b,c] as [1,2,3], so a transmitted "1" is supposed to arrive as "a", and so on.
If there were no noise, that would be the end of the story.
But in the presence of noise, there is a probability
that even though we send "1", what arrives isn't "a".
We can describe the probability of an error by the
conditional probabilities. For example,
    P(a|1) = 1 - f     P(a|2) = f/2       P(a|3) = f/3
    P(b|1) = f         P(b|2) = 1 - f     P(b|3) = 2f/3
    P(c|1) = 0         P(c|2) = f/2       P(c|3) = 1 - f
where f is some small constant, e.g. f=0.02.
(BE CAREFUL: here each column adds to one.
The book uses the transpose of this matrix,
in which each row adds to one.
Either is a way to describe the data, depending
on how you use it - it's just a matter of which
convention you have.)
Quick quiz: what is P(1|a)?
ANSWER: This is a trick question - not enough information has been given. If we knew the JOINT probability distribution P(letter,number) = probability of getting that letter and that number, then everything could be calculated. As it is, we have not defined P(1), P(2), P(3), so we can't get P(a), and therefore we don't know P(1|a).
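To see what the missing piece buys us, here is a minimal sketch that supplies a source distribution and then applies P(x|y) = P(y|x) P(x) / P(y). The uniform source (each of 1, 2, 3 sent a third of the time) is purely my assumption for illustration; a different P(X) gives a different answer, which is the point of the quiz.

```python
f = 0.02  # the small noise constant from the example

# channel[y][x] = P(y|x); each column (fixed x) sums to one
channel = {
    "a": {1: 1 - f, 2: f / 2, 3: f / 3},
    "b": {1: f,     2: 1 - f, 3: 2 * f / 3},
    "c": {1: 0.0,   2: f / 2, 3: 1 - f},
}

# ASSUMPTION (not given in the notes): the three symbols are sent equally often.
source = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}

def posterior(y, channel, source):
    """Return {x: P(x|y)} for a received symbol y."""
    joint = {x: channel[y][x] * source[x] for x in source}  # P(x,y) = P(y|x) P(x)
    p_y = sum(joint.values())                                # P(y) = sum over x
    return {x: joint[x] / p_y for x in joint}

print("P(x|a) =", posterior("a", channel, source))
```

Changing `source` to something lopsided, say {1: 0.1, 2: 0.1, 3: 0.8}, shifts P(1|a) noticeably even though the channel itself is unchanged.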
Now the questions are things like:
- How much information (bits of info per bits of data) does this noise cost us?
- What sorts of coding schemes should we use to maximize the information throughput?
- How can the "entropy" concept be extended to explain "information flow" in this case?
This sort of stuff has applications in all sorts of situations: losses for a given signal to noise ratio, image recognition with noisy sensors, data transmission from planetary space probes, ...
And the concepts are exceptionally cool.
definitions and claims
First make sure that the joint and conditional entropies
of two variables are understood :
Given two random variables X and Y,
with joint probability distribution P(x,y),
i.e. P(x,y) = probability of x and y.
The usual probability properties apply :
P(x) = sum over y P(x,y)
P(x|y) = P(x,y)/P(y) <=> P(x,y) = P(y)*P(x|y)
Then
H(X,Y) = joint entropy
= - sum P(x,y) log[ P(x,y) ]
with properties
H(X) + H(Y) >= H(X,Y) >= max(H(X), H(Y))
the joint entropy is equal to the sum iff X,Y are independent.
The conditional entropy is defined as
H(X|Y) = - sum P(x,y) log[ P(x|y) ]
= H(X,Y) - H(Y)
= entropy of X given Y
and the mutual information is
I(X;Y) = H(X) + H(Y) - H(X,Y)
much of which is summed up in a nice Venn-style diagram in the
Wikipedia articles and MacKay's book :
 ---------------------------------------------------  H(X,Y)
 ---------------------------                          H(X)
 ------------------                                   H(X|Y)
                   ---------                          I(X;Y)
                            ------------------------  H(Y|X)
                   ---------------------------------  H(Y)
Summary :
H(X,Y) is info (uncertainty) in both
H(X|Y) is info (uncertainty) remaining in X after Y is known
I(X;Y) is what is known about X given Y (or vice versa)
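All of these quantities fall out of the joint distribution, so it's worth computing them once by hand and once by machine. Below is a minimal sketch; the function names and the little 2x2 joint distribution are made up for illustration, and logs are base 2 so everything is in bits.

```python
from math import log2

def entropy(probs):
    """Entropy in bits of a collection of probabilities (zeros are skipped)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def entropy_summary(joint):
    """joint is a dict {(x, y): P(x,y)}. Returns the quantities defined above."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p   # P(x) = sum over y of P(x,y)
        p_y[y] = p_y.get(y, 0.0) + p   # P(y) = sum over x of P(x,y)
    h_x, h_y = entropy(p_x.values()), entropy(p_y.values())
    h_xy = entropy(joint.values())
    return {
        "H(X)": h_x,
        "H(Y)": h_y,
        "H(X,Y)": h_xy,
        "H(X|Y)": h_xy - h_y,
        "H(Y|X)": h_xy - h_x,
        "I(X;Y)": h_x + h_y - h_xy,
    }

# Made-up joint distribution: X and Y agree 80% of the time.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
for name, value in entropy_summary(joint).items():
    print(name, "=", round(value, 4))
```

A good sanity check is to feed in an independent pair (P(x,y) = P(x)P(y)) and confirm that I(X;Y) comes out as zero and H(X,Y) = H(X) + H(Y).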
In this noisy channel situation, we
take the variables to be
transmit X --> (add noise) ---> receive Y
And the probabilities of getting noise are given
by the conditional probabilities P(Y|X), which
is where we started above.
The source distribution is given by P(X)
(if it's a Markov-1 source, which we will
assume for now just for simplicity). Given that, we
can calculate the mutual information I(X;Y)
which lets us know how much we can infer about
X given that we see Y.
For a given P(Y|X) which describes the noise,
the maximum value of I(X;Y) over P(X) is
called the "channel capacity" :
C = max of I(X;Y) as P(X) varies
and gives (in bits of info per bit of data, assuming
the logs are all base 2) the maximum amount of
information that can practically be transmitted
successfully over a channel with that amount of noise.
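One crude way to see the "max over P(X)" in action is to try lots of source distributions and keep the best one. The sketch below does that by brute force for a 3-input channel, using the 3-symbol noise table from earlier with f = 0.02. The grid search is my stand-in for illustration, not the method the book uses, and its answer is only as fine as the grid; the function names are hypothetical.

```python
from math import log2

def mutual_information(source, channel):
    """I(X;Y) in bits. source[x] = P(x); channel[x][y] = P(y|x)."""
    n_y = len(channel[0])
    p_y = [sum(source[x] * channel[x][y] for x in range(len(source)))
           for y in range(n_y)]
    total = 0.0
    for x, px in enumerate(source):
        for y in range(n_y):
            pxy = px * channel[x][y]          # P(x,y)
            if pxy > 0:
                total += pxy * log2(channel[x][y] / p_y[y])
    return total

def capacity_grid(channel, steps=100):
    """Brute-force estimate of C = max over P(X) of I(X;Y), 3 inputs only."""
    best_i, best_source = -1.0, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            source = (i / steps, j / steps, (steps - i - j) / steps)
            value = mutual_information(source, channel)
            if value > best_i:
                best_i, best_source = value, source
    return best_i, best_source

f = 0.02
# Each ROW here is one input (1, 2, 3), so rows sum to one (the book's convention).
channel = [
    [1 - f, f,         0.0],
    [f / 2, 1 - f,     f / 2],
    [f / 3, 2 * f / 3, 1 - f],
]
c, best = capacity_grid(channel)
print("capacity is roughly", round(c, 3), "bits per symbol, at P(X) =", best)
print("noiseless limit would be log2(3) =", round(log2(3), 3))
```

Since there are three symbols, the result is in bits per symbol, and it has to come out below the noiseless value log2(3).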
Formally, "successfully" is given by the
"noisy channel coding theorem" :
Consider the data as chunked into coded
blocks of length K. The noise in the channel
means that when we decode a block, there
is a probability epsilon of having that block
interpreted incorrectly.
The "noisy channel coding theorem" states that
as long as we're keeping the information transmission
rate less than the channel capacity C, then
for any arbitrarily tiny epsilon, there exists
a code of length K that keeps the probability
of a block decoding error less than epsilon.
This is proved in both MacKay and Biggs.
Now finish my 3-symbol example from the top of the page.
For more examples see David MacKay's
Information Theory, Inference, and Learning Algorithms, chap 9