Feb 10

First finish discussion of homework and coding lossless compression.

This site has some nice C code for this stuff :

Dipperstein compression implementations

One issue that Alex and I discussed for run length encoding : what to use as an escape char? (One answer: any repeated char; next byte is 2-255 run length or 0xx for >255 repetitions.)
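
One way to read the "any repeated char" answer is sketched below in Python. This is only an illustration under stated assumptions, not the scheme Alex and I settled on: a doubled byte acts as the escape, the byte after it is the run length (2-255), and runs longer than 255 are simply split into several chunks, since the convention for longer runs is not spelled out here. The names rle_encode and rle_decode are made up for this sketch.

```python
def rle_encode(data: bytes) -> bytes:
    """Run-length encode: a run of n >= 2 equal bytes becomes (c, c, n)."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        run = 1
        # Count the run, capped at 255 so the length fits in one byte
        # (assumption: longer runs are split into several chunks).
        while i + run < len(data) and data[i + run] == c and run < 255:
            run += 1
        if run >= 2:
            out += bytes([c, c, run])   # doubled byte is the escape
        else:
            out.append(c)               # lone byte is sent as-is
        i += run
    return bytes(out)


def rle_decode(coded: bytes) -> bytes:
    """Invert rle_encode; assumes well-formed input."""
    out = bytearray()
    i = 0
    while i < len(coded):
        c = coded[i]
        if i + 1 < len(coded) and coded[i + 1] == c:
            out += bytes([c]) * coded[i + 2]   # next byte is the run length
            i += 3
        else:
            out.append(c)
            i += 1
    return bytes(out)


sample = b"aaaab" + b"z" * 300
assert rle_decode(rle_encode(sample)) == sample
```

The cost of this choice is that a run of exactly two bytes grows to three, which is the usual price of not reserving a dedicated escape value.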

noise

Read chap 8 and 9 in MacKay's book and chap 5 in Biggs' textbook. (There are also some notes of mine attached, from the last time I did this course, which may be helpful.)

Also good (in no particular order) are these summary articles :

an example

Consider transmission of data through a "channel" :

  transmit X -->  (add noise) ---> receive Y

where X and Y are the sets of symbols,

  X = {x1, x2, x3, ....}
  Y = {y1, y2, y3, ....}

To be specific, I'll use a 3 symbol example :

  X = {1, 2, 3}
  Y = {a, b, c}

The idea is that we're going to code [a,b,c] as [1,2,3]. If there were no noise, that would be the end of the story. But in the presence of noise, there is a probability that even though we send "1", what arrives isn't "a".

We can describe the probability of an error by the conditional probabilities. For example,

  P(a|1) = 1 - f     P(a|2) = f/2     P(a|3) = f/3
  P(b|1) = f         P(b|2) = 1-f     P(b|3) = 2f/3
  P(c|1) = 0         P(c|2) = f/2     P(c|3) = 1-f

where f is some small constant, e.g. f=0.02.

Quick quiz: what is P(1|a)?
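
The quiz cannot be answered from the table alone: Bayes' rule also needs the source distribution P(X). The sketch below assumes a uniform source, P(1) = P(2) = P(3) = 1/3, which is my assumption for illustration only, and uses f = 0.02. The names channel, prior and posterior are made up for this sketch.

```python
f = 0.02

# channel[x][y] = P(y|x), copied from the table above.
channel = {
    1: {"a": 1 - f, "b": f,         "c": 0.0},
    2: {"a": f / 2, "b": 1 - f,     "c": f / 2},
    3: {"a": f / 3, "b": 2 * f / 3, "c": 1 - f},
}

# Assumption: all three symbols are sent equally often.
prior = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}


def posterior(x, y):
    """P(x|y) = P(x) P(y|x) / P(y), with P(y) = sum over x of P(x) P(y|x)."""
    p_y = sum(prior[k] * channel[k][y] for k in channel)
    return prior[x] * channel[x][y] / p_y


p_1_given_a = posterior(1, "a")
print(round(p_1_given_a, 4))   # 0.9833
```

With a different P(X) the answer changes, which is the point of the quiz.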

Now the questions are things like:

How much information (bits of info per bits of data) does this noise cost us?
What sorts of coding schemes should we use to maximize the information throughput?
How can the "entropy" concept be extended to explain "information flow" in this case?

This sort of stuff has applications in all sorts of situations: losses for a given signal to noise ratio, image recognition with noisy sensors, data transmission from planetary space probes, ...

And the concepts are exceptionally cool.

definitions and claims

First make sure that the joint and conditional entropies of two variables are understood :

Take two random variables X and Y with joint probability distribution P(x,y), i.e. P(x,y) = probability of seeing both x and y.

The usual probability properties apply :

   P(x) = sum over y P(x,y)
   P(x|y) = P(x,y)/P(y)   <=>   P(x,y) = P(y)*P(x|y)

Then

   H(X,Y) = joint entropy
          = - sum P(x,y) log[ P(x,y) ]

with properties

   H(X) + H(Y) >= H(X,Y) >= max(H(X), H(Y))

the joint entropy is equal to the sum iff X,Y are independent.

The conditional entropy is defined as

   H(X|Y) = - sum P(x,y) log[ P(x|y) ]
          = H(X,Y) - H(Y)
          = entropy of X given Y

and the mutual information is

   I(X;Y) = H(X) + H(Y) - H(X,Y)

much of which is summed up in a nice Venn diagram in the Wikipedia articles and MacKay's book :

 --------------------------------------------------- H(X,Y)
 
 --------------------------- H(X)
 
 ------------------ H(X|Y)
 
                   --------- I(X;Y)
 
                            ------------------------ H(Y|X)
 
                   --------------------------------- H(Y)

Summary :

 H(X,Y) is info (uncertainty) in both
 H(X|Y) is info (uncertainty) remaining in X after Y is known
 I(X;Y) is what is known about X given Y (or vice versa)
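
These quantities can be checked numerically on the 3-symbol example. The sketch below assumes a uniform source P(X) and f = 0.02 (both just illustrative choices) and computes each piece of the diagram in bits. The variable names are made up for this sketch.

```python
from math import log2

f = 0.02

# Rows are x = 1, 2, 3; columns are y = a, b, c; entries are P(y|x).
p_y_given_x = [
    [1 - f, f,         0.0],
    [f / 3 * 1.5, 1 - f, f / 2],   # f/2, 1-f, f/2
    [f / 3, 2 * f / 3, 1 - f],
]
p_x = [1 / 3, 1 / 3, 1 / 3]        # assumption: uniform source


def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Joint distribution P(x,y) = P(x) P(y|x) and the marginal P(y).
joint = [[p_x[i] * p_y_given_x[i][j] for j in range(3)] for i in range(3)]
p_y = [sum(joint[i][j] for i in range(3)) for j in range(3)]

H_X = entropy(p_x)
H_Y = entropy(p_y)
H_XY = entropy([p for row in joint for p in row])
H_X_given_Y = H_XY - H_Y
H_Y_given_X = H_XY - H_X
I_XY = H_X + H_Y - H_XY

# Conditional entropy straight from its definition, -sum P(x,y) log P(x|y).
H_X_given_Y_direct = -sum(
    joint[i][j] * log2(joint[i][j] / p_y[j])
    for i in range(3) for j in range(3) if joint[i][j] > 0
)

for name, value in [("H(X)", H_X), ("H(Y)", H_Y), ("H(X,Y)", H_XY),
                    ("H(X|Y)", H_X_given_Y), ("H(Y|X)", H_Y_given_X),
                    ("I(X;Y)", I_XY)]:
    print(f"{name:7s} = {value:.4f} bits")
```

The bars in the diagram correspond to the identities H(X) = H(X|Y) + I(X;Y) and H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X), which hold for these numbers up to rounding.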

In this noisy channel situation, we take the variables to be

     transmit X -->  (add noise) ---> receive Y

And the probabilities of getting noise are given by the conditional probabilities P(Y|X), which is where we started above.

The source distribution is given by P(X) (if it's a Markov-1 source, which we will assume for now just for simplicity). Given that, we can calculate the mutual information I(X;Y), which lets us know how much we can infer about X given that we see Y.

For a given P(Y|X) which describes the noise, the maximum value of I(X;Y) over P(X) is called the "channel capacity" :

 C = max of I(X;Y) as P(X) varies

and gives (in bits of info per transmitted symbol, which is bits of info per bit of data for a binary channel, assuming the logs are all base 2) the maximum amount of information that can practically be transmitted successfully through that amount of noise.
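
For the 3-symbol example the maximization can be approximated by brute force. The sketch below is a plain grid search over P(X) in steps of 0.01, not a clever algorithm and not something from the textbooks, and it assumes f = 0.02. The names mutual_information and capacity_grid are made up for this sketch.

```python
from math import log2


def mutual_information(p_x, p_y_given_x):
    """I(X;Y) = sum over x,y of P(x,y) log2[ P(y|x) / P(y) ], in bits."""
    n_x, n_y = len(p_x), len(p_y_given_x[0])
    p_y = [sum(p_x[i] * p_y_given_x[i][j] for i in range(n_x))
           for j in range(n_y)]
    total = 0.0
    for i in range(n_x):
        for j in range(n_y):
            p_xy = p_x[i] * p_y_given_x[i][j]
            if p_xy > 0:
                total += p_xy * log2(p_y_given_x[i][j] / p_y[j])
    return total


def capacity_grid(p_y_given_x, steps=100):
    """Approximate C = max over P(X) of I(X;Y) for a 3-input channel
    by trying every P(X) on a grid with spacing 1/steps."""
    best_I, best_p = 0.0, None
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            p_x = [a / steps, b / steps, (steps - a - b) / steps]
            I = mutual_information(p_x, p_y_given_x)
            if I > best_I:
                best_I, best_p = I, p_x
    return best_I, best_p


f = 0.02
noisy = [
    [1 - f, f,         0.0],
    [f / 2, 1 - f,     f / 2],
    [f / 3, 2 * f / 3, 1 - f],
]
C, p_best = capacity_grid(noisy)
print("capacity is about", round(C, 4), "bits per symbol at P(X) =", p_best)
```

With no noise at all (f = 0) the same search returns a value close to log2(3), the most that three symbols can carry, so the gap between that and C is what the noise costs.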

Formally, "successfully" is given by the "noisy channel coding theorem" :

  Consider the data as chunked into coded
  blocks of length K. The noise in the channel
  means that when we decode a block, there 
  is a probability epsilon of having that block
  interpreted incorrectly.
 
  The "noisy channel coding theorem" states that 
  as long as we keep the information transmission
  rate less than the channel capacity C, then 
  for any arbitrarily tiny epsilon, there exists
  a code of (large enough) length K that keeps the
  probability of a block decoding error less than epsilon.

This is proved in both MacKay and Biggs.
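
To see why the theorem is surprising, compare it with the naive way of fighting noise. The sketch below is an illustration of my own, not part of the theorem or its proof: it assumes a binary channel that flips each bit independently with probability f = 0.02, and a repetition code that sends each bit n times and decodes by majority vote. The function name repetition_block_error is made up for this sketch.

```python
from math import comb


def repetition_block_error(n, f):
    """Probability that majority vote over n (odd) copies of one bit is wrong
    when each copy is flipped independently with probability f."""
    return sum(comb(n, k) * f**k * (1 - f)**(n - k)
               for k in range(n // 2 + 1, n + 1))


f = 0.02
for n in (1, 3, 5, 7):
    rate = 1 / n   # information bits per transmitted bit
    print(f"n={n}  rate={rate:.3f}  P(error)={repetition_block_error(n, f):.2e}")
```

The error probability can be made as small as we like this way, but only by letting the rate 1/n fall toward zero. The theorem says that better codes reach arbitrarily small error while keeping the rate fixed at anything below C.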

Now finish my 3-symbol example from the top of the page.

For more examples see David MacKay's Information Theory, Inference, and Learning Algorithms, chap 9.

http://cs.marlboro.edu/courses/spring2012/information/notes/Feb_10
last modified Wednesday February 8 2012 9:25 pm EST

attachments

name last modified size

InformationTheory
