Algorithms

Spring 2011

March 31

Huffman coding

Review the basic ideas behind Huffman coding.
Do one more example in class: "go go gopher"
Decoding: how can we pass the table to the decoder? How can we do so efficiently? (See, for example, "canonical Huffman".)
Limiting cases: what happens when the number of bits per symbol is very small? Very large?
What specifically are the steps, subroutines, sub-algorithms, and data structures needed to code it? (Mention issues of bit manipulation and byte boundaries.)
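As a starting point for the "what do we need to code it" question, here is a minimal Python sketch of the main pieces: a frequency count, a priority queue to build the tree, a tree walk to read off the codes, and a routine that packs the bits into bytes. It is only an illustration, not the attached huff.c; the names huffman_code and pack_bits are made up here, and passing the table to the decoder is left out entirely.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a dict mapping each symbol in text to a string of '0'/'1'."""
    if not text:
        return {}
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, tree). A tree is either a
    # single symbol (a leaf) or a (left, right) tuple. The unique
    # tie_breaker keeps Python from ever comparing two trees.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two lightest trees.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            # A one-symbol input still needs a code of at least one bit.
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

def pack_bits(bitstring):
    """Pack a '0'/'1' string into bytes; return (data, pad_bit_count)."""
    padding = (-len(bitstring)) % 8      # zero bits added to fill the last byte
    padded = bitstring + "0" * padding
    data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return data, padding

text = "go go gopher"
codes = huffman_code(text)
bits = "".join(codes[ch] for ch in text)
data, padding = pack_bits(bits)
```

For "go go gopher" the encoded message is 32 bits (4 bytes, no padding needed), compared with 96 bits at 8 bits per character. The exact codes depend on how ties are broken, but the total length does not. The padding count matters at the byte boundary: the decoder has to be told how many trailing bits to ignore, or else the length of the original message.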

LZW

History: Dr Ross's Compression Crypt
Walk through Mark Nelson's explanation of LZW encode and decode.
The essential idea is that the encoder builds a table of (code, pattern) pairs by adding patterns of the form Sc (where S is a string already in its table and c is the next character) to the table with the next sequential code.
Since each (code, pattern) is added to the table before it is used to output a code, the decoder doesn't need the table: it can build it up for itself as it goes.
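The encoder loop can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not any particular published implementation: the table starts with the 256 single-character strings as codes 0 to 255, codes are plain integers rather than packed bit fields, and the table never resets. The name lzw_encode is made up here.

```python
def lzw_encode(text):
    """Return the list of integer codes for text (characters below 256)."""
    # Assumed starting table: every single character has its own code.
    table = {chr(i): i for i in range(256)}
    next_code = 256
    s = ""          # longest string matched so far (the "S" in "Sc")
    codes = []
    for c in text:
        if s + c in table:
            # Sc is already known: keep extending the match.
            s = s + c
        else:
            # Output the code for S, then add Sc with the next code.
            codes.append(table[s])
            table[s + c] = next_code
            next_code += 1
            s = c
    if s:
        codes.append(table[s])   # flush whatever is left at end of input
    return codes
```

For example, lzw_encode("abababa") gives [97, 98, 256, 258]: "ab" becomes code 256, "ba" 257, and "aba" 258.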
There are several variations, particularly in variable-width code lengths, when to reset the table if the file's patterns change, etc. Commonly used for image compression, for example in *.gif files.
One of the first widely used compression schemes (though it was patent-encumbered).
The decoding "gotcha":
1) c is a character.
2) S is a string.
3) cS is already in the encoder's table.
4) cSc is not in the table yet.
5) The encoder tries to handle "cScSc":
   i) it outputs the code for cS;
   ii) it adds cSc to the table with a new code;
   iii) but then (starting from that last c) it immediately sees cSc again, and so outputs the new code right away.
The decoder sees the new code before it has a chance to get it into its table.
The solution: this is the only case where the issue comes up, so if the decoder sees an unknown code, it figures out that the code must stand for cSc and adds that to the table. (After all, it just saw cS, and that is in the table.)
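A Python sketch of a decoder that handles this case might look like the following. It makes the same simplifying assumptions as a bare-bones encoder would (codes 0 to 255 preset to the single characters, integer codes, no table reset), and the name lzw_decode is made up for illustration.

```python
def lzw_decode(codes):
    """Rebuild the text from a list of integer LZW codes."""
    # Assumed starting table: codes 0..255 are the single characters.
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    if not codes:
        return ""
    if codes[0] not in table:
        raise ValueError("first code must be a single character")
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The "gotcha": the encoder used a code it had only just
            # created. That pattern must be the previous string followed
            # by its own first character (cS followed by c).
            entry = prev + prev[0]
        else:
            raise ValueError("invalid code: %d" % code)
        out.append(entry)
        # Mirror the encoder: add (previous string + first char of this one).
        table[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)
```

The code list [97, 98, 256, 258] is what "abababa" encodes to under these assumptions. Code 258 arrives before the decoder has defined it, and lzw_decode([97, 98, 256, 258]) still recovers "abababa".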
LZW (and its many variations) adapts to the input, and generally gives better compression than Huffman ... but the two can complement each other; sometimes both are done sequentially. LZW is better at dealing with longer patterns ("cat" shows up much more often than "cta", as the Wikipedia article says), while Huffman works at the symbol level to make common symbols smaller.
Resources :
http://cs.marlboro.edu/courses/spring2011/algorithms/notes/March_31
last modified Thursday March 31 2011 3:02 am EDT

attachments

name            last modified          size
huff.c          Mar 30 2011 11:43 pm   1.07kB
jims_utils.c    Mar 30 2011 11:43 pm   2.10kB
jims_utils.h    Mar 30 2011 11:44 pm   504B
make            Mar 30 2011 11:43 pm   113B
moby_dick.txt   Mar 30 2011 11:43 pm   1.20MB