March 31
Huffman coding
Review the basic ideas behind Huffman coding.
Do one more example in class: "go go gopher"
Decoding: how can we pass the table?
How can we do so efficiently? (See for example "canonical Huffman".)
Limiting cases: What happens when the number of bits per symbol is very small? Very large?
What specifically are the steps,
subroutines, sub-algorithms, data structures
needed to code it? (Mention issues of bit-manipulations and byte boundaries.)
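A minimal Python sketch of those pieces, using "go go gopher" as the input.
The function names (huffman_codes, canonical_codes, pack_bits, decode) are
made up for these notes, the tie-breaking rule in the heap is an arbitrary
choice, and bits are kept as '0'/'1' strings for readability rather than
speed. It assumes non-empty input.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return {symbol: bitstring} built from the symbol counts in text."""
    freq = Counter(text)
    # Heap entries are (weight, unique_id, tree); a tree is a symbol or a (left, right) pair.
    # The unique id breaks ties so that trees themselves are never compared.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # a one-symbol input still needs one bit
    walk(heap[0][2], "")
    return codes

def canonical_codes(lengths):
    """Rebuild codes from {symbol: bit_length} alone (the canonical Huffman idea)."""
    code, prev_len, out = 0, 0, {}
    for sym, n in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= n - prev_len             # moving to a longer length appends zeros
        out[sym] = format(code, f"0{n}b")
        code += 1
        prev_len = n
    return out

def pack_bits(bits):
    """Pack a '0'/'1' string into bytes, zero-padding the last byte; return (data, n_bits)."""
    padded = bits + "0" * (-len(bits) % 8)
    data = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return data, len(bits)

def decode(bits, codes):
    """Decode a bit string with a prefix-free code table."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

text = "go go gopher"
codes = huffman_codes(text)
bits = "".join(codes[ch] for ch in text)
data, n_bits = pack_bits(bits)
print(len(bits), len(data))   # 32 4  (versus 12 bytes uncompressed)
canon = canonical_codes({s: len(c) for s, c in codes.items()})
```

With canonical codes only the code lengths need to be passed to the decoder,
since it can rebuild the same table from them. The true bit count (n_bits)
has to be stored too, so the decoder can ignore the padding in the last byte.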
LZW
Walk through Mark Nelson's explanation of
LZW encode and decode.
The essential idea is that the encoder builds
a table of (code, pattern) pairs by adding patterns
of the form Sc (where S is a string it has
already got in its table and c is the next char)
into the table with the next sequential code.
Since each (code, pattern) is added to the table
before it is used to output a code, the decoder
doesn't need to be sent the table: it can build it up
for itself as it goes.
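A minimal sketch of the encoder side in Python, assuming byte input, a table
that starts with the 256 single-byte patterns, and codes kept as plain
integers (no variable-width packing, no table reset). The name lzw_encode is
made up for these notes.

```python
def lzw_encode(data):
    """Return a list of integer codes for the bytes in data."""
    table = {bytes([i]): i for i in range(256)}   # all single-byte patterns
    next_code = 256
    s = b""                                       # longest match so far
    out = []
    for byte in data:
        c = bytes([byte])
        if s + c in table:
            s += c                                # keep extending the match
        else:
            out.append(table[s])                  # emit the code for S
            table[s + c] = next_code              # add Sc with the next sequential code
            next_code += 1
            s = c                                 # restart the match from c
    if s:
        out.append(table[s])                      # flush the final match
    return out

print(lzw_encode(b"go go gopher"))
# [103, 111, 32, 256, 258, 111, 112, 104, 101, 114]
```

Here 256 is "go" and 258 is " g": 12 input bytes become 10 codes.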
There are several variations,
particularly in the use of variable-width codes and in
when to reset the table if the file's
patterns change. Commonly used for image compression,
for example *.gif files.
One of the first widely used (though it was patent
encumbered) compression schemes.
The decoding "gotcha":
1) c is a character
2) S is a string
3) cS is already in encoder table
4) cSc is not in table yet
5) encoder tries to handle "cScSc" :
i) outputs code for cS
ii) adds cSc to the table with a new code
iii) but then (starting from that last c)
immediately sees cSc again,
and so outputs the new code right away.
The decoder sees the new code before it
has a chance to get it into its table.
The solution: this is the only case where
this issue comes up, so if the decoder
sees an unknown code, it figures out that
it must be cSc and adds that to the table.
(After all, it just saw cS, and that is
in the table.)
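A minimal Python sketch of a decoder that handles this case, under the same
assumptions as a plain textbook LZW: integer codes, an initial table of the
256 single bytes, no variable-width codes and no table reset. The name
lzw_decode is made up for these notes. The code list [97, 98, 256, 258] is
what such an encoder produces for "abababa" (c = "a", S = "b"), so 258
arrives before the decoder has defined it.

```python
def lzw_decode(codes):
    """Rebuild the original bytes from a list of integer LZW codes."""
    table = {i: bytes([i]) for i in range(256)}   # all single-byte patterns
    next_code = 256
    codes = iter(codes)
    try:
        prev = table[next(codes)]                 # first code is always a single byte
    except StopIteration:
        return b""
    out = [prev]
    for code in codes:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The "gotcha": the code is not in the table yet, so it must be
            # the previous pattern plus its own first character (cS + c).
            entry = prev + prev[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        # Mirror the encoder: add (previous pattern + first char of this one).
        table[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return b"".join(out)

print(lzw_decode([97, 98, 256, 258]))   # b'abababa'
```

Any unknown code other than the very next one to be assigned cannot come
from this case, so the sketch treats it as corrupt input.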
LZW (and its many variations) adapts to
the input, and generally gives better compression
than Huffman ... but the two can complement
each other; sometimes both are done sequentially.
LZW is better at dealing with longer patterns
("cat" shows up much more often than "cta",
as the Wikipedia article says), while Huffman
works at the symbol level to make common symbols
smaller.
Resources :