Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped.
Table of contents
Introduction
This article assumes knowledge of at least static huffman. It will
teach how to gain a little bit of bpb and also it will lead to less memory
requirements and also faster decoding.
Canonical Huffman
Canonical huffman defines a set of rules and then based on the lengths
of the codes and the symbols it makes the codes. So you don't need to save
the codes along with the bitstream anymore, with the consequent savings
compression terms. What are those rules which can make the codes? they
are the following:
Pseudo code
The first thing we need are a symbols and a code lengths computed via
huffman, then you can start to do the codes. The pseudo code to do that
isn't that difficult. Look:
Want an example? Let's say we have a file which huffman codes and code
lengths are the following:
Symbol | Length | Code |
|
|
11 |
|
|
00 |
|
|
011 |
|
|
101 |
|
|
0100 |
|
|
0101 |
|
|
1000 |
|
|
10010 |
|
|
10011 |
Note that we'll *not* need the codes, so we don't need to compute them, only the code lengths. Also note in the following process we don't need to keep track of the actual code length. (I assume that you have some where else in memory an array with the symbols and their lengths, and also you've probably reserved space for the codes) Before starting this loop we have to set up an array which contains the number of codes which a given length has, I don't think you need to learn how this is done E-) , for our example it will be: #codes[16]={0,2,2,3,2,0,0,0,0,0,0,0} We have 16 entries coz we assume our codes to be 16 bits long. Let's start to do the start_codes:
Symbol | Length |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Symbol | Length | Start code | Code | Incremented |
|
|
10 | 10 | 11 |
|
|
11 | 11 | 00 |
|
|
010 | 010 | 011 |
|
|
011 | l011 | 100 |
|
|
0001 | 0001 | 0010 |
|
|
0010 | 0l010 | 0011 |
|
|
00000 | 00000 | 00001 |
|
|
00001 | 00001 | l00010 |
|
|
0011 | 0011 | 0100 |
And those codes are instantaneous ones, so they can be decoded with
no problem. Also both the first rule and the second are meet, the second
explicitly, and for the first, look:
Symbol | Length | Code | Value |
|
|
10 | 16 |
|
|
11 | 24 |
|
|
010 | 8 |
|
|
l011 | 12 |
|
|
0001 | 2 |
|
|
0l010 | 4 |
|
|
00000 | 0 |
|
|
00001 | 1 |
|
|
0011 | 6 |
Encoder memory usage
Remember the first rule, and what it ensured: no more ceil(log2(alphabetsize))
bits may be 1. If we are encoding an alphabetsize with 256 symbols (probably
literals) then, no more than 8 bits can be 1 or 0. So if our codes are
16 bits long then the msb will be 0. 00000000XXXXXXXX Where
X means it can be 0 or 1. So when saving the codes in memory we just need
to save the low part of it, coz we know that the high part will be always
0. So when we have to write a code (with a fast
putbits) we check if it's above 8 if it is, then we write the low part,
and for the high just zeros, even we can make an special putbits which
only shifts but don't or. (coz we only write 0) Before that our data structures
looked something like that:
Saving the header
Well, this is the saving of canonical huffman we don't need to store
the codes, of course we still need to pass to the decoder the lengths and
symbols. It can be done in the following way: First there's 16 bytes, which
mean the symbols that a given code length has. After that there's the symbols
starting with those for the code length 1, then 2... in alphabetical order.
It's quite clear how to decode this, isn't it? Of course once you have
the lengths and symbols you do again the codes and you can start to decode.
Closing words
Based on the fact that long codes are filled with zeros may lead to
a fast decoding look at: http://www.compressconsult.com/huffman/
Also you can find another paper about huffman decoding at:
http://www.internz.com/compression-pointers.html
And look for pp600.ps
If you find any error in this article or want something to improve email
me.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.