"Canonical Huffman"
by
Arturo San Emeterio Campos







Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 


Table of contents

  • Introduction
  • Canonical Huffman
  • Pseudo code
  • Encoder memory usage
  • Saving the header
  • Closing words
  • Contacting the author


    Introduction
This article assumes knowledge of at least static Huffman coding. It will teach you how to save a little bit of bpb (bits per byte), and it will also lead to lower memory requirements and faster decoding.
     

    Canonical Huffman
Canonical Huffman defines a set of rules and then, based only on the lengths of the codes and the symbols, it makes the codes. So you don't need to save the codes along with the bitstream anymore, with the consequent savings in compression terms. What are those rules which make the codes? They are the following:

1. Codes with a shorter code length have higher values than longer ones.
2. Codes with the same length increase in code value as the symbol value increases.
Those rules make for efficient decoding, and even efficient coding, coz no more than ceil(log2(alphabetsize)) bits of a code may be 1. The way this formula is interpreted is the following: ceil() means rounding up to the nearest integer, and log2 is the logarithm in base 2. If you can't perform log2 directly, you can do:   log2(x) = log10(x) / log10(2)   Most calculators only support log base 10, so it may be useful. Alphabetsize is the number of symbols which you'll code; for an alphabet of 256 symbols, ceil(log2(256)) = 8.
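
If you prefer to see that formula as code, here is a minimal C sketch (the function name is mine, just for illustration):

    #include <math.h>

    /* Upper bound on the number of bits that may be 1 in any canonical
       code, for an alphabet of 'alphabetsize' symbols. The small
       epsilon guards against floating-point rounding at exact powers
       of two. For 256 symbols this returns 8. */
    int max_one_bits(int alphabetsize)
    {
        return (int)ceil(log10((double)alphabetsize) / log10(2.0) - 1e-9);
    }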
     

    Pseudo code
The first things we need are the symbols and their code lengths, computed via Huffman; then you can start to make the codes. The pseudo code to do that isn't that difficult. Look:

Sort all the lengths and symbols alphabetically. Then go thru them: assign to the current code the start code value of its length, then increment that start code value and repeat. (Note that by incrementing the code value you are meeting the second rule.)

Want an example? Let's say we have a file whose Huffman codes and code lengths are the following:
     
Symbol  Length  Code
a       2       11
b       2       00
c       3       011
d       3       101
e       4       0100
f       4       0101
i       4       1000
g       5       10010
h       5       10011

Note that we'll *not* need those codes, so we don't have to compute them, only the code lengths. Also note that in the following process we don't need to keep track of the actual code length. (I assume that you have somewhere else in memory an array with the symbols and their lengths, and that you've probably reserved space for the codes.) Before starting this loop we have to set up an array which contains the number of codes of each length; I don't think you need to learn how this is done E-) . For our example it will be:   #codes[16]={0,2,2,3,2,0,0,0,0,0,0,0,0,0,0,0}  We have 16 entries coz we assume our codes to be at most 16 bits long. Let's start to make the start_codes:
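
Here is a minimal C sketch of that start-code computation (the names are mine; codes_of_len[] is the #codes array above, indexed by length, and no code is assumed to be longer than 16 bits):

    /* start_code[len] is a len-bit value. The longest codes start
       at 0; each shorter length starts right after the codes of the
       length below it, shifted into its own bit width. */
    void make_start_codes(const unsigned codes_of_len[17],
                          unsigned start_code[17])
    {
        int len;
        start_code[16] = 0;
        for (len = 15; len >= 1; len--)
            start_code[len] = (start_code[len + 1]
                               + codes_of_len[len + 1]) >> 1;
    }

For our example this gives start_code[5]=0, start_code[4]=1, start_code[3]=2 and start_code[2]=2, which written in binary are exactly the start codes listed below.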

Remember that we don't need to keep track of the length of the current code. In this case we don't care about what the start code for length 1 is (we'll not use it). Now we have these start codes:
  • start_code[5]=00000
  • start_code[4]=0001
  • start_code[3]=010
  • start_code[2]=10
  • start_code[1]=0

And these are the symbols and lengths:
     
Symbol  Length
a       2
b       2
c       3
d       3
e       4
f       4
i       4
g       5
h       5
    Then we sort the symbols and lengths, and visit them in alphabetical order and start to assign codes:
     
Symbol  Length  Start code  Code   Incremented
a       2       10          10     11
b       2       11          11     00
c       3       010         010    011
d       3       011         011    100
e       4       0001        0001   0010
f       4       0010        0010   0011
g       5       00000       00000  00001
h       5       00001       00001  00010
i       4       0011        0011   0100
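
In C, that assignment loop might look like this (a sketch; len[] holds the length of each symbol in alphabetical order, code[] receives the codes, and start_code[] comes from the sketch above):

    /* Each symbol takes the current start code of its length, which
       is then incremented; the increment is what meets the second
       rule. */
    void assign_codes(int nsyms, const unsigned char len[],
                      unsigned short code[], unsigned start_code[17])
    {
        int i;
        for (i = 0; i < nsyms; i++)
            code[i] = (unsigned short)start_code[len[i]]++;
    }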

And those codes are instantaneous (prefix) codes, so they can be decoded with no problem. Also, both the first rule and the second are met: the second explicitly, and for the first, look:
     
Symbol  Length  Code   Value (left-justified in 5 bits)
a       2       10     16
b       2       11     24
c       3       010    8
d       3       011    12
e       4       0001   2
f       4       0010   4
g       5       00000  0
h       5       00001  1
i       4       0011   6

    Encoder memory usage
Remember the first rule and what it ensured: no more than ceil(log2(alphabetsize)) bits of a code may be 1. If we are encoding an alphabet with 256 symbols (probably literals), then no more than 8 bits can be 1. So if our codes are 16 bits long, the high 8 bits will always be 0:  00000000XXXXXXXX  where X means the bit can be 0 or 1. So when saving a code in memory we just need to save the low part of it, coz we know that the high part will always be 0. And when we have to write a code (with a fast putbits) we check if its length is above 8; if it is, then we write the low part, and for the high part just zeros. We can even make a special putbits which only shifts but doesn't OR (coz we only write zeros). Before, our data structures looked something like that:

  • Byte, length of the code
  • Word, code

Thus making it not fit in word boundaries, so we usually had to make it fit in dword boundaries with something like that:

  • Byte, length of the code
  • Byte, dummy
  • Word, code

So accessing it was just a matter of a shift left by 2. But now we know that we only need to store the low part of the code in memory, so we can save it this way:

  • Byte, length
  • Byte, low part of code

Thus making it fit in word boundaries and making it easy to access; also we can load a whole entry at once, and using less memory reduces cache misses.
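
Here is a minimal C sketch of that layout and of the write path (putbits is the routine the article mentions, but its signature here, and the put_zero_bits helper, are assumptions of mine):

    /* Two-byte entry, assuming a 256-symbol alphabet, so only the
       low 8 bits of a code can ever be 1. */
    struct entry {
        unsigned char len;   /* code length, up to 16 bits      */
        unsigned char low;   /* low 8 bits of the code          */
    };                       /* 2 bytes: fits word boundaries   */

    void putbits(unsigned value, int nbits);  /* assumed elsewhere  */
    void put_zero_bits(int nbits);            /* hypothetical: only
                                                 shifts, never ORs  */

    void write_code(const struct entry *e)
    {
        if (e->len > 8) {
            put_zero_bits(e->len - 8);  /* high part: all zeros  */
            putbits(e->low, 8);         /* low part: stored byte */
        } else {
            putbits(e->low, e->len);
        }
    }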
     

    Saving the header
Well, this is the saving of canonical Huffman: we don't need to store the codes. Of course we still need to pass the lengths and symbols to the decoder. It can be done in the following way: first there are 16 bytes, which hold the number of symbols that each code length has. After that come the symbols, starting with those for code length 1, then 2... each group in alphabetical order. It's quite clear how to decode this, isn't it? Of course, once you have the lengths and symbols you make the codes again and you can start to decode.
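
As a sketch, writing that header in C might look like this (fp, codes_of_len[] and syms[] are my names; syms[] holds the symbols grouped by length and sorted alphabetically inside each group, as described):

    #include <stdio.h>

    void write_header(FILE *fp, const unsigned char codes_of_len[17],
                      const unsigned char syms[], int nsyms)
    {
        int len;
        for (len = 1; len <= 16; len++)   /* the 16 count bytes */
            fputc(codes_of_len[len], fp);
        fwrite(syms, 1, nsyms, fp);       /* then the symbols   */
    }

The decoder reads the 16 counts, then the symbols, rebuilds each symbol's length from the counts, and re-runs the start-code and assignment steps shown earlier to get the codes back.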
     

    Closing words
The fact that long codes are filled with zeros may lead to fast decoding; have a look at: http://www.compressconsult.com/huffman/
You can also find another paper about Huffman decoding at:
http://www.internz.com/compression-pointers.html (look for pp600.ps)

If you find any error in this article or want to improve something, email me.
     

    Contacting the author
You can reach me via email at: arturo@arturocampos.com. Also don't forget to visit my home page, http://www.arturocampos.com, where you can find more and better info. See you in the next article!
     
     

    Arturo San Emeterio Campos, Barcelona 23-7-1999


This article comes from Arturo Campos' home page at http://www.arturocampos.com. Visit again soon for new and updated compression articles and software.