"Canonical Huffman"
by
Arturo San Emeterio Campos







Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 


Table of contents

  • Introduction
  • Canonical Huffman
  • Pseudo code
  • Encoder memory usage
  • Saving the header
  • Closing words
  • Contacting the author


    Introduction
This article assumes knowledge of at least static Huffman coding. It will teach you how to save a little bit of bpb (bits per byte), and it will also lead to lower memory requirements and faster decoding.
     

    Canonical Huffman
Canonical Huffman defines a set of rules and then, based only on the lengths of the codes and the symbols, it makes the codes. So you don't need to save the codes along with the bitstream anymore, with the consequent savings in compression terms. What are those rules which make the codes? They are the following:

1. Codes with a shorter code length have higher values than longer ones.
2. Codes with the same length increase in code value as the symbol value increases.
Those rules make for efficient decoding, and even efficient coding, coz no more than ceil(log2(alphabetsize)) bits of a code may be 1. The way this formula is interpreted is the following: ceil() means rounding up to the nearest integer, and log2 is the logarithm in base 2. If you can't perform log2 directly, you can do:   log2(x) = log10(x) / log10(2)   Most calculators only support log base 10, so it may be useful. Alphabetsize is the number of symbols which you'll code; for an alphabet of 256 symbols, ceil(log2(256)) = 8.
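
If you prefer to see that formula as code, here is a minimal C sketch (the function name is mine, just for illustration):

    #include <math.h>

    /* Upper bound on the number of bits that may be 1 in any canonical
       code, for an alphabet of 'alphabetsize' symbols. The small
       epsilon guards against floating-point rounding at exact powers
       of two. For 256 symbols this returns 8. */
    int max_one_bits(int alphabetsize)
    {
        return (int)ceil(log10((double)alphabetsize) / log10(2.0) - 1e-9);
    }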
     

    Pseudo code
The first things we need are the symbols and their code lengths, computed via Huffman; then you can start to make the codes. The pseudo code to do that isn't that difficult. Look:

Sort all the lengths and symbols alphabetically. Then go thru them: assign to the current code the start code value of its length, then increment that start code value and repeat. (Note that by incrementing the code value you are meeting the second rule.)

Want an example? Let's say we have a file whose Huffman codes and code lengths are the following:
     
Symbol  Length  Code
a       2       11
b       2       00
c       3       011
d       3       101
e       4       0100
f       4       0101
i       4       1000
g       5       10010
h       5       10011

Note that we'll *not* need those codes, so we don't have to compute them, only the code lengths. Also note that in the following process we don't need to keep track of the actual code length. (I assume that you have somewhere else in memory an array with the symbols and their lengths, and that you've probably reserved space for the codes.) Before starting this loop we have to set up an array which contains the number of codes of each length; I don't think you need to learn how this is done E-) . For our example it will be:   #codes[16]={0,2,2,3,2,0,0,0,0,0,0,0,0,0,0,0}  We have 16 entries coz we assume our codes to be at most 16 bits long. Let's start to make the start_codes:
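
Here is a minimal C sketch of that start-code computation (the names are mine; codes_of_len[] is the #codes array above, indexed by length, and no code is assumed to be longer than 16 bits):

    /* start_code[len] is a len-bit value. The longest codes start
       at 0; each shorter length starts right after the codes of the
       length below it, shifted into its own bit width. */
    void make_start_codes(const unsigned codes_of_len[17],
                          unsigned start_code[17])
    {
        int len;
        start_code[16] = 0;
        for (len = 15; len >= 1; len--)
            start_code[len] = (start_code[len + 1]
                               + codes_of_len[len + 1]) >> 1;
    }

For our example this gives start_code[5]=0, start_code[4]=1, start_code[3]=2 and start_code[2]=2, which written in binary are exactly the start codes listed below.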

Remember that we don't need to keep track of the length of the current code. In this case we don't care about what the start code for length 1 is (we'll not use it). Now we have these start codes:
  • start_code[5]=00000
  • start_code[4]=0001
  • start_code[3]=010
  • start_code[2]=10
  • start_code[1]=0

And these are the symbols and lengths:
     
Symbol  Length
a       2
b       2
c       3
d       3
e       4
f       4
i       4
g       5
h       5
    Then we sort the symbols and lengths, and visit them in alphabetical order and start to assign codes:
     
Symbol  Length  Start code  Code   Incremented
a       2       10          10     11
b       2       11          11     00
c       3       010         010    011
d       3       011         011    100
e       4       0001        0001   0010
f       4       0010        0010   0011
g       5       00000       00000  00001
h       5       00001       00001  00010
i       4       0011        0011   0100
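
In C, that assignment loop might look like this (a sketch; len[] holds the length of each symbol in alphabetical order, code[] receives the codes, and start_code[] comes from the sketch above):

    /* Each symbol takes the current start code of its length, which
       is then incremented; the increment is what meets the second
       rule. */
    void assign_codes(int nsyms, const unsigned char len[],
                      unsigned short code[], unsigned start_code[17])
    {
        int i;
        for (i = 0; i < nsyms; i++)
            code[i] = (unsigned short)start_code[len[i]]++;
    }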

And those codes are instantaneous (prefix) codes, so they can be decoded with no problem. Also, both the first rule and the second are met: the second explicitly, and for the first, look:
     
Symbol  Length  Code   Value (left-justified in 5 bits)
a       2       10     16
b       2       11     24
c       3       010    8
d       3       011    12
e       4       0001   2
f       4       0010   4
g       5       00000  0
h       5       00001  1
i       4       0011   6

    Encoder memory usage
Remember the first rule and what it ensured: no more than ceil(log2(alphabetsize)) bits of a code may be 1. If we are encoding an alphabet with 256 symbols (probably literals), then no more than 8 bits can be 1. So if our codes are 16 bits long, the high 8 bits will always be 0:  00000000XXXXXXXX  where X means the bit can be 0 or 1. So when saving a code in memory we just need to save the low part of it, coz we know that the high part will always be 0. And when we have to write a code (with a fast putbits) we check if its length is above 8; if it is, then we write the low part, and for the high part just zeros. We can even make a special putbits which only shifts but doesn't OR (coz we only write zeros). Before, our data structures looked something like that:

  • Byte, length of the code
  • Word, code

Thus making it not fit in word boundaries, so we usually had to make it fit in dword boundaries with something like that:

  • Byte, length of the code
  • Byte, dummy
  • Word, code

So accessing it was just a matter of a shift left by 2. But now we know that we only need to store the low part of the code in memory, so we can save it this way:

  • Byte, length
  • Byte, low part of code

Thus making it fit in word boundaries and making it easy to access; also we can load a whole entry at once, and using less memory reduces cache misses.
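
Here is a minimal C sketch of that layout and of the write path (putbits is the routine the article mentions, but its signature here, and the put_zero_bits helper, are assumptions of mine):

    /* Two-byte entry, assuming a 256-symbol alphabet, so only the
       low 8 bits of a code can ever be 1. */
    struct entry {
        unsigned char len;   /* code length, up to 16 bits      */
        unsigned char low;   /* low 8 bits of the code          */
    };                       /* 2 bytes: fits word boundaries   */

    void putbits(unsigned value, int nbits);  /* assumed elsewhere  */
    void put_zero_bits(int nbits);            /* hypothetical: only
                                                 shifts, never ORs  */

    void write_code(const struct entry *e)
    {
        if (e->len > 8) {
            put_zero_bits(e->len - 8);  /* high part: all zeros  */
            putbits(e->low, 8);         /* low part: stored byte */
        } else {
            putbits(e->low, e->len);
        }
    }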
     

    Saving the header
Well, this is the saving of canonical Huffman: we don't need to store the codes. Of course we still need to pass the lengths and symbols to the decoder. It can be done in the following way: first there are 16 bytes, which hold the number of symbols that each code length has. After that come the symbols, starting with those for code length 1, then 2... each group in alphabetical order. It's quite clear how to decode this, isn't it? Of course, once you have the lengths and symbols you make the codes again and you can start to decode.
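
As a sketch, writing that header in C might look like this (fp, codes_of_len[] and syms[] are my names; syms[] holds the symbols grouped by length and sorted alphabetically inside each group, as described):

    #include <stdio.h>

    void write_header(FILE *fp, const unsigned char codes_of_len[17],
                      const unsigned char syms[], int nsyms)
    {
        int len;
        for (len = 1; len <= 16; len++)   /* the 16 count bytes */
            fputc(codes_of_len[len], fp);
        fwrite(syms, 1, nsyms, fp);       /* then the symbols   */
    }

The decoder reads the 16 counts, then the symbols, rebuilds each symbol's length from the counts, and re-runs the start-code and assignment steps shown earlier to get the codes back.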
     

    Closing words
The fact that long codes are filled with zeros may lead to fast decoding; have a look at: http://www.compressconsult.com/huffman/
You can also find another paper about Huffman decoding at:
http://www.internz.com/compression-pointers.html (look for pp600.ps)

If you find any error in this article or want to improve something, email me.
     

    Contacting the author
You can reach me via email at: arturo@arturocampos.com. Also don't forget to visit my home page, http://www.arturocampos.com, where you can find more and better info. See you in the next article!
     
     

    Arturo San Emeterio Campos, Barcelona 23-7-1999


This article comes from Arturo Campos' home page at http://www.arturocampos.com. Visit again soon for new and updated compression articles and software.