Compression:
Introduction to the Huffman algorithm

By Bluefish@11a.nu.

Abstract

The Huffman algorithm is based on the idea that a data stream to be compressed contains certain characters with a high frequency, while others are less common. A common character is assigned a short path in a tree, while an uncommon character is assigned a longer path.

Similarities between Morse code and Huffman

An interesting fact is that the widely known and used Morse code is based on ideas very similar to the Huffman algorithm: common letters such as "E" (a single dot) get short codes, while rare letters get long ones. (This paragraph is given to aid understanding of Huffman if you are already familiar with Morse; if you aren't, simply skip to the next section.)

Example of compression using a Huffman tree

Consider the message "ABBACCAAAAA". We find that we have seven "A", two "B" and two "C". As this message is so simple, we can design a suitable tree easily:

  +
 / \
A   +
   / \
  B   C
Thus the path to the leaf "A" is "0", to "B" is "10" and to "C" is "11" (I'm using "0" for left, "1" for right). The eleven-character message "ABBACCAAAAA" can therefore be represented using the tree as "010100111100000", which is only 15 bits. If each character was previously stored in 8 bits (88 bits in total), this example saves about 83% of the space!
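The encoding step above can be sketched in a few lines of Python. This is only an illustration: the codes table is simply the one read off the tree in the text, not computed by any algorithm yet.

```python
# Codes read off the tree above: "0" = left, "1" = right.
codes = {"A": "0", "B": "10", "C": "11"}

message = "ABBACCAAAAA"
encoded = "".join(codes[ch] for ch in message)

print(encoded)            # -> 010100111100000
print(len(encoded))       # -> 15 (bits)
print(len(message) * 8)   # -> 88 (bits at 8 bits per character)
```

Decoding works the same way in reverse: walk the tree bit by bit, and emit a character every time you reach a leaf.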

Why do we need the Huffman algorithm?

As seen in the example, using a tree to compress known data is quite simple. And, for small trees, you can find a good one simply by guessing. However, what happens if you are, for example, compressing a Chinese book in which thousands of different characters are used? Obviously, an algorithmic solution for finding a good tree is needed. David A. Huffman developed a rather simple algorithm for this purpose.

The Huffman Algorithm

In the algorithm, Huffman defines the concept of weight. A leaf has a weight equal to the number of occurrences of its character in the data to be compressed (thus, if there are 7 "A", the leaf "A" has a weight of 7). For a tree, the weight is the sum of all its leaves' weights. The Huffman algorithm can then be described in the following way:

  while not all trees are connected {
    take the tree with the least weight,
    and combine it with the tree with the second least weight.
  }

Example #1 of Huffman algorithm for tree-generation

3A  4B  11C  23D  37E
We begin with A (weight 3), B (weight 4), and so on...
  7     11C  23D  37E
 / \
A   B
We then connect the two trees with least weights (3 and 4).
    18       23D  37E
   /  \
  7    C
 / \
A   B
We then connect the two trees with least weights (7 and 11).
       41         37E
      /  \
    18    D
   /  \
  7    C
 / \
A   B
We then connect the two trees with least weights (18 and 23).
        78
       /  \
      41   E
     /  \
    18   D
   /  \
  7    C
 / \
A   B
We then connect the two trees with the least weights (37 and 41). Tada! The tree is now finished!
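Reading the codes back off a finished tree is a simple walk: "0" for left, "1" for right, just as in the compression example earlier. A sketch, representing an inner node as a (left, right) pair and a leaf as its character (a representation I chose for brevity; read_codes is my own helper name):

```python
def read_codes(tree, prefix=""):
    """Walk a nested-tuple tree ("0" = left, "1" = right) and
    collect the path to every leaf character."""
    if isinstance(tree, str):          # a leaf: just a character
        return {tree: prefix}
    left, right = tree
    codes = {}
    codes.update(read_codes(left, prefix + "0"))
    codes.update(read_codes(right, prefix + "1"))
    return codes

# The finished tree from Example #1, transcribed as nested tuples:
tree = (((("A", "B"), "C"), "D"), "E")
print(read_codes(tree))
```

For Example #1 this gives A = "0000", B = "0001", C = "001", D = "01" and E = "1": the heaviest character gets the shortest path, exactly as intended.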

Example #2 of Huffman algorithm for tree-generation

3A  4B  11C  16D  17E
We begin with A (weight 3), B (weight 4), and so on...
  7     11C  16D  17E
 / \
A   B
We then connect the two trees with least weights (3 and 4).
    18       16D  17E
   /  \
  7    C
 / \
A   B
We then connect the two trees with least weights (7 and 11).
    18      33
   /  \    /  \
  7    C  D    E
 / \
A   B
We then connect the two trees with least weights (16 and 17).
        51 
      /    \
    18      33
   /  \    /  \
  7    C  D    E
 / \
A   B
We then connect the two trees with the least weights (18 and 33). Tada! The tree is now finished!
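Both examples run the same loop; they differ only in the third step, where Example #2 merges D and E (16 and 17) instead of growing the left-hand tree further. A small sketch that traces which weights get merged at each step makes this visible (the helper name merge_order is mine):

```python
import heapq
import itertools

def merge_order(weights):
    """Return the pair of weights merged at each step of the
    Huffman loop, lightest two first."""
    counter = itertools.count()  # tie-breaker for equal weights
    heap = [(w, next(counter)) for w in weights]
    heapq.heapify(heap)
    steps = []
    while len(heap) > 1:
        w1, _ = heapq.heappop(heap)
        w2, _ = heapq.heappop(heap)
        steps.append((w1, w2))
        heapq.heappush(heap, (w1 + w2, next(counter)))
    return steps

print(merge_order([3, 4, 11, 23, 37]))  # Example #1: [(3, 4), (7, 11), (18, 23), (37, 41)]
print(merge_order([3, 4, 11, 16, 17]))  # Example #2: [(3, 4), (7, 11), (16, 17), (18, 33)]
```

The traced pairs match the steps narrated in both examples above.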