Compression:
Introduction to The Lempel-Ziv algorithm

By Bluefish@11a.nu.

Abstract

The Lempel-Ziv algorithm (or short, LZ) is encodes a string of bits by dividing it into phrases, and then describes the stream as a series of pairs. Each pair forms a new phrase, and contains a number (identifier of a previous phrase) and a bit (which is appended to the previous phrase). This encoding is bulky, but however once applied to a suitable string, it results in an efficient encoding (compression). The algorithm is easily implemented even in low level languages.

Note upon usability & copyright matters

Lempel-Ziv (named after Abraham Lempel and Jacob Ziv) is in the public domain and may be used freely. Be caution however, several variants or applications of the algorithm (such as LZW and LZS) are protected by patents or I.P. rights. This could cause problems, especially in the US.

Example of LZ Encoding

As my example I will use the binary string U. As it is rather short, I do not expect to archive compression. It should be noted that P(0) is the empty string.

Taking a look at U=0010001101, we see that it can also be written as U=0++010001101. Therefore, we have P(1)=0=P(0)++0.

We now continue and write U=0++01++0001101, and get P(2)=01=P(1)++1. We have now described P(2) as a combination of a previous phrase and a new bit! True, it wasn't a very great 'compression', but when working on great strings you may describe thousands of bits with such small combinations!

Next, we write U=0++01++00++01101 and get P(3)=00=P(1)++0. We now notice we have U=0++01++00++011++01 and P(4)=011=P(2)++1. And finally, we have P(5)=01=P(1)++1.

Table 1. Encoding process
Phrase#ValueFormulaU
0-P(0)0010001101
10P(1)=P(0)++00++010001101
201P(2)=P(1)++10++01++0001101
300P(3)=P(1)++00++01++00++01101
4011P(4)=P(2)++10++01++00++011++01
501P(5)=P(1)++10++01++00++011++01

The result of this work is shown in Table 1. Now, once such a table is designed in your implementation, you actually have the entire encoding schematics ready. To create the Lempel-Ziv stream, only take the formula and create the pairs, where each pair will be (A++B) if the formula is P(x)=P(A)++B. Thus P(1)=P(0)++0 becomes (00++0), P(2)=P(1)++0 becomes (01++0). And so on. (notice that 2 (10 binary) is the greatest number we need to represent the other strings. Thats why we only need two binary digits.) Concate all these pairs, and you got the final string! The result is shown in Table 2. Thus, C becomes 000011010101011 which is conciderably longer than U, but given the short length of U it was only to be expected. All formulas containing P(0) aren't compressed at all and cause groth, however that problem is only important in small examples.

Table 2. How to create the encoded string
Formulas P(1)=P(0)++0P(2)=P(1)++1 P(3)=P(1)++0P(4)=P(2)++1 P(5)=P(1)++1
Pairs 00++0 = 00001++1 = 011 01++0 = 01010++1 = 101 01++1 = 011
C 000++011++010++101++011 = 000011010101011

Note that decoding the Lempel-Ziv string C is trivial. You simply grab the pairs and begin reconstructing Table 1. The same way U was picked appart by creating the table, you'll reconstructing it with each phrase you pick from the LZ string.