Dp: Lzw, gif decoding

Lzw, gif decoding"
by
Arturo San Emeterio Campos

Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.

Download
Download the whole article zipped.

Table of contents

Putting everything together

Improvements

Closing words

Contacting the author

Introduction
Today let's learn a very common scheme: lzw. We'll use it for decoding a simple gif. As you can see I'll teach you only how to decode a simply gif, not animated not interleaved, nor such things ;-) Just the basics, and if you want more sure you'll easily learn it with all you've learn there. I'll teach you how to decode 16 colors and 256 colors gifs. You can ask: And what about 2 colors? And I'll answer, who cares about 2 colors images? ;-) Our loader will successfully load simple gifs, but whit animated gifs or two colors gif it will crash, however a simple gif loader suits my needs and probably yours too.

Theory
Let's talk a little bit about compression history. In 1977 Jacob Lempel and Abraham Ziv presented their dictionary based scheme Lz77, this leaded up to Lzss presented by James Storer and Thomas Szymanski 1982 (Hey! this is the year when I was born! ;-) And this is the scheme that today we use. (pkzip or arj) Jacob and Abraham then thought more and the arrived to Lz78, of course it was presented in 1978. This was also a dictionary based scheme, but it always outputted codewords. And then Terry Welch based in Lz78 presented, in 1984, Lzw. First it was made for implementation in hardware for high performance disk controllers and then Compuserve used it in the famous Gif, Graphics Interchange Format. (sm) Today it's very common on Inet, because as everybody knows it compressed better than jpeg for images like cartoons or drawings with computer that have a lot of repetitions, or a colour which is the background repeated a lot.

Lzw decoding
Lzw is a dictionary based scheme, remember Lz77? (if you don't then learn it, there's an article in my home page) it outputted bytes and codewords: pairs of offsets and lengths. Lzw don't, lzw outputs always Codewords, they refer to a dictionary that has 4096 entries. At the start it's filled with the possible bytes (if 8 bits image, 0-255) then it starts to fill the dictionary, every time it gets the last used Codeword, appends to it the first byte of the current codeword, and then it puts it in the dictionary. Examples of Lzw encoding will not be presented there. (If you want you can find some links at my home page) Decoding is also easy: every time you get a codeword output the string that it represents and then append the last string with the byte of the string you just outputted, so you form a new codeword. But there's a little exception (easy to handle) there's sometimes when the encoder sends you a codeword that isn't in the dictionary at all, then you just have to form this codeword, how? easy, get the last codeword, and his first byte, append them, put this codeword in the dictionary and output it. The first entry available is, in the case we have two elements, the third. So here it's the pseudo code, as you can imagine the first codeword must be handled different because then we yet don't have a last Code Word.

Last_Cw = get codeword
Output the string of Last_Cw
Loop
Current_Cw = get codeword
It's in the dictionary?
Yes

Output the string of Current_Cw
Form a new entry, with the string of Last_Cw + the first byte of Current_Cw.
Last_Cw = Current_Cw. Repeat.

Form a new entry, with the string of Last_Cw + the first byte of Last_Cw.
Output the string of Current_Cw.
Last_Cw = Current_Cw. Repeat.

Ok? This is all the decoding routine for Lzw, now let's put an example: String: 0101110

Code	Output	New Cw	Last string	Current byte
0	0	...	...	...
1	1	01(3)	0	1
3	01	10(4)	1	0
1	1	011(5)	01	1
4	10	11(6)	1	1

And let's say we get a codeword that isn't in the dictionary: String: 1110

Code Output New Cw

1 1 ...

3 11 11(3) <- We form it

0 0 110(4)

Ok? when we get the Cw 3 we yet haven't written it, so we get the last string "1" and append to it his first byte, '1' so we form the codeword with the string "11" and we can output it. Why this happens? as in all the compression schemes the decoder always is one step behind the encoder, the encoder is who says what to do, in that case that's because when the encoder outputs the Code Word 3 it has already putted in his table this Code Word.

Gif decoding
The gif implementation of lzw uses some especial code (zip implementation has some other codes, like one which tells when to increment the number of the bits for the codes) Those are:

The Clear Code. It tells when to restart the WHOLE process.
The End Of Information. It tells when there's no more codes to read.

Their values? First we must know the bpp, (Bits Per Pixel) i.e.: in a 16 colors gif the bpp is 4 (a pixel is represented with 4 bits). Once we know where the first available entry is (in this case 16 (10000b)) then the clear code is: 2^bpp and EOI is 2^bpp+1. So the first available entry for the decoder is 2^bpp+2. The Code words have a variable length, it depends on the maximum code word that there's in your dictionary, after you write a code that can't be hold with this number of bits, you increment the number of bits. I.e.: Let's say our codes are 5 bits long, then once we write the entry 31 (2^5=32) we know that may be the next code is 32 so we need to get 6 bits. The order of pixels is the following: left to right and top to bottom.

The dictionary
The dictionary in gif's lzw implementation has 4096 Code words, the maximum length of bits is 12 (2^12=4096). Each entry must hold any infinite number of bytes in a whole string. But how can we do it? The string of every entry may have any length possible, (we'll restrict this to 2^32: 4gb, so we'll have no problem at all ;-) and because we build them, they aren't pointers to the input file, so we must have a big buffer when we put our strings, we'll allocate memory for it only once. So the structure of our table will be:

Name Length Meaning

Pointer 4 It will point to the string in the buffer.

Length 4 The length of this string.

Once you start you must init the code words for the possible bytes, if our images has 16 colors, then we have to initialize the first code word=0, the second=1, the third=2 till the sixteenth. The pseudo code looks like that:

Pointer = start of buffer
Loop, while counter isn't 2^bpp

Buffer[pointer] = counter
Codeword.offset[counter] = pointer
Codeword.lenght[counter] = 1 (only one byte)
++pointer, ++counter. Repeat.

The entries [2^bpp] and [2^bpp+1] are the CC and the EOI, we'll skip and forget them. Don't worry, we'll never try to read them.

Now let's talk about the length of the buffer which holds the strings. I've did some test with gif images and they tell that the number of bytes used is common between 20k-40k, but because we should now the limit. I tried it with some pure noise files and scanned images, the maximum number of bytes that any gif used was about 182k. If we don't want to have any problem we could set this buffer to something like 300k. This will be enough. Remember: you should keep track of the position in the buffer when you can write strings, so when you write an string you increment the pointer and then save it. (we don't write to overwrite other strings)

Clear Code
The Clear code is 2^bpp. For 16 color images it's equal to 16. For 255 color images is 256. How do you calculate this? easy with the fpu, look that:

fild      gif_bpp
fld1
fscale
fistp    gif_colors
fstp    dummy_dd ;the Bits Per Pixel

;the result
;discard st

I'll not explain this because I've already written some articles that you can download from my home page. When you get the Clear Code you must reinit everything but the pointer to the decompressed image. The length of the codes is set to the minimum, the table is reinitialized, the pointer to the strings buffer must be set to the start of it. As you remember the first code that you get must be handled in a different way because you don't have last code. So I recommend getting and outputting the first code when you get the Clear Code.

Increment the length
This must be done when you write a new Code Word. What I've did was the following: Calculating the maximum value that the current value of bits can hold, then every time you write a Code Word you check if it's the maximum, if it's then increment the length of the bits for code, and also calculate a new maximum value. For example, if you are currently getting 9 bits, then once you put in your table the code 511 you must increase the number of bits. Remember that you should NOT increment the length if you put the Code Word 4095, because the maximum length of bits is 12, in that case, you don't allocate new strings, nor increment the length, but you don't have to reinit the process, you should do it only when you get clear code.

Gif format
I'll only explain the minimum, so you can uncompress a gif, but nothing else, just the basics. This info is extracted from the File Formats Enciclopedia, it can also be found in the specification of gif. I also explain what every (needed) byte means. But first let's explain how a (simple) gif is divided. It first has the global header (the header), then the pal (global color map), and then the Image descriptor, and the end Id: ';' (03Bh) An animated gif will be the same, but just after the first image the second, and this till you find the end Id.

Global header
This is the main header:

Global Header

Offset Length Kind Description

0000h 6 char ID (magic number)

0006h 1 word Image width

0008h 1 word Image height

000Ah 1 byte Bit mapped

000Bh 1 byte Color index of screen background

000Ch 1 byte Reserved

Explanation:

ID
Width
Height
Bit

Index
Reserved
It may be: 'GIF87a' 'GIF89a' This is the version number, I work with the specification of '89a' however it's very different than '87a' if you only care about decompressing one image.
Erh.. ;-) In an animated gif this is like a big screen when the rest of the frames are drew on,   but for us it's the image width.
Look above

It contains some information about the image, in bits:

   0-2 -> Bpp-1 (so if it's 3 we have 4 bpp)
   7 -> if one we have a global color map.
The background of the image, useful for animated gifs, but not for us.

That's the pixel aspect ratio, just don't worry about it. (in '89a')

Global color map
Also know as pal E-) It's 3*(2^bpp) bytes. It holds the RGB values of every indexed colour. I.e.: For a 16 colors image it's: 3*(2^4) = 48 bytes. Once you load this pal you can load directly to the screen. (I'll suppose you know how to do that ;-) Of course first there's the R value, then G B, and so on. Also note that the values are shifted to the left two times, so after putting any pal value in mode 13h (what I've used OE-) you should 'shr' it two times. Read it only if it the bit7 was 1. In our simple gifs, this pal will be always present.

Image descriptor
This is the image descriptor:

Image descriptor

Offset Length Kind Description

0000h 1 char ID

0001h 1 word Left offset of image

0003h 1 word Upper offset of image

0005h 1 word Width of image

0007h 1 word Height of image

0009h 1 byte Palette description - bitmapped

Explanation:

ID
...
Byte
This is the identification of the block: ',' (02Ch, 0x02C)

The rest four words explain where the image in the global 'window' should be drawed.

The last byte is just for animated gifs, we don't need it. Just do something with it: the Msb (the bit 7) tells if the following image has is own pal. Check it, if it's 0 exit with error. (this will avoid crashes with animated gifs E-)

As you can see we can skip this block as long as the first byte is ','. Just after the Image descriptor there's the image data (unless there's a local pal!) The first byte that find here is the (minimum bits - 1) needed for decoding the image. (Save this because you'll need it when reiniting the whole process) You get this an save it as 'gif_min_bits' or something like that. Then here it comes the Data sub-blocks. I just think this is really STUPID. It doesn't help for error recovering at all. It just adds 1 byte for every 255, I don't know why it's used. But here it is the explanation: Once you get the minimum bits (1 byte) you start a loop. Read a byte, if it's 0 then break loop. If it's not, then read to the getbits buffer the value specified and repeat. Note that three bytes saved in this way are: 03h XXh XXh XXh 00h Also note that this process always ends when (and only when) you read the value 0.

Getbits
For the getbits you must have a buffer with the image data, wich you've previously readed (with the method explained above). The bit order? From right to left. Right part is right part of the code. E-) I.e.: You have the following 5 bits codes: '54321' 'A9876' 'FEDCB' then the ouput is:

First byte MSB '87654321' LSB

Second byte MSB 'XFEDCBA9' LSB (X doesn't matter at all)

Doing a routine for getting those code shouldn't be very difficult. The only thing you have to do is keep track of how many bits you've writed to the output byte, and rotate the new bit as many positions as bits you've drawed. I.e.: Read the first code of the above example:

- Get bit 1 put in position 0. Output byte: '1'
- Get bit 2 put in position 1. Output byte: '21'
- Get bit 3 put in position 2. Output byte: '321'
- ...

Ok? The way you may get the bits is trough the carry (with shifts) for the shifting based on the position a 'shl al,cl' with cl as the counter of the bits read, will be ok. This is the Getbits routine, but you should do a routine that gets the code. It'll draw as may bits as specified in a variable (gif_cur_bits, the number of bits which currently a Code Word has) and then return the code. (this may be the variable that keeps track of how many bits are need for the Code Words)

Putting everything together
The first code that you'll always get is the CC, so I recommend you starting getting a code. Because it will be CC you'll reinit the table and get the first code (as last) and then you can start the loop.

Get bpp, and read the buffer of the getbits
Precalculate CC, Eoi and the next available Code Word
Get codeword
Call reinit_process
Loop

Current_Cw = get codeword
It's CC?
Yes

Call reinit_process. Repeat.

It's Eoi?
Yes

Break loop

It's in the dictionary? (value not equal to the next Cw available)
Yes

Output the string of Current_Cw
Form a new entry, with the string of Last_Cw + the first byte of Current_Cw.
Last_Cw = Current_Cw. Repeat.

Form a new entry, with the string of Last_Cw + the first byte of Last_Cw.
Output the string of Current_Cw
Last_Cw = Current_Cw. Repeat.

Display uncompressed buffer E-)

Look at the part when we see if it's in the dictionary. What you should do is the following: keep track of what's the next code to write. This has a double purpose: Putting the next code word in the first available entry. And knowing if the encoder has outputted a value that still isn't in the dictionary. I.e.: If you have written 4 codewords then the value should be 5. So if you get a code word and is 5 you know that it isn't in the dictionary. (in this case no enconder will output a value above 5) Note that in this process you could have an error detection routine. If the value is above than the next available Code Word, then this means that the data is corrupted. The code of the reinit_process looks like that:

Reinit table
Reinit pointer to strings buffer
Bits to get = minimum of bits to get a code
Get a code and put it as last

The process of putting a new Code Word in the dictionary is the following:

Table.pointer[current_cw] = pointer to string table
Table.size[current_cw] = table.size[last_cw] + 1
String_table[pointer] = string of last_cw
String_table[pointer] = first byte from current_cw
Increment the value of the next available Code Word
have we write the maximum Cw?
Yes

Have we write Cw 4095?
Yes

Do nothing

Increment bits to get a Code Word

Nothing else, just put everything together in the language that you prefer, and you'll have a gif loader.

Improvements
The only improvements that you can achieve with a (lossless) decompressor is more speed. How to speed it up? first we should start doing a fast getbits, (see my article about optimizing the putbits) this will help, but the most important speed ups which can be done, are those of copying the phrases. May be you could use the fpu, or the Mmx, or copying at least 4 bytes at a time.

Closing words
We've learned a new scheme and a very used one, Lzw. Now we can try to handle Bw images, or animated gifs, or gifs with any possible resolution. Also if you feel you want, you can do a text compressor (lossless) with Lzw scheme. But remember that Lzw is patented, and you can only use it freely as long as your program isn't commercial. (Demo, intro or whatever) Hope you've learned it without problems, but if you had any, let me know.

Contacting the author
You can reach me via email at: arturo@arturocampos.com Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!

Arturo San Emeterio Campos, Barcelona 23-07-1999

This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.

Name	Length	Meaning
Pointer	4	It will point to the string in the buffer.
Length	4	The length of this string.

Global Header
Offset	Length	Kind	Description
0000h	6	char	ID (magic number)
0006h	1	word	Image width
0008h	1	word	Image height
000Ah	1	byte	Bit mapped
000Bh	1	byte	Color index of screen background
000Ch	1	byte	Reserved