Lzw, gif decoding"
by
Arturo San Emeterio Campos






Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 

Download
Download the whole article zipped.
 

Table of contents

  • Introduction
  • Theory
  • Lzw decoding
  • Gif decoding
  • The dictionary
  • Clear Code
  • Increment the length
  • Gif format
  • Global Header
  • Global color map
  • Image descriptor
  • Getbits
  • Putting everything together
  • Improvements
  • Closing words
  • Contacting the author

  •  

     
     
     
     
     

    Introduction
    Today let's learn a very common scheme: lzw. We'll use it for decoding a simple gif. As you can see I'll teach you only how to decode a simply gif, not animated not interleaved, nor such things ;-) Just the basics, and if you want more sure you'll easily learn it with all you've learn there. I'll teach you how to decode 16 colors and 256 colors gifs. You can ask: And what about 2 colors? And I'll answer, who cares about 2 colors images? ;-) Our loader will successfully load simple gifs, but whit animated gifs or two colors gif it will crash, however a simple gif loader suits my needs and probably yours too.
     

    Theory
    Let's talk a little bit about compression history. In 1977 Jacob Lempel and Abraham Ziv presented their dictionary based scheme Lz77, this leaded up to Lzss presented by James Storer and Thomas Szymanski 1982 (Hey! this is the year when I was born! ;-) And this is the scheme that today we use. (pkzip or arj) Jacob and Abraham then thought more and the arrived to Lz78, of course it was presented in 1978. This was also a dictionary based scheme, but it always outputted codewords. And then Terry Welch based in Lz78 presented, in 1984, Lzw. First it was made for implementation in hardware for high performance disk controllers and then Compuserve used it in the famous Gif, Graphics Interchange Format. (sm) Today it's very common on Inet, because as everybody knows it compressed better than jpeg for images like cartoons or drawings with computer that have a lot of repetitions, or a colour which is the background repeated a lot.
     

    Lzw decoding
    Lzw is a dictionary based scheme, remember Lz77? (if you don't then learn it, there's an article in my home page) it outputted bytes and codewords: pairs of offsets and lengths. Lzw don't, lzw outputs always Codewords, they refer to a dictionary that has 4096 entries. At the start it's filled with the possible bytes (if 8 bits image, 0-255) then it starts to fill the dictionary, every time it gets the last used Codeword, appends to it the first byte of the current codeword, and then it puts it in the dictionary. Examples of Lzw encoding will not be presented there. (If you want you can find some links at my home page) Decoding is also easy: every time you get a codeword output the string that it represents and then append the last string with the byte of the string you just outputted, so you form a new codeword. But there's a little exception (easy to handle) there's sometimes when the encoder sends you a codeword that isn't in the dictionary at all, then you just have to form this codeword, how? easy, get the last codeword, and his first byte, append them, put this codeword in the dictionary and output it. The first entry available is, in the case we have two elements, the third. So here it's the pseudo code, as you can imagine the first codeword must be handled different because then we yet don't have a last Code Word.

    Ok? This is all the decoding routine for Lzw, now let's put an example: String:  0101110
     
    Code  Output  New Cw  Last string  Current byte 
    0
    0
    ...
    ...
    ...
    1
    1
    01(3)
    0
    1
    3
    01
    10(4)
    1
    0
    1
    1
    011(5)
    01
    1
    4
    10
    11(6)
    1
    1

    And let's say we get a codeword that isn't in the dictionary: String:  1110
     
    Code  Output  New Cw 
    1
    1
    ...
    3
    11
    11(3)
    <- We form it 
    0
    0
    110(4)

    Ok? when we get the Cw 3 we yet haven't written it, so we get the last string "1" and append to it his first byte, '1' so we form the codeword with the string "11" and we can output it. Why this happens? as in all the compression schemes the decoder always is one step behind the encoder, the encoder is who says what to do, in that case that's because when the encoder outputs the Code Word 3 it has already putted in his table this Code Word.
     

    Gif decoding
    The gif implementation of lzw uses some especial code (zip implementation has some other codes, like one which tells when to increment the number of the bits for the codes) Those are:

    1. The Clear Code. It tells when to restart the WHOLE process.
    2. The End Of Information. It tells when there's no more codes to read.
    Their values? First we must know the bpp, (Bits Per Pixel) i.e.: in a 16 colors gif  the bpp is 4 (a pixel is represented with 4 bits). Once we know where the first available entry is (in this case 16 (10000b)) then the clear code is: 2^bpp and EOI is 2^bpp+1. So the first available entry for the decoder is 2^bpp+2. The Code words have a variable length, it depends on the maximum code word that there's in your dictionary, after you write a code that can't be hold with this number of bits, you increment the number of bits. I.e.: Let's say our codes are 5 bits long, then once we write the entry 31 (2^5=32) we know that may be the next code is 32 so we need to get 6 bits. The order of pixels is the following: left to right and top to bottom.
     

    The dictionary
    The dictionary in gif's lzw implementation has 4096 Code words, the maximum length of bits is 12 (2^12=4096). Each entry must hold any infinite number of bytes in a whole string. But how can we do it? The string of every entry may have any length possible, (we'll restrict this to 2^32: 4gb, so we'll have no problem at all ;-) and because we build them, they aren't pointers to the input file, so we must have a big buffer when we put our strings, we'll allocate memory for it only once. So the structure of our table will be:
     
     
    Name  Length  Meaning 
    Pointer 
    4
    It will point to the string in the buffer. 
    Length
    4
    The length of this string.

    Once you start you must init the code words for the possible bytes, if our images has 16 colors, then we have to initialize the first code word=0, the second=1, the third=2 till the sixteenth. The pseudo code looks like that:

    The entries [2^bpp] and [2^bpp+1] are the CC and the EOI, we'll skip and forget them. Don't worry, we'll never try to read them.

    Now let's talk about the length of the buffer which holds the strings. I've did some test with gif images and they tell that the number of bytes used is common between 20k-40k, but because we should now the limit. I tried it with some pure noise files and scanned images, the maximum number of bytes that any gif used was about 182k. If we don't want to have any problem we could set this buffer to something like 300k. This will be enough. Remember: you should keep track of the position in the buffer when you can write strings, so when you write an string you increment the pointer and then save it. (we don't write to overwrite other strings)
     

    Clear Code
    The Clear code is 2^bpp. For 16 color images it's equal to 16. For 255 color images is 256. How do you calculate this? easy with the fpu, look that:
     
    fild      gif_bpp
    fld1 
    fscale
    fistp    gif_colors
    fstp    dummy_dd
    ;the Bits Per Pixel 
     

    ;the result 
    ;discard st 

    I'll not explain this because I've already written some articles that you can download from my home page. When you get the Clear Code you must reinit everything but the pointer to the decompressed image. The length of the codes is set to the minimum, the table is reinitialized, the pointer to the strings buffer must be set to the start of it. As you remember the first code that you get must be handled in a different way because you don't have last code. So I recommend getting and outputting the first code when you get the Clear Code.
     

    Increment the length
    This must be done when you write a new Code Word. What I've did was the following: Calculating the maximum value that the current value of bits can hold, then every time you write a Code Word you check if it's the maximum, if it's then increment the length of the bits for code, and also calculate a new maximum value. For example, if you are currently getting 9 bits, then once you put in your table the code 511 you must increase the number of bits. Remember that you should NOT increment the length if you put the Code Word 4095, because the maximum length of bits is 12, in that case, you don't allocate new strings, nor increment the length, but you don't have to reinit the process, you should do it only when you get clear code.
     

    Gif format
    I'll only explain the minimum, so you can uncompress a gif, but nothing else, just the basics. This info is extracted from the File Formats Enciclopedia, it can also be found in the specification of gif. I also explain what every (needed) byte means. But first let's explain how a (simple) gif is divided. It first has the global header (the header), then the pal (global color map), and then the Image descriptor, and the end Id: ';' (03Bh) An animated gif will be the same, but just after the first image the second, and this till you find the end Id.

    Global header
    This is the main header:
     
    Global Header 
    Offset 
    Length  Kind  Description 
    0000h 
    6
    char
    ID (magic number) 
    0006h
    1
    word
    Image width
    0008h
    1
    word 
    Image height
    000Ah 
    1
    byte
    Bit mapped
    000Bh
    1
    byte 
    Color index of screen background 
    000Ch
    1
    byte
    Reserved

    Explanation:
     
    ID 

    Width 

    Height
    Bit
     

    Index
    Reserved 

    It may be: 'GIF87a' 'GIF89a' This is the version number, I work with the specification of '89a' however it's very different than '87a' if you only care about decompressing one image. 
    Erh.. ;-) In an animated gif this is like a big screen when the rest of the frames are drew on,   but for us it's the image width. 
  • Look above
  • It contains some information about the image, in bits: 

  •    0-2 -> Bpp-1 (so if it's 3 we have 4 bpp) 
       7 -> if one we have a global color map. 
  • The background of the image, useful for animated gifs, but not for us. 
  • That's the pixel aspect ratio, just don't worry about it. (in '89a') 
  • Global color map
    Also know as pal E-) It's 3*(2^bpp) bytes. It holds the RGB values of every indexed colour. I.e.: For a 16 colors image it's: 3*(2^4) = 48 bytes. Once you load this pal you can load directly to the screen. (I'll suppose you know how to do that ;-) Of course first there's the R value, then G B, and so on. Also note that the values are shifted to the left two times, so after putting any pal value in mode 13h (what I've used OE-) you should 'shr' it two times. Read it only if it the bit7 was 1. In our simple gifs, this pal will be always present.
     

    Image descriptor
    This is the image descriptor:
     
    Image descriptor
    Offset 
    Length  Kind  Description 
    0000h 
    1
    char
    ID 
    0001h
    1
    word
    Left offset of image
    0003h
    1
    word 
    Upper offset of image
    0005h 
    1
    word
    Width of image
    0007h
    1
    word
    Height of image 
    0009h
    1
    byte
    Palette description - bitmapped 

    Explanation:
     
    ID
    ...
    Byte 
  • This is the identification of the block: ',' (02Ch, 0x02C) 
  • The rest four words explain where the image in the global 'window' should be drawed. 
  • The last byte is just for animated gifs, we don't need it. Just do something with it: the Msb (the bit 7) tells if the following image has is own pal. Check it, if it's 0 exit with error. (this will avoid crashes with animated gifs E-) 
  • As you can see we can skip this block as long as the first byte is ','. Just after the Image descriptor there's the image data (unless there's a local pal!) The first byte that find here is the (minimum bits - 1) needed for decoding the image. (Save this because you'll need it when reiniting the whole process) You get this an save it as 'gif_min_bits' or something like that. Then here it comes the Data sub-blocks. I just think this is really STUPID. It doesn't help for error recovering at all. It just adds 1 byte for every 255, I don't know why it's used. But here it is the explanation: Once you get the minimum bits (1 byte) you start a loop. Read a byte, if it's 0 then break loop. If it's not, then read to the getbits buffer the value specified and repeat. Note that three bytes saved in this way are: 03h XXh XXh XXh 00h Also note that this process always ends when (and only when) you read the value 0.
     

    Getbits
    For the getbits you must have a buffer with the image data, wich you've previously readed (with the method explained above). The bit order? From right to left. Right part is right part of the code. E-) I.e.: You have the following 5 bits codes: '54321' 'A9876' 'FEDCB' then the ouput is:

  • First byte  MSB '87654321' LSB
  • Second byte  MSB 'XFEDCBA9' LSB  (X doesn't matter at all)
  • Doing a routine for getting those code shouldn't be very difficult. The only thing you have to do is keep track of how many bits you've writed to the output byte, and rotate the new bit as many positions as bits you've drawed. I.e.: Read the first code of the above example:
    1.  - Get bit 1 put in position 0. Output byte: '1'
    2.  - Get bit 2 put in position 1. Output byte: '21'
    3.  - Get bit 3 put in position 2. Output byte: '321'
    4.  - ...
    Ok? The way you may get the bits is trough the carry (with shifts) for the shifting based on the position a 'shl al,cl' with cl as the counter of the bits read, will be ok. This is the Getbits routine, but you should do a routine that gets the code. It'll draw as may bits as specified in a variable (gif_cur_bits, the number of bits which currently a Code Word has) and then return the code. (this may be the variable that keeps track of how many bits are need for the Code Words)
     

    Putting everything together
    The first code that you'll always get is the CC, so I recommend you starting getting a code. Because it will be CC you'll reinit the table and get the first code (as last) and then you can start the loop.

    Look at the part when we see if it's in the dictionary. What you should do is the following: keep track of what's the next code to write. This has a double purpose: Putting the next code word in the first available entry. And knowing if the encoder has outputted a value that still isn't in the dictionary. I.e.: If you have written 4 codewords then the value should be 5. So if you get a code word and is 5 you know that it isn't in the dictionary. (in this case no enconder will output a value above 5) Note that in this process you could have an error detection routine. If the value is above than the next available Code Word, then this means that the data is corrupted. The code of the reinit_process looks like that: The process of putting a new Code Word in the dictionary is the following:


    Nothing else, just put everything together in the language that you prefer, and you'll have a gif loader.
     

    Improvements
    The only improvements that you can achieve with a (lossless) decompressor is more speed. How to speed it up? first we should start doing a fast getbits, (see my article about optimizing the putbits) this will help, but the most important speed ups which can be done, are those of copying the phrases. May be you could use the fpu, or the Mmx, or copying at least 4 bytes at a time.
     

    Closing words
    We've learned a new scheme and a very used one, Lzw. Now we can try to handle Bw images, or animated gifs, or gifs with any possible resolution. Also if you feel you want, you can do a text compressor (lossless) with Lzw scheme. But remember that Lzw is patented, and you can only use it freely as long as your program isn't commercial. (Demo, intro or whatever) Hope you've learned it without problems, but if you had any, let me know.
     

    Contacting the author
    You can reach me via email at: arturo@arturocampos.com  Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!
     
     

    Arturo San Emeterio Campos, Barcelona 23-07-1999

    This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.