Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Download
Download the whole article zipped.
Table of contents
Introduction
Today let's learn a very common scheme: lzw. We'll use it for decoding
a simple gif. As you can see I'll teach you only how to decode a simply
gif, not animated not interleaved, nor such things ;-) Just the basics,
and if you want more sure you'll easily learn it with all you've learn
there. I'll teach you how to decode 16 colors and 256 colors gifs. You
can ask: And what about 2 colors? And I'll answer, who cares about 2 colors
images? ;-) Our loader will successfully load simple gifs, but whit animated
gifs or two colors gif it will crash, however a simple gif loader suits
my needs and probably yours too.
Theory
Let's talk a little bit about compression history. In 1977 Jacob Lempel
and Abraham Ziv presented their dictionary based scheme Lz77, this leaded
up to Lzss presented by James Storer and Thomas Szymanski 1982 (Hey! this
is the year when I was born! ;-) And this is the scheme that today we use.
(pkzip or arj) Jacob and Abraham then thought more and the arrived to Lz78,
of course it was presented in 1978. This was also a dictionary based scheme,
but it always outputted codewords. And then Terry Welch based in Lz78 presented,
in 1984, Lzw. First it was made for implementation in hardware for high
performance disk controllers and then Compuserve used it in the famous
Gif, Graphics Interchange Format. (sm) Today it's very common on Inet,
because as everybody knows it compressed better than jpeg for images like
cartoons or drawings with computer that have a lot of repetitions, or a
colour which is the background repeated a lot.
Lzw decoding
Lzw is a dictionary based scheme, remember Lz77? (if you don't then
learn it, there's an article in my home page) it outputted bytes and codewords:
pairs of offsets and lengths. Lzw don't, lzw outputs always Codewords,
they refer to a dictionary that has 4096 entries. At the start it's filled
with the possible bytes (if 8 bits image, 0-255) then it starts to fill
the dictionary, every time it gets the last used Codeword, appends to it
the first byte of the current codeword, and then it puts it in the dictionary.
Examples of Lzw encoding will not be presented there. (If you want you
can find some links at my home page) Decoding is also easy: every time
you get a codeword output the string that it represents and then append
the last string with the byte of the string you just outputted, so you
form a new codeword. But there's a little exception (easy to handle) there's
sometimes when the encoder sends you a codeword that isn't in the dictionary
at all, then you just have to form this codeword, how? easy, get the last
codeword, and his first byte, append them, put this codeword in the dictionary
and output it. The first entry available is, in the case we have two elements,
the third. So here it's the pseudo code, as you can imagine the first codeword
must be handled different because then we yet don't have a last Code Word.
Code | Output | New Cw | Last string | Current byte |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
And let's say we get a codeword that isn't in the dictionary: String:
1110
Code | Output | New Cw | |
|
|
|
|
|
|
|
|
|
|
|
Ok? when we get the Cw 3 we yet haven't written it, so we get the last
string "1" and append to it his first byte, '1' so we form the codeword
with the string "11" and we can output it. Why this happens? as in all
the compression schemes the decoder always is one step behind the encoder,
the encoder is who says what to do, in that case that's because when the
encoder outputs the Code Word 3 it has already putted in his table this
Code Word.
Gif decoding
The gif implementation of lzw uses some especial code (zip implementation
has some other codes, like one which tells when to increment the number
of the bits for the codes) Those are:
The dictionary
The dictionary in gif's lzw implementation has 4096 Code words, the
maximum length of bits is 12 (2^12=4096). Each entry must hold any infinite
number of bytes in a whole string. But how can we do it? The string of
every entry may have any length possible, (we'll restrict this to 2^32:
4gb, so we'll have no problem at all ;-) and because we build them, they
aren't pointers to the input file, so we must have a big buffer when we
put our strings, we'll allocate memory for it only once. So the structure
of our table will be:
Name | Length | Meaning |
Pointer |
|
It will point to the string in the buffer. |
Length |
|
The length of this string. |
Once you start you must init the code words for the possible bytes, if our images has 16 colors, then we have to initialize the first code word=0, the second=1, the third=2 till the sixteenth. The pseudo code looks like that:
Now let's talk about the length of the buffer which holds the strings.
I've did some test with gif images and they tell that the number of bytes
used is common between 20k-40k, but because we should now the limit. I
tried it with some pure noise files and scanned images, the maximum number
of bytes that any gif used was about 182k. If we don't want to have any
problem we could set this buffer to something like 300k. This will be enough.
Remember: you should keep track of the position in the buffer when you
can write strings, so when you write an string you increment the pointer
and then save it. (we don't write to overwrite other strings)
Clear Code
The Clear code is 2^bpp. For 16 color images it's equal to 16. For
255 color images is 256. How do you calculate this? easy with the fpu,
look that:
fild gif_bpp
fld1 fscale fistp gif_colors fstp dummy_dd |
;the Bits Per Pixel
;the result
|
I'll not explain this because I've already written some articles that
you can download from my home page.
When you get the Clear Code you must reinit everything but the pointer
to the decompressed image. The length of the codes is set to the minimum,
the table is reinitialized, the pointer to the strings buffer must be set
to the start of it. As you remember the first code that you get must be
handled in a different way because you don't have last code. So I recommend
getting and outputting the first code when you get the Clear Code.
Increment the length
This must be done when you write a new Code Word. What I've did was
the following: Calculating the maximum value that the current value of
bits can hold, then every time you write a Code Word you check if it's
the maximum, if it's then increment the length of the bits for code, and
also calculate a new maximum value. For example, if you are currently getting
9 bits, then once you put in your table the code 511 you must increase
the number of bits. Remember that you should NOT increment the length if
you put the Code Word 4095, because the maximum length of bits is 12, in
that case, you don't allocate new strings, nor increment the length, but
you don't have to reinit the process, you should do it only when you get
clear code.
Gif format
I'll only explain the minimum, so you can uncompress a gif, but nothing
else, just the basics. This info is extracted from the File Formats Enciclopedia,
it can also be found in the specification of gif. I also explain what every
(needed) byte means. But first let's explain how a (simple) gif is divided.
It first has the global header (the header), then the pal (global color
map), and then the Image descriptor, and the end Id: ';' (03Bh) An animated
gif will be the same, but just after the first image the second, and this
till you find the end Id.
Global header
This is the main header:
|
|||
|
Length | Kind | Description |
|
|
|
ID (magic number) |
|
|
|
Image width |
|
|
|
Image height |
|
|
|
Bit mapped |
|
|
|
Color index of screen background |
|
|
|
Reserved |
Explanation:
ID
Width Height
Index
|
It may be: 'GIF87a' 'GIF89a' This is the version number, I work with
the specification of '89a' however it's very different than '87a' if you
only care about decompressing one image.
Erh.. ;-) In an animated gif this is like a big screen when the rest of the frames are drew on, but for us it's the image width. 0-2 -> Bpp-1 (so if it's 3 we have 4 bpp) 7 -> if one we have a global color map. |
Global color map
Also know as pal E-) It's 3*(2^bpp) bytes. It holds the RGB values
of every indexed colour. I.e.: For a 16 colors image it's: 3*(2^4) = 48
bytes. Once you load this pal you can load directly to the screen. (I'll
suppose you know how to do that ;-) Of course first there's the R value,
then G B, and so on. Also note that the values are shifted to the left
two times, so after putting any pal value in mode 13h (what I've used OE-)
you should 'shr' it two times. Read it only if it the bit7 was 1. In our
simple gifs, this pal will be always present.
Image descriptor
This is the image descriptor:
|
|||
|
Length | Kind | Description |
|
|
|
ID |
|
|
|
Left offset of image |
|
|
|
Upper offset of image |
|
|
|
Width of image |
|
|
|
Height of image |
|
|
|
Palette description - bitmapped |
Explanation:
ID
... Byte |
|
As you can see we can skip this block as long as the first byte is ','.
Just after the Image descriptor there's the image data (unless there's
a local pal!) The first byte that find here is the (minimum bits - 1) needed
for decoding the image. (Save this because you'll need it when reiniting
the whole process) You get this an save it as 'gif_min_bits' or something
like that. Then here it comes the Data sub-blocks. I just think this is
really STUPID. It doesn't help for error recovering at all. It just adds
1 byte for every 255, I don't know why it's used. But here it is the explanation:
Once you get the minimum bits (1 byte) you start a loop. Read a byte, if
it's 0 then break loop. If it's not, then read to the getbits buffer the
value specified and repeat. Note that three bytes saved in this way are:
03h XXh XXh XXh 00h Also note that this process always ends when (and only
when) you read the value 0.
Getbits
For the getbits you must have a buffer with the image data, wich you've
previously readed (with the method explained above). The bit order? From
right to left. Right part is right part of the code. E-) I.e.: You have
the following 5 bits codes: '54321' 'A9876' 'FEDCB' then the ouput is:
Doing a routine for getting those code shouldn't be very difficult. The only thing you have to do is keep track of how many bits you've writed to the output byte, and rotate the new bit as many positions as bits you've drawed. I.e.: Read the first code of the above example:First byte MSB '87654321' LSB Second byte MSB 'XFEDCBA9' LSB (X doesn't matter at all)
Putting everything
together
The first code that you'll always get is the CC, so I recommend you
starting getting a code. Because it will be CC you'll reinit the table
and get the first code (as last) and then you can start the loop.
Nothing else, just put everything together in the language that
you prefer, and you'll have a gif loader.
Improvements
The only improvements that you can achieve with a (lossless) decompressor
is more speed. How to speed it up? first we should start doing a fast getbits,
(see my article about optimizing the putbits) this will help, but the most
important speed ups which can be done, are those of copying the phrases.
May be you could use the fpu, or the Mmx, or copying at least 4 bytes at
a time.
Closing words
We've learned a new scheme and a very used one, Lzw. Now we can try
to handle Bw images, or animated gifs, or gifs with any possible resolution.
Also if you feel you want, you can do a text compressor (lossless) with
Lzw scheme. But remember that Lzw is patented, and you can only use it
freely as long as your program isn't commercial. (Demo, intro or whatever)
Hope you've learned it without problems, but if you had any, let me know.
Contacting the
author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!
This article comes from Arturo Campos home page at http://www.arturocampos.com Visit again soon for new and updated compression articles and software.