Theory of Data Compression
Contents
- Introduction and Background
- Source Modeling
- Entropy Rate of a Source
- Shannon Lossless Source Coding Theorem
- Rate-Distortion Theory
- The Gap Between Theory and Practice
- FAQs (Frequently Asked Questions)
- References
I. Introduction and Background
[Photo: Claude E. Shannon]
In his 1948 paper,
``A
Mathematical Theory of Communication,''
Claude E. Shannon formulated the theory of data compression.
Shannon established that there
is a fundamental limit to
lossless data compression. This
limit, called the entropy rate, is denoted by
H. The exact value of H depends on the information source
--- more specifically, the statistical nature of
the source. It is possible to compress the source, in a lossless manner, with
compression rate
close to H. It is mathematically impossible to do better than H.
Shannon also developed the theory of
lossy data compression.
This is better known as rate-distortion theory.
In lossy data compression, the decompressed data does not have to be
exactly the same as the original data. Instead, some amount
of distortion, D, is tolerated.
Shannon showed that, for a given source (with all its statistical
properties known) and a given distortion measure,
there is a function, R(D), called the rate-distortion function.
The theory says that if D is the tolerable amount of distortion, then
R(D) is the best possible compression rate.
When the compression is lossless (i.e., no distortion or D=0), the
best possible compression rate is R(0)=H (for a finite alphabet
source). In other words, the best
possible lossless compression rate is the entropy rate. In this sense,
rate-distortion theory is a generalization of lossless data compression theory,
where we went from no distortion (D=0) to some distortion (D>0).
Lossless data compression theory and rate-distortion theory
are known collectively as source coding theory.
Source coding theory sets fundamental limits on the performance of all
data compression algorithms. The theory, in itself, does not
specify exactly how to design and implement these algorithms.
It does, however, provide some hints and guidelines on how to
achieve optimal performance.
In the following sections, we will describe how Shannon modeled
an information source in terms of a random process, present
Shannon's lossless source coding theorem, and discuss Shannon's
rate-distortion theory. A background in probability
theory is recommended.
II. Source Modeling
Imagine that you go to the library. This library has a large selection
of books --- say there are 100 million books in this library. Each book
in this library is very thick --- say, for example, that each book has
100 million characters (or letters) in them. When you get to this library,
you will, in some random manner, select a book to check out. This chosen book
is the information source to be compressed. The compressed book
is then stored on your zip disk to take home, or transmitted directly
over the internet into your home, or whatever the case may be.
Mathematically, the book you select is denoted by

X = (X_1, X_2, X_3, ...),

where X represents the whole book, X_1 represents the first character in
the book, X_2 represents the second character, and so on.
Even though in reality the length of the book is finite, mathematically we
assume that it has infinite length. The reasoning is that the book is so
long we can just imagine that it goes on forever. Furthermore, the
mathematics turns out to be surprisingly simpler if we assume a book of
infinite length.
To simplify things a little, let us assume that all the characters in all
the books are either a lower-case letter (`a' through `z') or a SPACE
(e. e. cummings style of writing, shall we say). The
source alphabet, A,
is defined to be the set of all 27 possible values of the characters:

A = {a, b, c, ..., z, SPACE}.
Now put yourself in the shoes of the engineer who designs the compression
algorithm. She does not know in advance which book you will select. All
she knows is that you will be selecting a book from this library.
From her perspective, the characters X_1, X_2, X_3, ... in the book
are
random variables
which take values in the alphabet A.
The whole book, X = (X_1, X_2, X_3, ...),
is just an infinite sequence of random variables --- that is, X is a
random process.
There are several ways in which this engineer can model the statistical
properties of the book.
- Zero-Order Model:
Each character is statistically independent of all other characters
and the 27 possible values in the alphabet
are equally likely to occur. If this model is accurate, then a typical
opening of a book would look like this
(all of these examples came directly from
Shannon's 1948 paper):
xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd
This does not look like the writing of an intelligent being.
In fact, it resembles the writing of a ``monkey sitting at a typewriter.''
- First-Order Model:
We know that in the English language some letters occur more frequently
than others. For example, the letters `a' and `e' are more common than
`q' and `z'. Thus, in this model, the characters are still
independent of one another, but their probability distribution follows the
first-order
statistical distribution of English text.
A typical text for this model looks like this:
ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva nah brl
- Second-Order Model:
The previous two models assumed statistical independence from one
character to the next. This does not accurately reflect the nature
of the English language. For exa#ple, some letters in thi# sent#nce
are missi#g. However, we are still able to figure out what those letters
should have been by looking at the context. This implies that there is
some dependency between the characters. Naturally, characters in close
proximity are more dependent on one another than those that are far apart.
In this model, the present character X_n depends on the previous
character X_{n-1}, but it is conditionally independent of all earlier
characters: P(X_n | X_1, ..., X_{n-1}) = P(X_n | X_{n-1}).
According to this model, the probability distribution of the present
character varies according to what the previous
character is. For example, the letter `u'
rarely occurs (probability=0.022). However, given that the
previous character is `q', the probability of a `u' in the present
character is much higher (probability=0.995). For a complete
description, see the
second-order
statistical distribution of English text.
A typical text for this model would look like this:
on ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at teasonare fuso tizin andy tobe seace ctisbe
- Third-Order Model:
This is an extension of the previous model. Here, the present
character X_n depends on the previous two characters, X_{n-1} and X_{n-2},
but it is conditionally independent of all characters before those:
P(X_n | X_1, ..., X_{n-1}) = P(X_n | X_{n-1}, X_{n-2}).
In this model, the distribution of X_n varies according to what
X_{n-1} and X_{n-2} are.
See the
third-order
statistical distribution of English text.
A typical text for this model would look like this:
in no ist lat whey cratict froure birs grocid pondenome of demonstures of the reptagin is regoactiona of cre
The resemblance to ordinary English text increases quite noticeably at each
of the above steps.
- General Model:
In this model, the book is an arbitrary
stationary random process.
The statistical properties of this model are too complex to be deemed
practical. This model is interesting only from a theoretical point of view.
Each model above is a special case of the one that follows it: the
zero-order model is a special case of the first-order model, which is a
special case of the second-order model, which is a special case of the
third-order model, which in turn is a special case of the general model.
A short sketch of how text can be generated from the simpler models is
given below.
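To make the first two models concrete, here is a minimal Python sketch that generates text from a zero-order model and from a first-order model over the 27-symbol alphabet. The first-order letter probabilities below are rough illustrative values, not Shannon's actual first-order distribution of English; a second-order model would replace the single probability table with a 27-by-27 table of conditional probabilities indexed by the previous character.

```python
import random
import string

ALPHABET = list(string.ascii_lowercase) + [' ']   # the 27-symbol source alphabet

def zero_order_text(length):
    """Zero-order model: all 27 symbols are independent and equally likely."""
    return ''.join(random.choice(ALPHABET) for _ in range(length))

# Hypothetical first-order probabilities: SPACE and the common letters are
# weighted more heavily than the rare ones.  (Illustrative values only.)
FIRST_ORDER_PROBS = {c: 0.01 for c in string.ascii_lowercase}
FIRST_ORDER_PROBS.update({' ': 0.19, 'e': 0.10, 't': 0.07, 'a': 0.06, 'o': 0.06,
                          'i': 0.06, 'n': 0.06, 's': 0.05, 'h': 0.05, 'r': 0.05})
total = sum(FIRST_ORDER_PROBS.values())
FIRST_ORDER_PROBS = {c: p / total for c, p in FIRST_ORDER_PROBS.items()}  # normalize

def first_order_text(length):
    """First-order model: symbols independent, drawn from a skewed distribution."""
    symbols = list(FIRST_ORDER_PROBS)
    weights = [FIRST_ORDER_PROBS[c] for c in symbols]
    return ''.join(random.choices(symbols, weights=weights, k=length))

if __name__ == '__main__':
    print(zero_order_text(60))    # gibberish: the 'monkey at a typewriter'
    print(first_order_text(60))   # still gibberish, but with English-like letter frequencies
```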
III. Entropy Rate of a Source
[Photo: Shannon in 1948]
The entropy rate of a source is a number which depends only on the statistical
nature of the source. If the source has a simple model, then this
number can be easily calculated. Here, we consider an arbitrary source,
X = (X_1, X_2, X_3, ...), while paying special attention to the case where
X is English text.
- Zero-Order Model:
The characters are statistically independent of each other,
and every letter of the alphabet is equally likely to occur. Let m be
the size of the alphabet. In this case, the entropy rate is given by

H = log2(m) bits/character.

For English text, the alphabet size is m = 27.
Thus, if this had been an accurate model for English text, then
the entropy rate would have been
H = log2 27 = 4.75 bits/character.
- First-Order Model:
The characters are statistically independent. Let m be
the size of the alphabet and let p_i be the probability of
the i-th letter in the alphabet. The entropy rate is

H = - Σ_i p_i log2 p_i bits/character.

Using the
first-order
distribution,
the entropy rate of English text would have been 4.07 bits/character had
this been the correct model.
- Second-Order Model:
Let p(j|i) be the conditional probability that the present character
is the j-th letter in the alphabet given that the previous character is
the i-th letter. The entropy rate is

H = - Σ_i p_i Σ_j p(j|i) log2 p(j|i) bits/character,

where p_i is the probability of the i-th letter, as in the first-order model.
Using the
second-order distribution,
the entropy rate of English text would have been 3.36 bits/character had
this been the correct model.
- Third-Order Model:
Let p(k|i,j) be the conditional probability that the present character
is the k-th letter in the alphabet given that the previous character is
the j-th letter and the one before that is the
i-th letter. The entropy rate is

H = - Σ_{i,j} p(i,j) Σ_k p(k|i,j) log2 p(k|i,j) bits/character,

where p(i,j) is the probability that a pair of consecutive characters
consists of the i-th letter followed by the j-th letter.
Using the
third-order distribution,
the entropy rate of English text would have been 2.77 bits/character had
this been the correct model.
- General Model:
Let (X_1, X_2, ..., X_n) represent the first n characters, and let

H(X_1, ..., X_n) = - Σ p(x_1, ..., x_n) log2 p(x_1, ..., x_n),

where the sum is over all possible values of (x_1, ..., x_n). The
entropy rate in the general case is given by

H = lim_{n -> ∞} (1/n) H(X_1, ..., X_n).

It is virtually impossible to calculate the entropy rate directly from the
above equation. Using a prediction method, Shannon was able to estimate
that the entropy rate of 27-letter English text is 2.3 bits/character
(see
Shannon: Collected Papers).
We emphasize that there is only one entropy rate for a given source.
All of the above definitions of the entropy rate are consistent with one
another: each is simply what the general formula reduces to under the
corresponding model. A short sketch of how these entropy rates might be
computed is given below.
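As a rough illustration of how these formulas are evaluated, here is a small Python sketch. The three-letter source at the bottom uses made-up toy probabilities, not the English-text distributions referenced above.

```python
import math

def zero_order_entropy(m):
    """Zero-order model: H = log2(m) for m equally likely, independent symbols."""
    return math.log2(m)

def first_order_entropy(p):
    """First-order model: H = -sum_i p_i log2 p_i for independent symbols."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def second_order_entropy(p, cond):
    """Second-order model: H = -sum_i p_i sum_j p(j|i) log2 p(j|i).

    p[i]       : probability of the i-th letter
    cond[i][j] : probability that the j-th letter follows the i-th letter
    (For this to be the true entropy rate, p should be the stationary
    distribution of cond; the toy values below just illustrate the formula.)
    """
    return -sum(p[i] * sum(cond[i][j] * math.log2(cond[i][j])
                           for j in range(len(p)) if cond[i][j] > 0)
                for i in range(len(p)))

if __name__ == '__main__':
    print(zero_order_entropy(27))           # 4.75... bits/character (27-letter alphabet)

    p = [0.5, 0.3, 0.2]                     # toy 3-letter source, not English statistics
    print(first_order_entropy(p))           # about 1.49 bits/character

    cond = [[0.8, 0.1, 0.1],
            [0.2, 0.6, 0.2],
            [0.3, 0.3, 0.4]]
    print(second_order_entropy(p, cond))    # about 1.19 bits/character (dependence lowers H)
```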
IV. Shannon Lossless Source Coding Theorem
Shannon's lossless source coding theorem is based on the concept of block
coding. To illustrate this concept, we introduce a special information
source in which the alphabet consists of only two letters:

A = {a, b}.

Here, the letters `a' and `b' are equally likely to occur. However, given
that `a' occurred in the previous character, the probability that `a'
occurs again in the present character is 0.9. Similarly, given
that `b' occurred in the previous character, the probability that `b'
occurs again in the present character is 0.9.
This is known as a binary symmetric Markov source.
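A minimal sketch of this source: it simulates a sample sequence and evaluates the source's entropy rate, which (since each character repeats the previous one with probability 0.9) is H = -(0.9 log2 0.9 + 0.1 log2 0.1), about 0.469 bits/character. This is the target that the block codes below gradually approach.

```python
import math
import random

P_STAY = 0.9   # probability that the next character repeats the current one

def simulate(length):
    """Generate a sample sequence from the binary symmetric Markov source."""
    chars = [random.choice('ab')]            # 'a' and 'b' equally likely to start
    for _ in range(length - 1):
        if random.random() < P_STAY:
            chars.append(chars[-1])          # repeat the previous character
        else:
            chars.append('b' if chars[-1] == 'a' else 'a')
    return ''.join(chars)

def entropy_rate(p_stay=P_STAY):
    """Entropy rate of the source: the binary entropy of the transition probability."""
    p = 1 - p_stay
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

if __name__ == '__main__':
    print(simulate(40))        # typically long runs of a's and b's
    print(entropy_rate())      # about 0.469 bits/character
```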
An n-th order block code is just a mapping which assigns to
each block of n consecutive characters a sequence of bits of
varying length.
The following examples illustrate this concept.
- First-Order Block Code: Each character is mapped to a single bit.

Block | Probability | Codeword
a     | 0.5         | 0
b     | 0.5         | 1

Rate: R = 1 bit/character

For example, a sequence of 24 characters is always represented by 24 bits
--- an average of 1 bit/character.
- Second-Order Block Code: Pairs of characters are mapped to either one,
two, or three bits.

Block | Probability | Codeword
aa    | 0.45        | 0
bb    | 0.45        | 10
ab    | 0.05        | 110
ba    | 0.05        | 111

Rate: R = 0.825 bits/character

In one example sequence of 24 characters, 20 bits are used --- an average
of 0.83 bits/character.
- Third-Order Block Code: Triplets of characters are mapped to bit
sequences of lengths one through six.

Block | Probability | Codeword
aaa   | 0.405       | 0
bbb   | 0.405       | 10
aab   | 0.045       | 1100
abb   | 0.045       | 1101
bba   | 0.045       | 1110
baa   | 0.045       | 11110
aba   | 0.005       | 111110
bab   | 0.005       | 111111

Rate: R = 0.68 bits/character

In one example sequence of 24 characters, 17 bits are used --- an average
of 0.71 bits/character.
Note that:
- The rates shown in the tables are calculated from
R = (1/n) Σ_i p_i l_i, where p_i is the probability of block i and
l_i is the length of its codeword (see the sketch following this list).
- The higher the order, the lower the rate (better compression).
- The codes used in the above examples are
Huffman codes.
- We are only interested in lossless
data compression codes. That is, given the code table and
the compressed data, we should be able to recover the original data exactly.
All of the codes in the above examples are lossless.
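As a quick check, the rates in the three tables can be recomputed from R = (1/n) Σ p_i l_i with a few lines of Python (the probabilities and codewords are copied from the tables above):

```python
def block_code_rate(n, table):
    """R = (1/n) * sum_i p_i * l_i, where l_i is the codeword length for block i."""
    return sum(p * len(codeword) for p, codeword in table.values()) / n

# block: (probability, codeword) -- taken from the tables above
first_order  = {'a': (0.5, '0'), 'b': (0.5, '1')}
second_order = {'aa': (0.45, '0'), 'bb': (0.45, '10'),
                'ab': (0.05, '110'), 'ba': (0.05, '111')}
third_order  = {'aaa': (0.405, '0'),      'bbb': (0.405, '10'),
                'aab': (0.045, '1100'),   'abb': (0.045, '1101'),
                'bba': (0.045, '1110'),   'baa': (0.045, '11110'),
                'aba': (0.005, '111110'), 'bab': (0.005, '111111')}

print(block_code_rate(1, first_order))    # 1.0 bit/character
print(block_code_rate(2, second_order))   # 0.825 bits/character
print(block_code_rate(3, third_order))    # 0.68 bits/character
```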
Theorem: Let R_n
be the rate of an optimal n-th order lossless
data compression code (in bits/character). Then

H(X_1, ..., X_n)/n  <=  R_n  <=  H(X_1, ..., X_n)/n + 1/n.

Since both the upper and the lower bound approach the
entropy rate, H, as n goes to infinity, we have

lim_{n -> ∞} R_n = H.

Thus, the theorem establishes that the entropy rate is the rate of
an optimal lossless data compression code. The limit exists as long
as the source is stationary.
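To see the theorem at work numerically, one can construct a Huffman code on the n-character blocks of the binary symmetric Markov source above and watch R_n fall from 1 bit/character toward the entropy rate H ≈ 0.469 bits/character as n grows. The sketch below uses a standard heap-based construction to obtain the optimal codeword lengths; it is an illustration, not a practical compressor.

```python
import heapq
import itertools
import math

P_STAY = 0.9                      # the binary symmetric Markov source from above

def block_probabilities(n):
    """Probability of each length-n block such as 'aab' under the Markov source."""
    probs = {}
    for block in itertools.product('ab', repeat=n):
        p = 0.5                                   # first character: 'a' or 'b' equally likely
        for prev, cur in zip(block, block[1:]):
            p *= P_STAY if cur == prev else 1 - P_STAY
        probs[''.join(block)] = p
    return probs

def huffman_lengths(probs):
    """Codeword length of each symbol in an optimal (Huffman) binary code."""
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = itertools.count(len(heap))          # tie-breaker for equal probabilities
    while len(heap) > 1:
        p1, _, lens1 = heapq.heappop(heap)
        p2, _, lens2 = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**lens1, **lens2}.items()}  # one level deeper
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

def optimal_block_rate(n):
    """R_n: bits/character achieved by a Huffman code on n-character blocks."""
    probs = block_probabilities(n)
    lengths = huffman_lengths(probs)
    return sum(probs[s] * lengths[s] for s in probs) / n

entropy_rate = -(0.9 * math.log2(0.9) + 0.1 * math.log2(0.1))   # about 0.469

for n in range(1, 9):
    print(n, round(optimal_block_rate(n), 3))     # approaches the entropy rate as n grows
print('H =', round(entropy_rate, 3))
```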
V. Rate-Distortion Theory
In lossy data compression, the decompressed data need not be exactly the
same as the original data. Often, it suffices to have a reasonably close
approximation. A distortion measure
is a mathematical entity which specifies exactly how close the
approximation is. Generally, it is a function which assigns to any two
letters x and y in the alphabet
a non-negative number, denoted d(x, y).
Here, x is the original data, y is the approximation,
and d(x, y) is the amount of distortion between x and y.
The most common distortion measures are the
Hamming distortion measure,

d(x, y) = 0 if x = y, and d(x, y) = 1 if x ≠ y,

and the squared-error distortion measure
(which is only applicable when the alphabet is a set of numbers),

d(x, y) = (x - y)^2.
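As a small concrete example of these two measures, the sketch below computes the average Hamming distortion between a string and a corrupted copy, and the average squared-error distortion between a numeric sequence and its approximation (the example data are made up):

```python
def hamming_distortion(x, y):
    """d(x, y) = 0 if x == y, 1 otherwise, averaged over the sequence."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def squared_error_distortion(x, y):
    """d(x, y) = (x - y)^2, averaged over the sequence (numeric alphabets only)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

original      = 'compression'
approximation = 'comprossion'
print(hamming_distortion(original, approximation))       # 1 mismatch in 11 characters

samples       = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.0, 2.5, 3.0, 3.5]
print(squared_error_distortion(samples, reconstructed))  # 0.125
```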
Rate-distortion theory says that for a given source and a given
distortion measure, there exists a function, R(D), called
the rate-distortion function. The typical R(D) curve is convex and
non-increasing: it starts at R(0) = H when no distortion is allowed and
falls to zero at some maximum distortion D_max.
If the source samples are independent of one another, the
rate-distortion function can be obtained by solving the
constrained minimization problem

R(D) = min over Q(j|i) of  Σ_i Σ_j p_i Q(j|i) log2 [ Q(j|i) / (Σ_k p_k Q(j|k)) ],

subject to the constraints that Q(j|i) >= 0 and Σ_j Q(j|i) = 1 for every i,
and that

Σ_i Σ_j p_i Q(j|i) d_ij <= D,

where p_i is the probability of the i-th letter, Q(j|i) is the probability
that the i-th letter is reproduced as the j-th letter, and d_ij is the
distortion between the i-th and j-th letters in the alphabet.
Using Blahut's algorithm, the rate-distortion function can be
calculated numerically.
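Here is a sketch of what such a numerical calculation might look like, using the standard Blahut(-Arimoto) iteration. The Bernoulli(0.5) source with Hamming distortion in the example is an illustrative choice; for that case the exact answer is known to be R(D) = 1 - Hb(D), which the iteration reproduces closely.

```python
import numpy as np

def blahut_arimoto(p, d, s, iters=500):
    """One point on the R(D) curve via the Blahut-Arimoto iteration.

    p     : source distribution over the input alphabet, shape (m,)
    d     : distortion matrix d[i][j] between input i and reproduction j
    s     : slope parameter (s < 0); more negative -> smaller distortion
    Returns (D, R) with R in bits.
    """
    m, k = d.shape
    q = np.full(k, 1.0 / k)                       # output distribution, start uniform
    for _ in range(iters):
        # Conditional distribution Q(j|i) that minimizes the Lagrangian
        Q = q * np.exp(s * d)
        Q /= Q.sum(axis=1, keepdims=True)
        q = p @ Q                                 # updated output distribution
    D = float(np.sum(p[:, None] * Q * d))
    R = float(np.sum(p[:, None] * Q * np.log2(Q / q)))
    return D, max(R, 0.0)

if __name__ == '__main__':
    # Example: Bernoulli(0.5) source with Hamming distortion.
    # Theory gives R(D) = 1 - Hb(D), so R(0.1) should be about 0.531 bits.
    p = np.array([0.5, 0.5])
    d = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    for s in [-0.5, -1.0, -2.0, -4.0]:
        D, R = blahut_arimoto(p, d, s)
        print(f"D = {D:.3f}, R(D) = {R:.3f} bits")
```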
Rate-distortion theory is based on the concept of block coding
(similar to above). A lossy block code is known as a
vector quantizer (VQ).
The block length n of the code is known as the VQ dimension.
Theorem: Assume that the source is stationary and that the
source samples are independent of one another. For every D >= 0
and for every ε > 0, there exists a VQ of
(sufficiently large) dimension n with distortion no greater than D
and rate no more than R(D) + ε. Furthermore, there does not exist
a VQ with distortion D and rate less than R(D).
In essence, the theorem states that R(D) is the rate-distortion
performance of an optimal VQ.
The above theorem can be generalized to the case where the source
is stationary and ergodic.
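A vector quantizer itself is easy to sketch with the classical Lloyd (k-means) iteration. The toy example below quantizes blocks of n = 2 samples from an i.i.d. Gaussian source (an illustrative choice) with a 16-codeword codebook, giving a rate of log2(16)/2 = 2 bits/sample. For a unit-variance Gaussian source, rate-distortion theory puts the best possible distortion at this rate at 2^(-2R) = 0.0625; a dimension-2 VQ lands somewhat above that, and the gap shrinks as the dimension grows, which is what the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_vq(samples, codebook_size, iters=30):
    """Train a VQ codebook with the basic Lloyd (k-means) iteration."""
    codebook = samples[rng.choice(len(samples), codebook_size, replace=False)]
    for _ in range(iters):
        # Nearest-codeword assignment under squared-error distortion
        dists = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Centroid update; keep the old codeword if a cell is empty
        for j in range(codebook_size):
            cell = samples[nearest == j]
            if len(cell) > 0:
                codebook[j] = cell.mean(axis=0)
    return codebook

def quantize(samples, codebook):
    """Map each block to its nearest codeword."""
    dists = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook[dists.argmin(axis=1)]

if __name__ == '__main__':
    n = 2                                   # VQ dimension (block length)
    data = rng.normal(size=(20000, n))      # i.i.d. Gaussian source, in blocks of 2
    codebook = train_vq(data, codebook_size=16)
    reconstructed = quantize(data, codebook)
    D = ((data - reconstructed) ** 2).mean()    # per-sample squared-error distortion
    R = np.log2(len(codebook)) / n              # bits per source sample
    print(f"rate = {R} bits/sample, distortion = {D:.3f}")
```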
VI. The Gap Between Theory and Practice
- The theory holds when the block length n approaches infinity.
In real-time compression,
the compression algorithm must wait for n consecutive
source samples before it can begin the compression.
When n is large, this wait (or delay) may be too long.
For example, in real-time speech compression, the speech signal is
sampled at 8000 samples/second. If n is say 4000, then
the compression delay is half a second. In a two-way conversation,
this long delay may not be desirable.
- The theory does not take into consideration the complexities
associated with the compression and decompression operations.
Typically, as the block length n increases, the
complexities also increase. Often, the rate of increase is
exponential in n.
- The theory assumes that the statistical properties of the
source are known. In practice, this information may not be
available.
- The theory assumes that there is no error in the compressed bit
stream. In practice, due to noise in the communication channel or
imperfection in the storage medium, there are errors in the compressed
bit stream.
All of these problems have been successfully solved by
researchers in the field.
VII. FAQs (Frequently Asked Questions)
- What is the difference between lossless and lossy compression?
- What is the difference between compression rate and compression ratio?
- What is the difference between ``data compression theory'' and
``source coding theory''?
- What does ``stationary'' mean?
- What does ``ergodic'' mean?
- What is the difference between lossless and lossy compression?
In lossless data compression, the compressed-then-decompressed data is an
exact replication of the original data. On the other hand, in lossy
data compression, the decompressed data may be different from the original
data. Typically, there is some distortion between the original and reproduced
signal.
The popular WinZip program is an example of lossless compression.
JPEG is an example of lossy compression.
- What is the difference between compression rate and compression ratio?
Historically, there are two main types of applications of data
compression: transmission and storage. An example of the former
is speech compression for real-time transmission over digital cellular
networks. An example of the latter is file compression (e.g. Drivespace).
The term ``compression rate'' comes from the transmission camp, while
``compression ratio'' comes from the storage camp.
Compression rate is the rate of the compressed data (which we imagine to
be transmitted in ``real time''). Typically, it is in units of bits/sample,
bits/character, bits/pixel, or bits/second. Compression ratio is the
ratio of the size or rate of the original data to the size or rate of the
compressed data. For example, if a gray-scale image is originally
represented by 8 bits/pixel (bpp) and it is compressed to 2 bpp, we say
that the compression ratio is 4-to-1. Sometimes it is said that the
compression ratio is 75%, meaning that the compressed data is 75% smaller
than the original.
Compression rate is an absolute term, while compression ratio is a
relative term.
We note that there are current applications which can be considered
as both transmission and storage. For example, the above photograph
of Shannon is stored in JPEG format. This not only saves storage space
on the local disk, it also speeds up the delivery of the image over the
internet.
- What is the difference between ``data compression theory'' and
``source coding theory''?
There is no difference. They both mean the same thing. The
term ``coding'' is a general
term which could mean either ``data compression'' or ``error control coding''.
- What does ``stationary'' mean?
Mathematically, a random process is stationary (or strict-sense stationary)
if, for every pair of positive integers n and k, the vectors

(X_1, X_2, ..., X_n)  and  (X_{1+k}, X_{2+k}, ..., X_{n+k})

have the same (joint) probability distribution.
We can think of ``stationarity'' in terms of our library example. Suppose
that we look at the first letter of all 100 million books and see how often
the first letter is `a', how often it is `b', and so on. We will then get
a probability distribution of the first letter of the (random) book.
If we repeat this experiment for the fifth and 105th letters, we will get two
other distributions. If all three distributions are the same, we are
inclined to believe that the book process is stationary. To be sure, we
should look at the joint distribution of the first and second letters
and compare it with the joint distribution of the 101st and 102nd letters.
If these two joint distributions are roughly the same, we will be more
convinced that the book process is stationary.
In reality, the book process cannot be stationary because the first
character cannot be a SPACE.
- What does ``ergodic'' mean?
The strict mathematical definition of ergodicity is too complex to
explain. However, Shannon offered the following intuitive explanation:
``In an ergodic process every sequence produced by the process is the same
in statistical properties. Thus the letter frequencies, (the pairwise)
frequencies, etc., obtained from particular sequences, will, as the lengths of
the sequences increase, approach definite limits independent of the
particular sequence. Actually this is not true of every sequence but the
set for which it is false has probability zero. Roughly the ergodic property
means statistical homogeneity.''
VIII. References
- C. E. Shannon, ``A Mathematical Theory of Communication.''
- C. E. Shannon and W. Weaver,
The Mathematical Theory of Communication.
- C. E. Shannon, N. J. Sloane, and A. D. Wyner,
Claude Elwood Shannon: Collected Papers.
- C. E. Shannon,
``Prediction and Entropy of Printed English,'' available in
Shannon: Collected Papers.
- C. E. Shannon,
``Coding Theorems for a Discrete Source with a Fidelity Criterion,'' available in
Shannon: Collected Papers.
- T. M. Cover and J. A. Thomas,
Elements of Information Theory.
- R. Gallager,
Information Theory and Reliable Communication.
- A. Gersho and R. M. Gray,
Vector Quantization and Signal Compression.
- K. Sayood,
Introduction to Data Compression.
- M. Nelson and J.-L. Gailly,
The Data Compression Book.
- A. Leon-Garcia,
Probability and Random Processes for Electrical Engineering.
- A. Papoulis,
Probability, Random Variables, and Stochastic Processes.
- M. R. Schroeder,
Computer Speech: Recognition, Compression, Synthesis.
- G. Held and T. R. Marshall,
Data and Image Compression: Tools and Techniques.
- D. Hankerson, P. D. Johnson, and G. A. Harris,
Introduction to Information Theory and Data Compression.
Note: To anyone interested in information theory, I can highly
recommend two books: Shannon's
Collected Papers
and
Cover and Thomas'
Elements of Information Theory.
Collected Papers
includes a 1987 interview of Shannon by OMNI Magazine, almost
all of his papers on information theory and cryptography, his celebrated
Master's thesis on A Symbolic Analysis of Relay and Switching Circuits,
his Ph.D. dissertation on An Algebra for Theoretical Genetics, his
paper on the Scientific Aspects of Juggling, his paper
on A Mind Reading Machine, and much more.
Elements of Information Theory
is now the standard textbook on information theory. I have been using this
textbook in my graduate-level information theory course for the past seven
years. I have had nothing but positive feedback from students who have
studied this book.
Nam Phamdo
Department of Electrical and Computer Engineering
State University of New York
Stony Brook, NY 11794-2350
phamdo@ieee.org
Copyright © 2000-2001, Nam Phamdo