Arithmetic coding
Introduction
The main problem of lossless compression is to decompose a data set into a sequence of events. Those events are to be encoded using as few bits as possible.The idea is to give a shorter codewords to more probable events. The data set could be compressed whenever some events are more probable then other. There are set of distinct events e1, e2, e3, .., en wich togather are making probability distribution P. Shannon proved that the smallest possible expected numbers of bits needed to encode an event is entrophy of P (H(P)).
p(ek) is a probability of event ek. The Arithmetic coding with accurate probabily of events gives an optimal compression. Here comes two main problems that have to be solved. First problem is a pure implemetation of arithmetic coding. Second problem is statistical modeling used in accurately predicting probablities of events. The coder uses those events from statistical modeling.
Algorithm
With given probabilites of events, the algorithm works in these three steps.
- the coder subdivides current interval into subintervals, one for each event.
- The size of a subinterval is proportion to to the probablity that the event will be the next event in the field/file.the code selets the subinterval corresponding to the event that actually occurs next and makes it the new current interval
The size of final interval is equal to the product of the probabilties of all individual events that occurs in the input filed/file.
The description of algorithm in pseudocode for implementation.
set Low to 0
set High to 1
while there are input symbols do
take a symbol
CodeRange = High – Low
High = Low + CodeRange * HighRange(symbol)
Low = Low + CodeRange * LowRange(symbol)
end of while
output Low
The low and high parts of subintervals are subintervals of inteval [0,1). The size of each subinterval depends on the probability that this symbol occurs in the input field/file.
Example 1.
Table 1.
event |
probability of event |
a1 |
2/3 |
b1 |
1/3 |
a2 |
1 |
b2 |
1 |
a3 |
3/5 |
b3 |
2/5 |
The input field is given by these symbols: b1a2b3. The coding is given in table 2.
Table 2.
Action |
Subinterval |
begin | [0,1) |
subdivide current interval into new intervals of 2/3 and 1/3 of current interval | [0, 2/3), [2/3, 1) |
new current interval (event is b1): | [2/3, 1) |
subdivide current interval into new intervals of 1 and 1 of current interval | [2/3, 5/6), [5/6, 1) |
new current interval (event is a2): | [2/3, 5/6) |
subdivide current interval into new intervals of 1 and 1 of current interval | [2/3, 23/30), [23/30, 5/6) |
new current interval (event is b3): | [23/30, 5/6) |
output 11001 becouse that is the smallest binary number that lies within [23/30, 5/6) |
Picture 1.
The final interval is [23/30, 5/6). The size of this event is 1/15. That is the product of the sizes of probabilities events in the input filed
.
Example 2.
The message to be encoded is “ARITHMETIC”. There are ten symbols in the message. The probability distibution is given in the table 3.
Table 3.
Symbol |
Subinterval |
A |
1/10 |
C |
1/10 |
E |
1/10 |
H |
1/10 |
I |
2/10 |
M |
1/10 |
R |
1/10 |
T |
2/10 |
Each character is assigned the portion of the starting inteval [0,1). The size of inteval corresponds to symbol’s probability of appearance.
Table 4.
Symbol |
Subinterval |
A |
0,00 – 0,10 |
C |
0,10 – 0,20 |
E |
0,20 – 0,30 |
H |
0,30 – 0,40 |
I |
0,40 – 0,60 |
M |
0,60 – 0,70 |
R |
0,70 – 0,80 |
T |
0,80 – 1,00 |
The coding is performed in the table 5. Note, the most significant portion of a coded message belongs to the first symbol to be encoded. In this case, the first symbol is “A” which owns range [0, 0.1).
Table 5.
Symbol |
LowRange |
HighRange |
0,0 |
1,0 |
|
A |
0,0 |
0,1 |
R |
0,07 |
0,08 |
I |
0,074 |
0,076 |
T |
0,0756 |
0,076 |
H |
0,07572 |
0,07576 |
M |
0,07596 |
0,07600 |
E |
0,075968 |
0,075972 |
T |
0,0759712 |
0,075972 |
I |
0,07597152 |
0,07597168 |
C |
0,075971536 |
0,075971552 |
The output number is 0,075971536.
Decoding
The decoding is performed by a inverse procedure. The algorithm for decoding the incoming number looks like this:
get encoded number
do
find symbol whose range straddles the encoded number
output the symbol
range = symbo.LowValue – symbol.HighValue
substracti symbol.LowValue from encoded number
divide encoded number by range
until no more symbols
Note! The algorithm ignores the problem of how to decide when there are no more symbols to decode. See next chapter!
Problems
Ther are two ways how to decide when the decoder stops. This can be handled by encoding a special symbol EOF that occures only once in the file, or carrying the files length with the encoded message. The both ways add some extra bits.
The output from encoder is a very long floating point number. It seems like that we need a special standard for tha arithmetic coding. But, the arithmetic coding is best accomplished using standart integer math. It is called an incremental transmission sheme. The fixed size integer state variables receive new bits in at the low end and shift them out the high end. The output is a number al long as it has to be.
There are three way/models:
Copyright
The particular variant of arithmetic coding specified by the JPEG standard is subject to patents owned by IBM, AT&T, and Mitsubishi. You cannot legally use JPEG arithmetic coding unless you obtain licenses from these companies. Patent law's "experimental use" exception allows people to test a patented method in the context of scientific research, but any commercial or routine personal use is infringement.
HomePage -
Basic Facts -Commercial -
Algorithms -
Hardware -
FAQ - Glossary
Maintained and Copyrighted © 1997-2000 by DataCompression Reference Center