Aithmetic coding

Arithmetic coding

Introduction

The main problem of lossless compression is to decompose a data set into a sequence of events. Those events are to be encoded using as few bits as possible.The idea is to give a shorter codewords to more probable events. The data set could be compressed whenever some events are more probable then other. There are set of distinct events e₁, e₂, e₃, .., e_nwich togather are making probability distribution P. Shannon proved that the smallest possible expected numbers of bits needed to encode an event is entrophy of P (H(P)).

p(e_k) is a probability of event e_k. The Arithmetic coding with accurate probabily of events gives an optimal compression. Here comes two main problems that have to be solved. First problem is a pure implemetation of arithmetic coding. Second problem is statistical modeling used in accurately predicting probablities of events. The coder uses those events from statistical modeling.

Algorithm

With given probabilites of events, the algorithm works in these three steps.

The coder starts with a “current interval” [H, L) set to [0,1).
For each events int the input filed/file, the coder performs an (a) and (b) step.

the coder subdivides current interval into subintervals, one for each event.

The size of a subinterval is proportion to to the probablity that the event will be the next event in the field/file.the code selets the subinterval corresponding to the event that actually occurs next and makes it the new current interval

The output is the last current interval. Bether to say, the output are so many bits that accurately distinguish the last current interval form all other final intervals.

The size of final interval is equal to the product of the probabilties of all individual events that occurs in the input filed/file.

The description of algorithm in pseudocode for implementation.

set Low to 0

set High to 1

while there are input symbols do

take a symbol

CodeRange = High – Low

High = Low + CodeRange * HighRange(symbol)

Low = Low + CodeRange * LowRange(symbol)

end of while

output Low

The low and high parts of subintervals are subintervals of inteval [0,1). The size of each subinterval depends on the probability that this symbol occurs in the input field/file.

Example 1.

Table 1.

event	probability of event
a1	2/3
b1	1/3
a2	1
b2	1
a3	3/5
b3	2/5

The input field is given by these symbols: b1a2b3. The coding is given in table 2.

Table 2.

Action	Subinterval
begin	[0,1)
subdivide current interval into new intervals of 2/3 and 1/3 of current interval	[0, 2/3), [2/3, 1)
new current interval (event is b1):	[2/3, 1)
subdivide current interval into new intervals of 1 and 1 of current interval	[2/3, 5/6), [5/6, 1)
new current interval (event is a2):	[2/3, 5/6)
subdivide current interval into new intervals of 1 and 1 of current interval	[2/3, 23/30), [23/30, 5/6)
new current interval (event is b3):	[23/30, 5/6)
output 11001 becouse that is the smallest binary number that lies within [23/30, 5/6)

Picture 1.

The final interval is [23/30, 5/6). The size of this event is 1/15. That is the product of the sizes of probabilities events in the input filed

Example 2.

The message to be encoded is “ARITHMETIC”. There are ten symbols in the message. The probability distibution is given in the table 3.

Table 3.

Symbol	Subinterval
A	1/10
C	1/10
E	1/10
H	1/10
I	2/10
M	1/10
R	1/10
T	2/10

Each character is assigned the portion of the starting inteval [0,1). The size of inteval corresponds to symbol’s probability of appearance.

Table 4.

Symbol	Subinterval
A	0,00 – 0,10
C	0,10 – 0,20
E	0,20 – 0,30
H	0,30 – 0,40
I	0,40 – 0,60
M	0,60 – 0,70
R	0,70 – 0,80
T	0,80 – 1,00

The coding is performed in the table 5. Note, the most significant portion of a coded message belongs to the first symbol to be encoded. In this case, the first symbol is “A” which owns range [0, 0.1).

Table 5.

Symbol	LowRange	HighRange
	0,0	1,0
A	0,0	0,1
R	0,07	0,08
I	0,074	0,076
T	0,0756	0,076
H	0,07572	0,07576
M	0,07596	0,07600
E	0,075968	0,075972
T	0,0759712	0,075972
I	0,07597152	0,07597168
C	0,075971536	0,075971552

The output number is 0,075971536.

Decoding

The decoding is performed by a inverse procedure. The algorithm for decoding the incoming number looks like this:

get encoded number

do

find symbol whose range straddles the encoded number

output the symbol

range = symbo.LowValue – symbol.HighValue

substracti symbol.LowValue from encoded number

divide encoded number by range

until no more symbols

Note! The algorithm ignores the problem of how to decide when there are no more symbols to decode. See next chapter!

Problems

Ther are two ways how to decide when the decoder stops. This can be handled by encoding a special symbol EOF that occures only once in the file, or carrying the files length with the encoded message. The both ways add some extra bits.

The output from encoder is a very long floating point number. It seems like that we need a special standard for tha arithmetic coding. But, the arithmetic coding is best accomplished using standart integer math. It is called an incremental transmission sheme. The fixed size integer state variables receive new bits in at the low end and shift them out the high end. The output is a number al long as it has to be.

There are three way/models:

fixed model – using fixed probabilities for all files; very low compression rate
semi-adaptive model – using a preminiary pass of the input file to gather statistics; second pass encodes the file; the table with probabilities has to be carried to the output file
adaptive model – dynamically estimating the probability of each event based on all events that precede it

Copyright

The particular variant of arithmetic coding specified by the JPEG standard is subject to patents owned by IBM, AT&T, and Mitsubishi. You cannot legally use JPEG arithmetic coding unless you obtain licenses from these companies. Patent law's "experimental use" exception allows people to test a patented method in the context of scientific research, but any commercial or routine personal use is infringement.

Books

References