readme.txt
----------

The files here are from an information theory course I taught in 2017.

There are two data files to be analyzed, stream1.txt and stream2.txt, both
ASCII files of 0s and 1s but with different statistics. Each file is analyzed
by grouping its bits into symbols of n consecutive bits, for n = 1, 2, 3, ...

The tasks performed are then

(a) Compute theoretically how much each file can be compressed, using the
    notion of "information entropy" calculated from conditional probabilities.
    For example, the conditional probability P(1|01) is the probability of
    getting a 1 after seeing 01. (A sketch of this calculation is at the end
    of this file.)

(b) Find a Huffman code that corresponds to the symbols in the file, and from
    that calculate the expected compression ratio.

This doesn't actually compress the files - it just uses some ideas from
probability to calculate the expected compression.

--- huffman --

The approach is to break each file up into non-overlapping symbols of fixed
length, for example thinking of 00110001 as 00 11 00 01, or 'abac' where
a=00, b=11, c=01. Then, for each symbol length, the analysis

 * finds the symbol probabilities (not conditional),
 * calculates an entropy,
 * builds a Huffman code,
 * and from the average length of the Huffman codewords finds a compression
   ratio for that encoding (ignoring the cost of writing out the Huffman
   table in a real compression implementation).

If, for example, it turns out that the Huffman codewords and their
probabilities are

    code   length   probability
    0        1        0.5
    10       2        0.3
    110      3        0.1
    111      3        0.1

then the average codeword length (in bits per symbol) would be

    0.5 * 1 + 0.3 * 2 + 0.1 * 3 + 0.1 * 3  =  1.7 bits per symbol

which lets us find how much the file could be compressed: if these are the
four 2-bit symbols, the expected compression ratio is 1.7 / 2 = 0.85.
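
--- sketch of (a): conditional entropy --

This is a minimal Python sketch (not the course code) of the kind of
calculation part (a) involves: estimate the conditional probabilities
P(next bit | previous k bits) from the stream and compute the corresponding
conditional entropy in bits per bit. The function and variable names are
mine, and the context length k is just a parameter you would vary.

# Sketch: estimate H(next bit | previous k bits) for a stream of 0s and 1s,
# using conditional probabilities such as P(1|01) for k = 2.
from collections import Counter
from math import log2

def conditional_entropy(bits, k):
    """Estimate the conditional entropy, in bits, from a string of '0'/'1'."""
    context_counts = Counter()   # counts of each k-bit context
    joint_counts = Counter()     # counts of (context, next bit) pairs
    for i in range(len(bits) - k):
        context = bits[i:i + k]
        nxt = bits[i + k]
        context_counts[context] += 1
        joint_counts[(context, nxt)] += 1
    total = sum(context_counts.values())
    h = 0.0
    for (context, nxt), n in joint_counts.items():
        p_joint = n / total                    # P(context, next bit)
        p_cond = n / context_counts[context]   # P(next bit | context)
        h -= p_joint * log2(p_cond)
    return h

if __name__ == "__main__":
    # e.g. stream1.txt is read as one long string of '0' and '1' characters
    with open("stream1.txt") as f:
        bits = "".join(c for c in f.read() if c in "01")
    for k in range(1, 5):
        print(k, conditional_entropy(bits, k))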
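
--- sketch of (b): huffman coding of fixed-length symbols --

And a sketch, under the same caveats, of the part (b) calculation for one
fixed symbol length: group the bits into n-bit symbols, find their
probabilities and entropy, build a Huffman code with a heap, and compare the
average codeword length to the raw n bits per symbol. A real implementation
would also need the one-symbol corner case and the cost of storing the code
table, both ignored here.

# Sketch: Huffman coding of non-overlapping n-bit symbols.
import heapq
from collections import Counter
from itertools import count
from math import log2

def symbol_probs(bits, n):
    """Probabilities of non-overlapping n-bit symbols, e.g. 00110001 -> 00 11 00 01."""
    symbols = [bits[i:i + n] for i in range(0, len(bits) - n + 1, n)]
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def huffman_code(probs):
    """Return {symbol: codeword} via the usual merge-two-smallest algorithm."""
    tie = count()   # tie-breaker so heapq never has to compare the dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

if __name__ == "__main__":
    with open("stream1.txt") as f:
        bits = "".join(c for c in f.read() if c in "01")
    for n in range(1, 5):
        probs = symbol_probs(bits, n)
        entropy = -sum(p * log2(p) for p in probs.values())
        code = huffman_code(probs)
        avg_len = sum(probs[s] * len(code[s]) for s in probs)
        print(f"n={n}  entropy={entropy:.3f} bits/symbol  "
              f"avg codeword={avg_len:.3f}  ratio={avg_len / n:.3f}")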