readme.txt
----------

The files here are from an information theory course I taught in 2017.

There are two data files to be analyzed, stream1.txt and stream2.txt, both
ASCII files of 0s and 1s but with different statistics. Each file is analyzed
by grouping its bits into symbols of n consecutive bits, for n = 1, 2, 3, ...

The tasks performed are then

(a) Compute theoretically how much each file can be compressed, using the
    notion of "information entropy" calculated from conditional probabilities.
    For example, the conditional probability P(1|01) is the probability of
    getting a 1 after seeing 01. (A sketch of this calculation is at the end
    of this file.)

(b) Find a Huffman code that corresponds to the symbols in the file, and from
    that calculate the expected compression ratio.

This doesn't actually compress the files - it just uses some ideas from
probability to calculate the expected compression.

--- huffman --

The approach is to break each file up into non-overlapping symbols of fixed
length, for example thinking of 00110001 as 00 11 00 01, or 'abac' where
a=00, b=11, c=01. Then, for each symbol length, the analysis

 * finds the symbol probabilities (not conditional),
 * calculates an entropy,
 * builds a Huffman code,
 * and from the average length of the Huffman codewords finds a compression
   ratio for that encoding (ignoring the cost of writing out the Huffman
   table in a real compression implementation).

If, for example, it turns out that the Huffman codewords and their
probabilities are

    code   length   probability
    0        1        0.5
    10       2        0.3
    110      3        0.1
    111      3        0.1

then the average codeword length (in bits per symbol) would be

    0.5 * 1 + 0.3 * 2 + 0.1 * 3 + 0.1 * 3  =  1.7 bits per symbol

which lets us find how much the file could be compressed: if these are the
four 2-bit symbols, the expected compression ratio is 1.7 / 2 = 0.85.
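
--- sketch of (a): conditional entropy --

This is a minimal Python sketch (not the course code) of the kind of
calculation part (a) involves: estimate the conditional probabilities
P(next bit | previous k bits) from the stream and compute the corresponding
conditional entropy in bits per bit. The function and variable names are
mine, and the context length k is just a parameter you would vary.

# Sketch: estimate H(next bit | previous k bits) for a stream of 0s and 1s,
# using conditional probabilities such as P(1|01) for k = 2.
from collections import Counter
from math import log2

def conditional_entropy(bits, k):
    """Estimate the conditional entropy, in bits, from a string of '0'/'1'."""
    context_counts = Counter()   # counts of each k-bit context
    joint_counts = Counter()     # counts of (context, next bit) pairs
    for i in range(len(bits) - k):
        context = bits[i:i + k]
        nxt = bits[i + k]
        context_counts[context] += 1
        joint_counts[(context, nxt)] += 1
    total = sum(context_counts.values())
    h = 0.0
    for (context, nxt), n in joint_counts.items():
        p_joint = n / total                    # P(context, next bit)
        p_cond = n / context_counts[context]   # P(next bit | context)
        h -= p_joint * log2(p_cond)
    return h

if __name__ == "__main__":
    # e.g. stream1.txt is read as one long string of '0' and '1' characters
    with open("stream1.txt") as f:
        bits = "".join(c for c in f.read() if c in "01")
    for k in range(1, 5):
        print(k, conditional_entropy(bits, k))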
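
--- sketch of (b): huffman coding of fixed-length symbols --

And a sketch, under the same caveats, of the part (b) calculation for one
fixed symbol length: group the bits into n-bit symbols, find their
probabilities and entropy, build a Huffman code with a heap, and compare the
average codeword length to the raw n bits per symbol. A real implementation
would also need the one-symbol corner case and the cost of storing the code
table, both ignored here.

# Sketch: Huffman coding of non-overlapping n-bit symbols.
import heapq
from collections import Counter
from itertools import count
from math import log2

def symbol_probs(bits, n):
    """Probabilities of non-overlapping n-bit symbols, e.g. 00110001 -> 00 11 00 01."""
    symbols = [bits[i:i + n] for i in range(0, len(bits) - n + 1, n)]
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def huffman_code(probs):
    """Return {symbol: codeword} via the usual merge-two-smallest algorithm."""
    tie = count()   # tie-breaker so heapq never has to compare the dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

if __name__ == "__main__":
    with open("stream1.txt") as f:
        bits = "".join(c for c in f.read() if c in "01")
    for n in range(1, 5):
        probs = symbol_probs(bits, n)
        entropy = -sum(p * log2(p) for p in probs.values())
        code = huffman_code(probs)
        avg_len = sum(probs[s] * len(code[s]) for s in probs)
        print(f"n={n}  entropy={entropy:.3f} bits/symbol  "
              f"avg codeword={avg_len:.3f}  ratio={avg_len / n:.3f}")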