This directory has a few perl programs which look at the information
content (Shannon's entropy) of the text of Darwin's Origin of Species.
(I got the text in xml from Project Gutenberg, and stripped out the xml
tags.)

The files are

  ./firstorder.pl  > firstorder.out     # first-order char frequencies
  ./secondorder.pl > secondorder.out    # second-order char P(b|a)
  ./firstword.pl   > firstword.out      # first-order word entropy

For simplicity's sake, I use only 27 characters: the 26 lower case
letters and the space, which is assumed to separate all words.  This is
similar to the examples in Shannon's 1948 paper.

The formulas are

  H = entropy per symbol
    = - sum( p_i * log2(p_i) )          over symbols with probability p_i
    = - sum( P(Ab) * log2( P(b|A) ) )   over precursors A followed by symbol b

where

  P(b|A) = probability of b given the preceding A
  P(Ab)  = probability of the string Ab
  A      = a sequence occurring in the text to be analyzed

By taking varying lengths for A we can analyze the text to different
"orders":

  0th order => all symbols assumed equally probable, which gives Hmax
  1st order => p_i = probability of individual symbols
  2nd order => A = a single symbol, and we analyze pairs ab
  etc.

With these in hand, redundancy = 1 - H/Hmax.

Results with characters (901931 total chars):

  0th order : H = log2(27) = 4.755
  1st order : H = 4.1   =>  redundancy = 14 %
  2nd order : H = 3.3   =>  redundancy = 31 %

Results with words (154150 total words):

  1st order : H = 17.2  =>  redundancy = 46.8 %

The "compress" program reduces the file from 912k to 340k, i.e. to 37%
of its original size.  (This includes the punctuation and
capitalization.  The shell commands were "cp darwin.txt d.txt; compress
d.txt".)

A large part of this compression comes from the redundancy of the ascii
character set, which uses too many bits to encode the few letters
needed.  This "ascii redundancy" is in addition to that analyzed above.
Each ascii character is encoded in 8 bits, which could allow up to
2^8 = 256 distinct characters.  Thus to encode only 27 characters we
would have a significance per bit of log(27)/log(256) = 0.59.

Multiplying that by the 0.468 theoretical compression allowed for the
bare words and spaces suggests that we could compress this text
(without the punctuation) into 0.59 * 0.46 = 0.27 of its original size.
This is consistent with the "compress" results, especially since our
analysis didn't include the punctuation characters, and was done on a
very limited text sample.

So there you are.

Here's the next challenge: given an order N (say N = 4 or 5 or so),
generalize the secondorder.pl routine to calculate the entropy to that
order.  *Don't* set all the hash entries to zero as I do; just create
entries as tuples turn up, since the number of distinct N-tuples that
actually occur is at most about the number of characters in the text,
which is large but manageable.  Wrapping the text at the end as if it
"preceded" the beginning can make things a bit more symmetric, but
probably isn't worth the trouble.  Wrapping each line around to the
next may be worthwhile; I'm not sure.  Then find the limit of H(n) as
n increases...
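
For reference, the whole first-order calculation fits in a dozen lines
or so.  The sketch below is not the actual firstorder.pl (whose details
aren't reproduced here); it assumes the stripped text is sitting in
darwin.txt, the file used in the compress test above, and it folds
anything that isn't a lower-case letter into the single space symbol.

  #!/usr/bin/perl
  # Sketch of the first-order character entropy: H = - sum p_i * log2(p_i).
  # Assumes the stripped text is in darwin.txt (an assumption, not part
  # of the original scripts).
  use strict;
  use warnings;

  my %count;
  my $total = 0;
  open my $fh, '<', 'darwin.txt' or die "can't open darwin.txt: $!";
  while (my $line = <$fh>) {
      $line = lc $line;
      $line =~ s/[^a-z]+/ /g;            # punctuation, digits, newline -> space
      for my $c (split //, $line) {
          $count{$c}++;                  # tally each of the 27 symbols
          $total++;
      }
  }
  close $fh;

  my $H = 0;
  for my $c (keys %count) {
      my $p = $count{$c} / $total;
      $H -= $p * log($p) / log(2);       # - sum p_i * log2(p_i)
  }
  printf "first order : H = %.3f bits/char (%d chars)\n", $H, $total;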
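
And here is one way the order-N generalization in the challenge might
be sketched.  The precursor A is just the N-1 characters in front of
each character b, and hash entries are created only for tuples that
actually turn up, per the advice above about not zeroing everything.
Again this assumes darwin.txt and the same 27-symbol reduction; treat
it as a starting point rather than a finished answer.

  #!/usr/bin/perl
  # Sketch of an order-N entropy estimate:
  #   H = - sum over Ab of P(Ab) * log2( P(b|A) )
  # where A is the (N-1)-character string preceding each character b.
  # Not a drop-in replacement for secondorder.pl; darwin.txt is assumed.
  use strict;
  use warnings;

  my $N = shift @ARGV || 2;              # order: 2 = pairs, 3 = triples, ...

  open my $fh, '<', 'darwin.txt' or die "can't open darwin.txt: $!";
  my $text = do { local $/; <$fh> };     # slurp the whole file
  close $fh;
  $text = lc $text;
  $text =~ s/[^a-z]+/ /g;                # reduce to the 27-symbol alphabet

  my (%tuple, %prefix);                  # counts of N-tuples Ab and precursors A
  my $total = 0;
  for my $i (0 .. length($text) - $N) {
      my $Ab = substr($text, $i, $N);
      $tuple{$Ab}++;
      $prefix{ substr($Ab, 0, $N - 1) }++;
      $total++;
  }

  my $H = 0;
  for my $Ab (keys %tuple) {
      my $A    = substr($Ab, 0, $N - 1);
      my $pAb  = $tuple{$Ab} / $total;        # P(Ab)
      my $cond = $tuple{$Ab} / $prefix{$A};   # P(b|A)
      $H -= $pAb * log($cond) / log(2);
  }
  printf "order %d : H = %.3f bits/char, %d distinct tuples seen\n",
         $N, $H, scalar keys %tuple;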
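
Run with N = 1 and N = 2, something along these lines should roughly
reproduce the 4.1 and 3.3 figures above, and H(N) should keep falling
as N grows.  One caveat when chasing the limit: past N of about 5 or 6
the estimate starts dropping for the wrong reason, because most long
tuples occur only once or twice in a 900k-character sample, which makes
the conditional probabilities look more deterministic than the language
really is.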