This directory has a few perl programs which look at the information
content (Shannon's entropy) of the text of Darwin's Origin of Species.
(I got the text in xml from Project Gutenberg, and stripped out the xml
tags.)

The files are

  ./firstorder.pl  > firstorder.out     # first-order char frequencies
  ./secondorder.pl > secondorder.out    # second-order char P(b|a)
  ./firstword.pl   > firstword.out      # first-order word entropy

For simplicity's sake, I use only 27 characters: the 26 lower case
letters and the space, which is assumed to separate all words.  This is
similar to the examples in Shannon's 1948 paper.

The formulas are

  H = entropy per symbol
    = - sum( p_i * log2(p_i) )          over symbols with probability p_i
    = - sum( P(Ab) * log2( P(b|A) ) )   over precursors A followed by symbol b

where

  P(b|A) = probability of b given the preceding A
  P(Ab)  = probability of the string Ab
  A      = a sequence occurring in the text to be analyzed

By taking varying lengths for A we can analyze the text to different
"orders":

  0th order => all symbols assumed equally probable, which gives Hmax
  1st order => p_i = probability of individual symbols
  2nd order => A = a single symbol, and we analyze pairs ab
  etc.

With these in hand, redundancy = 1 - H/Hmax.

Results with characters (901931 total chars):

  0th order : H = log2(27) = 4.755
  1st order : H = 4.1   =>  redundancy = 14 %
  2nd order : H = 3.3   =>  redundancy = 31 %

Results with words (154150 total words):

  1st order : H = 17.2  =>  redundancy = 46.8 %

The "compress" program reduces the file from 912k to 340k, i.e. to 37%
of its original size.  (This includes the punctuation and
capitalization.  The shell commands were "cp darwin.txt d.txt; compress
d.txt".)

A large part of this compression comes from the redundancy of the ascii
character set, which uses too many bits to encode the few letters
needed.  This "ascii redundancy" is in addition to that analyzed above.
Each ascii character is encoded in 8 bits, which could allow up to
2^8 = 256 distinct characters.  Thus to encode only 27 characters we
would have a significance per bit of log(27)/log(256) = 0.59.

Multiplying that by the 0.468 theoretical compression allowed for the
bare words and spaces suggests that we could compress this text
(without the punctuation) into 0.59 * 0.46 = 0.27 of its original size.
This is consistent with the "compress" results, especially since our
analysis didn't include the punctuation characters, and was done on a
very limited text sample.

So there you are.

Here's the next challenge: given an order N (say N = 4 or 5 or so),
generalize the secondorder.pl routine to calculate the entropy to that
order.  *Don't* set all the hash entries to zero as I do; just create
entries as tuples turn up, since the number of distinct N-tuples that
actually occur is at most about the number of characters in the text,
which is large but manageable.  Wrapping the text at the end as if it
"preceded" the beginning can make things a bit more symmetric, but
probably isn't worth the trouble.  Wrapping each line around to the
next may be worthwhile; I'm not sure.  Then find the limit of H(n) as
n increases...
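
For reference, the whole first-order calculation fits in a dozen lines
or so.  The sketch below is not the actual firstorder.pl (whose details
aren't reproduced here); it assumes the stripped text is sitting in
darwin.txt, the file used in the compress test above, and it folds
anything that isn't a lower-case letter into the single space symbol.

  #!/usr/bin/perl
  # Sketch of the first-order character entropy: H = - sum p_i * log2(p_i).
  # Assumes the stripped text is in darwin.txt (an assumption, not part
  # of the original scripts).
  use strict;
  use warnings;

  my %count;
  my $total = 0;
  open my $fh, '<', 'darwin.txt' or die "can't open darwin.txt: $!";
  while (my $line = <$fh>) {
      $line = lc $line;
      $line =~ s/[^a-z]+/ /g;            # punctuation, digits, newline -> space
      for my $c (split //, $line) {
          $count{$c}++;                  # tally each of the 27 symbols
          $total++;
      }
  }
  close $fh;

  my $H = 0;
  for my $c (keys %count) {
      my $p = $count{$c} / $total;
      $H -= $p * log($p) / log(2);       # - sum p_i * log2(p_i)
  }
  printf "first order : H = %.3f bits/char (%d chars)\n", $H, $total;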
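
And here is one way the order-N generalization in the challenge might
be sketched.  The precursor A is just the N-1 characters in front of
each character b, and hash entries are created only for tuples that
actually turn up, per the advice above about not zeroing everything.
Again this assumes darwin.txt and the same 27-symbol reduction; treat
it as a starting point rather than a finished answer.

  #!/usr/bin/perl
  # Sketch of an order-N entropy estimate:
  #   H = - sum over Ab of P(Ab) * log2( P(b|A) )
  # where A is the (N-1)-character string preceding each character b.
  # Not a drop-in replacement for secondorder.pl; darwin.txt is assumed.
  use strict;
  use warnings;

  my $N = shift @ARGV || 2;              # order: 2 = pairs, 3 = triples, ...

  open my $fh, '<', 'darwin.txt' or die "can't open darwin.txt: $!";
  my $text = do { local $/; <$fh> };     # slurp the whole file
  close $fh;
  $text = lc $text;
  $text =~ s/[^a-z]+/ /g;                # reduce to the 27-symbol alphabet

  my (%tuple, %prefix);                  # counts of N-tuples Ab and precursors A
  my $total = 0;
  for my $i (0 .. length($text) - $N) {
      my $Ab = substr($text, $i, $N);
      $tuple{$Ab}++;
      $prefix{ substr($Ab, 0, $N - 1) }++;
      $total++;
  }

  my $H = 0;
  for my $Ab (keys %tuple) {
      my $A    = substr($Ab, 0, $N - 1);
      my $pAb  = $tuple{$Ab} / $total;        # P(Ab)
      my $cond = $tuple{$Ab} / $prefix{$A};   # P(b|A)
      $H -= $pAb * log($cond) / log(2);
  }
  printf "order %d : H = %.3f bits/char, %d distinct tuples seen\n",
         $N, $H, scalar keys %tuple;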
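
Run with N = 1 and N = 2, something along these lines should roughly
reproduce the 4.1 and 3.3 figures above, and H(N) should keep falling
as N grows.  One caveat when chasing the limit: past N of about 5 or 6
the estimate starts dropping for the wrong reason, because most long
tuples occur only once or twice in a 900k-character sample, which makes
the conditional probabilities look more deterministic than the language
really is.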