naive bayes
===========

(This is a summary of the method used in chapter 13 of "Data Science from Scratch".)

Suppose we have several types of messages. The typical example is "spam"
and "not spam", which we'll call "ham" as is usually done for this problem.
We would like to classify new messages as one or the other based on
features of the message, say which words appear in it.

To keep things simple for this discussion we'll assume

 * The overall probabilities of the two types of message are
   P(spam) = P(ham) = 0.5.

 * For each word[i] that we decide is interesting, and for each message[j]
   in the training set, we'll just look at whether the word is or isn't
   in that message.

So the training set data looks like this :

    word[0]   word[1]   word[2]   ...   is_spam
    --------------------------------------------
    yes       no        yes       ...   yes        message 0
    no        yes       yes       ...   no         message 1
    etc

Here's a tiny example :

    spam = ['one alpha beta',        # spam training messages
            'two beta',
            'beta alpha',
            'alpha',
            'alpha beta']

    ham = ['one two alpha',          # ham training messages
           'one two',
           'two alpha',
           'beta',
           'one']

    words = ['one', 'two', 'alpha', 'beta']

or as a table :

    one    two    alpha    beta      is_spam
    -----------------------------------------
    yes    no     yes      yes       yes
    no     yes    no       yes       yes
    no     no     yes      yes       yes
    no     no     yes      no        yes
    no     no     yes      yes       yes
    yes    yes    yes      no        no
    yes    yes    no       no        no
    no     yes    yes      no        no
    no     no     no       yes       no
    yes    no     no       no        no

From that data we can find the conditional probabilities for each word
being in (yes) or not in (no) each category of spam and ham messages

    -- conditional probabilities : take 1 --

    P(yes word[i] | spam) = count(spam messages with word[i])    / count(spam)
    P(no  word[i] | spam) = count(spam messages without word[i]) / count(spam)
    P(yes word[i] | ham)  = count(ham messages with word[i])     / count(ham)
    P(no  word[i] | ham)  = count(ham messages without word[i])  / count(ham)

by counting. But since the math gets problematic if any of these are zero,
we instead put in a kludge by assuming that for each of these probabilities
there are an extra k messages with the word, and k messages without, so
that none of these counts are zero. That means an additional 2*k messages
(with and without) in the denominator of each spam and ham formula.

    -- conditional probabilities smoothed : take 2 --             <===== (formula 1)

    P(yes word[i] | spam) = (k + count(spams with word[i]))    / (2*k + count(spam))
    P(no  word[i] | spam) = (k + count(spams without word[i])) / (2*k + count(spam))
    P(yes word[i] | ham)  = (k + count(hams with word[i]))     / (2*k + count(ham))
    P(no  word[i] | ham)  = (k + count(hams without word[i]))  / (2*k + count(ham))

where k is just some small number we choose to keep everything non zero,
like k=1 or even k=0.5. (As long as you don't mind counting half a
message...)

Now, given some test message, we look to see for each word[i] whether it
is or isn't in the message, and then want to know something like

    P(spam | yes word[0], no word[1], yes word[2], ...)

Let's call that big list of yes or no for each word "ws" (for the plural
of w). Then we can use Bayes' theorem to find an expression for what we
want :

    P(spam | ws) * P(ws) = P(ws | spam) * P(spam)

and since, summing over all the types of messages,

    P(ws) = P(ws | spam) * P(spam) + P(ws | ham) * P(ham)

we can re-write this as

    P(spam | ws) = P(ws | spam) * P(spam) / P(ws)

                 =           P(ws | spam) * P(spam)
                   ---------------------------------------------
                   P(ws | spam) * P(spam) + P(ws | ham) * P(ham)

Since we're assuming that P(spam) = P(ham) = 0.5, those terms drop out
leaving

    P(spam | ws) =        P(ws | spam)
                   --------------------------                     <==== (formula 2)
                   P(ws | spam) + P(ws | ham)
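Before going further, here is a minimal sketch (plain Python with made-up
helper names, not the book's code) of how the smoothed probabilities of
formula 1 could be computed for the tiny example above; these are the
numbers we'll need to feed into the numerator and denominator of formula 2.

    # Smoothed conditional probabilities (formula 1) for the tiny example.
    spam = ['one alpha beta', 'two beta', 'beta alpha', 'alpha', 'alpha beta']
    ham  = ['one two alpha', 'one two', 'two alpha', 'beta', 'one']
    words = ['one', 'two', 'alpha', 'beta']

    def word_probabilities(messages, words, k=0.5):
        """Return {word: (P(yes word | class), P(no word | class))} for one class."""
        probs = {}
        n = len(messages)
        for word in words:
            with_word = sum(1 for m in messages if word in m.split())
            without_word = n - with_word
            p_yes = (k + with_word) / (2*k + n)        # formula 1, "yes" case
            p_no  = (k + without_word) / (2*k + n)     # formula 1, "no" case
            probs[word] = (p_yes, p_no)
        return probs

    spam_probs = word_probabilities(spam, words)   # e.g. spam_probs['alpha'] == (0.75, 0.25)
    ham_probs  = word_probabilities(ham, words)

Note that with this smoothing the "yes" and "no" probabilities for a word
still add to 1, since (k + with) + (k + without) = 2*k + count.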
We're almost there. Now for the "naive" step.

Remember that for the probabilities of several things, say events x and y,
we have

    P(x, y) = "probability of both x and y" = P(x) * P(y|x)

and IF x and y are independent and don't depend on each other, this is

    P(x, y) = P(x) * P(y)        IF x and y are independent .

We want to calculate something like

    P(yes word[0], no word[1], yes word[2], ... | spam)

but what we have calculated from our training set is

    P(yes word[0] | spam)
    P(no  word[1] | spam)
    etc

The "naive" step is assuming that all these yes or no probabilities for
each word are independent. They aren't. But we don't have the real values
(things like P(yes word[1] | spam, no word[2], yes word[0], ...)), and this
"naive" version gives us a useful number even if it isn't right. So we
calculate what we want with

    P(ws | spam) = P(yes word[0], no word[1], yes word[2], ... | spam)
                 = P(yes word[0] | spam) * P(no word[1] | spam) * P(yes word[2] | spam) * ...

That right hand side is a product of many terms, one for each word we
think is interesting. And since many of these numbers are small, the whole
thing may be tiny - floating underflow tiny. We're going to divide by
another tiny number to get the answer, so the result can still make
sense ... but we need to make sure we don't round off to zero accidentally.
Our last math trick is to use the fact that

    a * b * c * d = exp(log(a) + log(b) + log(c) + log(d))

or

    P(ws | spam) = exp( sum_i( log( P(yes/no word[i] | spam) )))  <==== (formula 3)

In other words, take the log of the tiny numbers, add them, and then take
the exponential. This avoids any floating point underflow issues.

TO DO : Given a test message "alpha beta one two", what is the probability
that this is spam?

    (a) Find all the conditional probabilities, namely
        P(yes/no word[i] | spam/ham), i.e. formula 1.
    (b) Decide which of those we need for the terms in formula 2.
    (c) Do the calculation for those terms using formula 3.
    (d) Find what we want, i.e. P(spam | test message).
    (e) Profit.
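If you want to check your arithmetic for steps (b) through (d), here is a
sketch that continues the word_probabilities example above (again,
hypothetical names, not the book's implementation), plugging the formula 1
probabilities into formulas 2 and 3.

    import math

    def log_p_ws(message, probs):
        """Formula 3: log P(ws | class), summing log P(yes/no word[i] | class)."""
        present = set(message.split())
        return sum(math.log(p_yes if word in present else p_no)
                   for word, (p_yes, p_no) in probs.items())

    def p_spam_given_message(message, spam_probs, ham_probs):
        """Formula 2, with the products of formula 3 done in log space."""
        p_ws_spam = math.exp(log_p_ws(message, spam_probs))
        p_ws_ham  = math.exp(log_p_ws(message, ham_probs))
        return p_ws_spam / (p_ws_spam + p_ws_ham)

    # p_spam_given_message('alpha beta one two', spam_probs, ham_probs)
    # covers steps (b)-(d); the arithmetic (and the profit) is left to you.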