naive bayes
===========

(This is a summary of the method used in chapter 13 of "Data Science from Scratch".)

Suppose we have several types of messages. The typical example is "spam"
and "not spam", which we'll call "ham" as is usually done for this problem.
We would like to classify new messages as one or the other based on
features of the message, say which words appear in it.

To keep things simple for this discussion we'll assume

 * The overall probabilities of the two types of message are
   P(spam) = P(ham) = 0.5.

 * For each word[i] that we decide is interesting, and for each message[j]
   in the training set, we'll just look at whether the word is or isn't
   in that message.

So the training set data looks like this :

    word[0]   word[1]   word[2]   ...   is_spam
    --------------------------------------------
    yes       no        yes       ...   yes        message 0
    no        yes       yes       ...   no         message 1
    etc

Here's a tiny example :

    spam = ['one alpha beta',        # spam training messages
            'two beta',
            'beta alpha',
            'alpha',
            'alpha beta']

    ham = ['one two alpha',          # ham training messages
           'one two',
           'two alpha',
           'beta',
           'one']

    words = ['one', 'two', 'alpha', 'beta']

or as a table :

    one    two    alpha    beta      is_spam
    -----------------------------------------
    yes    no     yes      yes       yes
    no     yes    no       yes       yes
    no     no     yes      yes       yes
    no     no     yes      no        yes
    no     no     yes      yes       yes
    yes    yes    yes      no        no
    yes    yes    no       no        no
    no     yes    yes      no        no
    no     no     no       yes       no
    yes    no     no       no        no

From that data we can find the conditional probabilities for each word
being in (yes) or not in (no) each category of spam and ham messages

    -- conditional probabilities : take 1 --

    P(yes word[i] | spam) = count(spam messages with word[i])    / count(spam)
    P(no  word[i] | spam) = count(spam messages without word[i]) / count(spam)
    P(yes word[i] | ham)  = count(ham messages with word[i])     / count(ham)
    P(no  word[i] | ham)  = count(ham messages without word[i])  / count(ham)

by counting. But since the math gets problematic if any of these are zero,
we instead put in a kludge by assuming that for each of these probabilities
there are an extra k messages with the word, and k messages without, so
that none of these counts are zero. That means an additional 2*k messages
(with and without) in the denominator of each spam and ham formula.

    -- conditional probabilities smoothed : take 2 --             <===== (formula 1)

    P(yes word[i] | spam) = (k + count(spams with word[i]))    / (2*k + count(spam))
    P(no  word[i] | spam) = (k + count(spams without word[i])) / (2*k + count(spam))
    P(yes word[i] | ham)  = (k + count(hams with word[i]))     / (2*k + count(ham))
    P(no  word[i] | ham)  = (k + count(hams without word[i]))  / (2*k + count(ham))

where k is just some small number we choose to keep everything non zero,
like k=1 or even k=0.5. (As long as you don't mind counting half a
message...)

Now, given some test message, we look to see for each word[i] whether it
is or isn't in the message, and then want to know something like

    P(spam | yes word[0], no word[1], yes word[2], ...)

Let's call that big list of yes or no for each word "ws" (for the plural
of w). Then we can use Bayes' theorem to find an expression for what we
want :

    P(spam | ws) * P(ws) = P(ws | spam) * P(spam)

and since, summing over all the types of messages,

    P(ws) = P(ws | spam) * P(spam) + P(ws | ham) * P(ham)

we can re-write this as

    P(spam | ws) = P(ws | spam) * P(spam) / P(ws)

                 =           P(ws | spam) * P(spam)
                   ---------------------------------------------
                   P(ws | spam) * P(spam) + P(ws | ham) * P(ham)

Since we're assuming that P(spam) = P(ham) = 0.5, those terms drop out
leaving

    P(spam | ws) =        P(ws | spam)
                   --------------------------                     <==== (formula 2)
                   P(ws | spam) + P(ws | ham)
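Before going further, here is a minimal sketch (plain Python with made-up
helper names, not the book's code) of how the smoothed probabilities of
formula 1 could be computed for the tiny example above; these are the
numbers we'll need to feed into the numerator and denominator of formula 2.

    # Smoothed conditional probabilities (formula 1) for the tiny example.
    spam = ['one alpha beta', 'two beta', 'beta alpha', 'alpha', 'alpha beta']
    ham  = ['one two alpha', 'one two', 'two alpha', 'beta', 'one']
    words = ['one', 'two', 'alpha', 'beta']

    def word_probabilities(messages, words, k=0.5):
        """Return {word: (P(yes word | class), P(no word | class))} for one class."""
        probs = {}
        n = len(messages)
        for word in words:
            with_word = sum(1 for m in messages if word in m.split())
            without_word = n - with_word
            p_yes = (k + with_word) / (2*k + n)        # formula 1, "yes" case
            p_no  = (k + without_word) / (2*k + n)     # formula 1, "no" case
            probs[word] = (p_yes, p_no)
        return probs

    spam_probs = word_probabilities(spam, words)   # e.g. spam_probs['alpha'] == (0.75, 0.25)
    ham_probs  = word_probabilities(ham, words)

Note that with this smoothing the "yes" and "no" probabilities for a word
still add to 1, since (k + with) + (k + without) = 2*k + count.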
We're almost there. Now for the "naive" step.

Remember that for the probabilities of several things, say events x and y,
we have

    P(x, y) = "probability of both x and y" = P(x) * P(y|x)

and IF x and y are independent and don't depend on each other, this is

    P(x, y) = P(x) * P(y)        IF x and y are independent .

We want to calculate something like

    P(yes word[0], no word[1], yes word[2], ... | spam)

but what we have calculated from our training set is

    P(yes word[0] | spam)
    P(no  word[1] | spam)
    etc

The "naive" step is assuming that all these yes or no probabilities for
each word are independent. They aren't. But we don't have the real values
(things like P(yes word[1] | spam, no word[2], yes word[0], ...)), and this
"naive" version gives us a useful number even if it isn't right. So we
calculate what we want with

    P(ws | spam) = P(yes word[0], no word[1], yes word[2], ... | spam)
                 = P(yes word[0] | spam) * P(no word[1] | spam) * P(yes word[2] | spam) * ...

That right hand side is a product of many terms, one for each word we
think is interesting. And since many of these numbers are small, the whole
thing may be tiny - floating underflow tiny. We're going to divide by
another tiny number to get the answer, so the result can still make
sense ... but we need to make sure we don't round off to zero accidentally.
Our last math trick is to use the fact that

    a * b * c * d = exp(log(a) + log(b) + log(c) + log(d))

or

    P(ws | spam) = exp( sum_i( log( P(yes/no word[i] | spam) )))  <==== (formula 3)

In other words, take the log of the tiny numbers, add them, and then take
the exponential. This avoids any floating point underflow issues.

TO DO : Given a test message "alpha beta one two", what is the probability
that this is spam?

    (a) Find all the conditional probabilities, namely
        P(yes/no word[i] | spam/ham), i.e. formula 1.
    (b) Decide which of those we need for the terms in formula 2.
    (c) Do the calculation for those terms using formula 3.
    (d) Find what we want, i.e. P(spam | test message).
    (e) Profit.
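If you want to check your arithmetic for steps (b) through (d), here is a
sketch that continues the word_probabilities example above (again,
hypothetical names, not the book's implementation), plugging the formula 1
probabilities into formulas 2 and 3.

    import math

    def log_p_ws(message, probs):
        """Formula 3: log P(ws | class), summing log P(yes/no word[i] | class)."""
        present = set(message.split())
        return sum(math.log(p_yes if word in present else p_no)
                   for word, (p_yes, p_no) in probs.items())

    def p_spam_given_message(message, spam_probs, ham_probs):
        """Formula 2, with the products of formula 3 done in log space."""
        p_ws_spam = math.exp(log_p_ws(message, spam_probs))
        p_ws_ham  = math.exp(log_p_ws(message, ham_probs))
        return p_ws_spam / (p_ws_spam + p_ws_ham)

    # p_spam_given_message('alpha beta one two', spam_probs, ham_probs)
    # covers steps (b)-(d); the arithmetic (and the profit) is left to you.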