Bayes Theorem, Hidden Markov Models, Machine Learning, Particle Filters, and all that.

Jim Mahoney | cs.marlboro.edu | Oct 2016 | MIT License

Resources :

 * https://en.wikipedia.org/wiki/Conditional_probability
 * https://en.wikipedia.org/wiki/Bayes'_theorem
 * https://en.wikipedia.org/wiki/Hidden_Markov_model
 * https://en.wikipedia.org/wiki/Particle_filter
 * https://en.wikipedia.org/wiki/Recursive_Bayesian_estimation
 * http://nbviewer.jupyter.org/github/rlabbe/Kalman-and-Bayesian-Filters-in-Python/blob/master/table_of_contents.ipynb

This is an attempt at a simple example that illustrates how these ideas tie together. The example itself is inspired by the discussion in rlabbe's jupyter notebook.

Suppose you are trying to figure out how much you weigh (your mass) by measuring yourself on a bathroom scale (bathscale) each day. However, your bathscale isn't all that accurate. In fact, it's terrible.

In what follows, I'm using the words (mass, bathscale) rather than (weight, scale) because for problems like these those words can mean something else, namely a weighting factor and a multiplicative scaling factor.

Let's start with a simplified discrete version of the problem: we'll only allow your possible masses to be (150lb, 155lb, 160lb). We'll also do some experiments ahead of time to learn how the bathscale behaves, by putting known masses (perhaps from a gym) on the bathscale and using repeated trials to measure its behavior, which we will describe with conditional probabilities P(bathscale|mass), where the notation P(x|y) means the probability of x given y.
 P(bathscale=150lb | mass=150lb) = 0.6
 P(bathscale=155lb | mass=150lb) = 0.4

 P(bathscale=150lb | mass=155lb) = 0.1
 P(bathscale=155lb | mass=155lb) = 0.65
 P(bathscale=160lb | mass=155lb) = 0.25

 P(bathscale=155lb | mass=160lb) = 0.3
 P(bathscale=160lb | mass=160lb) = 0.7

(These particular numbers are invented out of thin air - just an arbitrary probability distribution that has higher probability for the bathscale to show the correct mass.)

Comparing this to the spam filter problem, this is equivalent to measuring the conditional probabilities P(word|spam) by counting words in a training corpus of spam and non-spam.

Quick quiz:

 * Is there any relation between all the P(bathscale|mass) that have the same value for the mass?
 * Is there any relation between all the P(bathscale|mass) that have the same value for the bathscale?
 * Explain.

Now we would like to learn something about our mass by taking a measurement with the bathscale. In other words, we want to find P(mass=? | bathscale=something) after we weigh ourselves. But the way the math works out, we need an initial assumption (the priors) about how likely the possible masses are. For this example, we'll say that we have no reason to prefer any of the possible masses.

Initial belief ("priors") : P(mass=150lb) = P(mass=155lb) = P(mass=160lb) = 1/3 .

Quick quiz:

 * Where did that 1/3 come from?

Now let's say we weigh ourselves with the bathscale and get

Observation 1 : bathscale = 155lb.

The next step is to use that observation to update our belief about how much we weigh, namely to find P(mass=150|bathscale=155), P(mass=155|bathscale=155), and P(mass=160|bathscale=155). The math we need for this update is Bayes Theorem.

Bayes Theorem : P(x|y) = (1/P(y)) * P(y|x) * P(x)

(The derivation of this formula comes from P(x & y) = P(x) * P(y|x) = P(y) * P(x|y).)

The way to think about this formula is that it lets us update P(x) (our prior) given some new knowledge (i.e.
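As an aside, this table is easy to hold in code. Here's a minimal sketch in Python (the nested-dict layout and the name P_b_given_m are my own choices), which also checks that the invented numbers form valid distributions:

```python
# The P(bathscale | mass) table as a nested dict: outer key is the true
# mass m, inner key is the bathscale reading b.  Readings not listed
# (e.g. b=160 when m=150) have probability 0.
P_b_given_m = {
    150: {150: 0.6,  155: 0.4},
    155: {150: 0.1,  155: 0.65, 160: 0.25},
    160: {155: 0.3,  160: 0.7},
}

# Sanity check: for each true mass, the reading probabilities sum to 1,
# since the bathscale must show *some* reading.
for mass, readings in P_b_given_m.items():
    assert abs(sum(readings.values()) - 1.0) < 1e-9

print(P_b_given_m[155].get(160, 0.0))   # P(b=160 | m=155)
```

Note that this check answers the first quiz question: rows with the same mass must sum to 1, while nothing forces the P(bathscale|mass) with the same bathscale reading to sum to anything in particular.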
y) by using a "weight factor" P(y|x) which takes into account how consistent our new knowledge (y) is with our previous belief (x).

Typically the difficulty in applying this is that we don't know P(y). The typical way to get around this is to use the fact that the P(x[i]|y) for different x[i] but only one y all have to add to 1. So, let C = 1/P(y), some constant scaling factor that we don't know yet.

Then we use our new observation (bathscale=155lb) to update our current beliefs, i.e. P(mass=any)=1/3. To save typing, I'm going to write mass as "m" and bathscale as "b".

 updated belief       consistency of new     old belief
                      observation with mass
 ----------------     --------------------   ----------
 P(m=150 | b=155) = C * P(b=155|m=150)     * P(m=150)
 P(m=155 | b=155) = C * P(b=155|m=155)     * P(m=155)
 P(m=160 | b=155) = C * P(b=155|m=160)     * P(m=160)

Putting in the numbers for the terms on the right gives

 P(m=150 | b=155) = C * 0.4  * 1/3
 P(m=155 | b=155) = C * 0.65 * 1/3
 P(m=160 | b=155) = C * 0.3  * 1/3

and since these three probabilities must add to 1,

 C = 3 / (0.4 + 0.65 + 0.3) = 2.22

which gives

 P(m=150 | b=155) = 2.22 * 0.4  * 1/3 = 0.296
 P(m=155 | b=155) = 2.22 * 0.65 * 1/3 = 0.482
 P(m=160 | b=155) = 2.22 * 0.3  * 1/3 = 0.222

We can think of these three numbers as our new beliefs: we still don't know how much we weigh, but 155lb is a bit more probable.

Suppose that by somehow resetting this not-very-helpful bathscale (perhaps by picking it up and shaking it), we can weigh ourselves again and still have the same odds of the bathscale giving the correct mass. Drop the " | b=155 " notation which described the first measurement, and just keep the current beliefs.

Belief after one measurement :

 P(mass=150lb) = 0.296
 P(mass=155lb) = 0.482
 P(mass=160lb) = 0.222

And now play the same game again. Suppose the bathscale now reads 150lb on the 2nd weighing.

Observation 2 : bathscale = 150lb.
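This update arithmetic is mechanical enough to script. A minimal sketch in Python (the function and variable names are my own), reproducing the first measurement's update:

```python
def bayes_update(prior, likelihood):
    """Bayes theorem update.  prior is {mass: P(m)}; likelihood is
    {mass: P(b=observed | m)}.  The constant C = 1/P(b) falls out of
    requiring the posterior probabilities to sum to 1."""
    unnormalized = {m: likelihood.get(m, 0.0) * p for m, p in prior.items()}
    total = sum(unnormalized.values())          # this is P(b)
    return {m: u / total for m, u in unnormalized.items()}

prior = {150: 1/3, 155: 1/3, 160: 1/3}          # initial belief
b155  = {150: 0.4, 155: 0.65, 160: 0.3}         # P(b=155 | m) from the table
belief = bayes_update(prior, b155)
for m in sorted(belief):
    print(m, round(belief[m], 3))               # roughly 0.296, 0.481, 0.222
```

(The third digit of the middle value comes out 0.481 here rather than 0.482, because the hand calculation rounded C to 2.22 before multiplying.)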
Using C2 for the new constant,

 updated belief       consistency of new     previous belief
                      observation with mass
 ----------------     --------------------   ---------------
 P(m=150 | b=150) = C2 * P(b=150|m=150)    * P(m=150)
 P(m=155 | b=150) = C2 * P(b=150|m=155)    * P(m=155)
 P(m=160 | b=150) = C2 * P(b=150|m=160)    * P(m=160)

which gives

 P(m=150 | b=150) = C2 * 0.6 * 0.296
 P(m=155 | b=150) = C2 * 0.1 * 0.482
 P(m=160 | b=150) = C2 * 0.0 * 0.222

Since these three probabilities need to sum to 1.0, the constant C2 needs to be

 C2 = 1 / (0.6 * 0.296 + 0.1 * 0.482 + 0.0 * 0.222) = 4.429

and so we have

 P(m=150 | b=150) = 4.429 * 0.6 * 0.296 = 0.787
 P(m=155 | b=150) = 4.429 * 0.1 * 0.482 = 0.213
 P(m=160 | b=150) = 4.429 * 0.0 * 0.222 = 0.000

which gives the beliefs after two measurements (dropping the "given" notation for the 2nd measurement) :

 P(m=150) = 0.787
 P(m=155) = 0.213
 P(m=160) = 0.000

Note that as we have set things up, P(m=160) is now zero forever - there isn't anything in this procedure that can change that. If we really believe that a bathscale reading of 150 implies that the mass cannot be 160, then this is the right conclusion. However, in many real world situations it's better to keep a tiny non-zero probability, so that if (say) the next dozen measurements were all b=160, we wouldn't just crash. This sort of edge case glitch is not uncommon.

It turns out that this approach can work pretty well even if the numeric factors P(bathscale|mass) are inaccurate. Remember that these are the "how consistent is that" connections between the beliefs and measurements. As long as we aren't trying to get all the way to the right answer in one step, but instead are doing multiple passes that converge on the answer, all we need is for those numeric factors to move us in the right direction. (This does however get harder for higher dimension problems, since there are so many more directions to go.)
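One ad hoc way to avoid the locked-at-zero glitch is to clamp each weight factor to a small floor before multiplying. A minimal sketch (the floor value eps=1e-6 is an arbitrary choice of mine, not anything principled):

```python
def bayes_update_floored(prior, likelihood, eps=1e-6):
    """Bayes update as before, but clamp each weight factor P(b|m) to a
    small floor eps so that no belief can lock at exactly zero."""
    unnorm = {m: max(likelihood.get(m, 0.0), eps) * p
              for m, p in prior.items()}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}

belief = {150: 1/3, 155: 1/3, 160: 1/3}
for like in ({150: 0.4, 155: 0.65, 160: 0.3},     # observation 1: b = 155
             {150: 0.6, 155: 0.1,  160: 0.0}):    # observation 2: b = 150
    belief = bayes_update_floored(belief, like)

# belief is still roughly (0.787, 0.213, 0.000), but P(m=160) is now a
# tiny positive number, so a later run of b=160 readings could revive it.
```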
There are a number of variations of this scenario that could make the situation more realistic, leading to several math models.

(1) Introduce the element of time.

Rather than thinking of this as repeated measurements of one true mass, we could be trying to track our daily mass by using the bathscale once each day. This would be a hidden markov model (HMM) : our true mass is the underlying state, which is not changed by the observations. Each day's mass is some (perhaps probabilistic) function of the previous day's mass, which we might guesstimate from some formula (i.e. we hope to be losing 1 lb per day). But all we see are the daily bathscale measurements, each of which only indicates probabilities of the true mass.

The conceptual causality picture (as in many HMM descriptions) is a diagram like this. Each arrow indicates what causes what (probabilistically or otherwise).

 mass[day i]       ->  mass[day i+1]   ->  mass[day i+2]
      |                    |                    |
      V                    V                    V
 bathscale[day i]     bathscale[i+1]       bathscale[day i+2]

We observe a sequence of numbers for bathscale[j], and from that try to infer mass[j]. We use Bayes Theorem to update P(mass|bathscale) at each time step, just as we did before.

(2) Make the variables continuous.

Here we drop the simplifying assumption that the masses are rounded off to 5lb. If we kept things discrete but increased the resolution to, say, 1/10 lb, and kept a range of 10 lb, then we would have 100 possible masses and 100 possible bathscale readings. The (causal but probabilistic) connection between the true mass and the reading on the bathscale, P(bathscale|mass), is now a matrix with 100 * 100 values (ugh), and our belief is now P(mass) with 100 values. In the continuous limit, P(m) is a 1D function, P(b|m) is a 2D function, and the formulas get messier, with integrals rather than sums.

(3) Approximate the continuous case with particles.
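Before fleshing out the particle idea, here is a quick aside making variation (1) concrete: one predict/update day of the HMM on the same 3-value grid. The overnight transition probabilities (stay put, or drop one 5lb grid step) are invented purely for illustration:

```python
masses = (150, 155, 160)

# Invented motion model: P(tomorrow's mass | today's mass).  We hope to
# be losing weight, so there is some chance of dropping one grid step.
P_trans = {150: {150: 1.0},                 # already at the grid's bottom
           155: {155: 0.8, 150: 0.2},
           160: {160: 0.8, 155: 0.2}}

def predict(belief):
    """Push today's belief through the overnight motion model."""
    new = dict.fromkeys(masses, 0.0)
    for m, p in belief.items():
        for m2, q in P_trans[m].items():
            new[m2] += p * q
    return new

def update(belief, likelihood):
    """The same Bayes theorem update as in the static problem."""
    unnorm = {m: likelihood.get(m, 0.0) * belief[m] for m in masses}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}

belief = {150: 1/3, 155: 1/3, 160: 1/3}
belief = predict(belief)                                   # overnight change
belief = update(belief, {150: 0.4, 155: 0.65, 160: 0.3})   # day's b = 155
```

Repeating that predict/update pair once per day is exactly the recursive Bayesian estimation loop from the resources list.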
One particularly nice approach which keeps the math fairly simple but still gives reasonable results for realistic problems is to use "particles" to give an approximate sample of the continuous distribution. In this case, that would mean inventing a large number of possible masses, with the probability indicated by how many of those masses fall within a given range. So rather than having a belief model like the ones above, we instead have a list of masses (particles). If the masses were discrete, a list of 8 masses might be something like

 masses = [150, 155, 155, 155, 155, 155, 160, 160]

which would mean P(150)=1/8, P(155)=5/8, P(160)=2/8.

Then given an observation, for each particle we determine how consistent that observation is with that particle's properties. In this case, that consistency is given by the same P(bathscale|mass) conditional probabilities that we used for the update step. We use that "importance" factor to update the list of particles, using a probabilistic approach: if the particle is a good fit (high importance), we (probably) add more like it (i.e. near it) to the list. If the particle is a bad fit (low importance), we (probably) remove it from the list. This approach, like the previous ones, works best if we don't try to change things too much in one step (one observation) but instead gradually converge on the right answer.

For this problem, with a list like the one above, the particle filter update could use the same P(b|m) "importance factors" (i.e. Bayesian weight factors). We want high P(b|m) (i.e. good fit) to mean more particles like that; low P(b|m) (i.e. bad fit) to mean fewer. So one ad hoc probabilistic approach could be

Given a particle with mass m and a bathscale observation b :

 # find "goodness of fit" weighting factor
 # i.e.
 # probability of getting observation b given particle m's properties
 # big weight means more particles like this; small weight means fewer
 # if normalized, 0 <= weight <= 1
 weight = Pbm[b][m]    # pull from a matrix of values in the discrete case

 x = rand()            # random number with 0 <= x <= 1
 if x < weight:        # likely if weight is big
     add one more particle to the list near m
 if x > weight:        # likely if weight is small
     remove this particle from the list

(4) Update the world model from observations (if possible).

In some situations it may be possible to update not just the markov state but also the P(b|m) probabilities themselves, using the observations as additional training data.

---------------------------------------------------------------------------
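As a closing sketch, the ad hoc particle update from (3) can be fleshed out into runnable form. This is just one possible minimal implementation, with my own choices: keep-or-drop survival replaces the add/remove steps, and the list is then resampled back to a fixed size by duplicating survivors (in this discrete version, "adding a particle near" a good one just means duplicating it).

```python
import random

# P(b | m) from the discrete problem, indexed Pbm[b][m] as in the
# pseudocode above; the zeros are the readings the table never produces.
Pbm = {150: {150: 0.6, 155: 0.1,  160: 0.0},
       155: {150: 0.4, 155: 0.65, 160: 0.3},
       160: {150: 0.0, 155: 0.25, 160: 0.7}}

def particle_update(particles, b):
    """One observation step: keep good-fit particles (probably), drop
    bad-fit ones (probably), then duplicate survivors at random so the
    particle count stays constant."""
    survivors = []
    for m in particles:
        weight = Pbm[b][m]             # goodness of fit, 0 <= weight <= 1
        if random.random() < weight:   # likely to survive if weight is big
            survivors.append(m)
    if not survivors:                  # edge case: everyone died; start over
        survivors = list(particles)
    return [random.choice(survivors) for _ in range(len(particles))]

random.seed(0)                         # repeatable run
particles = [150, 155, 155, 155, 155, 155, 160, 160]
particles = particle_update(particles, b=150)   # observe bathscale = 150
print(particles)
```

With so few particles the filter degenerates quickly (often a single surviving mass takes over the whole list), which is why real particle filters use many particles and add jitter when duplicating.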