Bayes Theorem, Hidden Markov Models, Machine Learning, Particle Filters, and all that.

Jim Mahoney | cs.marlboro.edu | Oct 2016 | MIT License

Resources :

 * https://en.wikipedia.org/wiki/Conditional_probability
 * https://en.wikipedia.org/wiki/Bayes'_theorem
 * https://en.wikipedia.org/wiki/Hidden_Markov_model
 * https://en.wikipedia.org/wiki/Particle_filter
 * https://en.wikipedia.org/wiki/Recursive_Bayesian_estimation
 * http://nbviewer.jupyter.org/github/rlabbe/Kalman-and-Bayesian-Filters-in-Python/blob/master/table_of_contents.ipynb

This is an attempt at a simple example that illustrates how these ideas tie together. The example itself is inspired by the discussion in rlabbe's jupyter notebook.

Suppose you are trying to figure out how much you weigh (your mass) by measuring yourself on a bathroom scale (bathscale) each day. However, your bathscale isn't all that accurate. In fact, it's terrible.

In what follows, I'm using the words (mass, bathscale) rather than (weight, scale) because for problems like these those words can mean something else, namely a weighting factor and a multiplicative scaling factor.

Let's start with a simplified discrete version of the problem: we'll only allow your possible masses to be (150lb, 155lb, 160lb). We'll also do some experiments ahead of time to learn how the bathscale behaves, by putting known masses (perhaps from a gym) on the bathscale and using repeated trials to measure its behavior, which we will describe with conditional probabilities P(bathscale|mass), where the notation P(x|y) means the probability of x given y.
 P(bathscale=150lb | mass=150lb) = 0.6
 P(bathscale=155lb | mass=150lb) = 0.4

 P(bathscale=150lb | mass=155lb) = 0.1
 P(bathscale=155lb | mass=155lb) = 0.65
 P(bathscale=160lb | mass=155lb) = 0.25

 P(bathscale=155lb | mass=160lb) = 0.3
 P(bathscale=160lb | mass=160lb) = 0.7

(These particular numbers are invented out of thin air - just an arbitrary probability distribution that has higher probability for the bathscale to show the correct mass.)

Comparing this to the spam filter problem, this is equivalent to measuring the conditional probabilities P(word|spam) by counting words in a training corpus of spam and non-spam.

Quick quiz:

 * Is there any relation between all the P(bathscale|mass) that have the same value for the mass?
 * Is there any relation between all the P(bathscale|mass) that have the same value for the bathscale?
 * Explain.

Now we would like to learn something about our mass by taking a measurement with the bathscale. In other words, we want to find P(mass=? | bathscale=something) after we weigh ourselves. But the way the math works out, we need an initial assumption (the priors) about how likely the possible masses are. For this example, we'll say that we have no reason to prefer any of the possible masses.

Initial belief ("priors") : P(mass=150lb) = P(mass=155lb) = P(mass=160lb) = 1/3 .

Quick quiz:

 * Where did that 1/3 come from?

Now let's say we weigh ourselves with the bathscale and get

Observation 1 : bathscale = 155lb.

The next step is to use that observation to update our belief about how much we weigh, namely to find P(mass=150|bathscale=155), P(mass=155|bathscale=155), and P(mass=160|bathscale=155). The math we need for this update is Bayes Theorem.

Bayes Theorem : P(x|y) = (1/P(y)) * P(y|x) * P(x)

(The derivation of this formula comes from P(x & y) = P(x) * P(y|x) = P(y) * P(x|y).)

The way to think about this formula is that it lets us update P(x) (our prior) given some new knowledge (i.e.
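As an aside, this table is easy to hold in code. Here's a minimal sketch in Python (the nested-dict layout and the name P_b_given_m are my own choices), which also checks that the invented numbers form valid distributions:

```python
# The P(bathscale | mass) table as a nested dict: outer key is the true
# mass m, inner key is the bathscale reading b.  Readings not listed
# (e.g. b=160 when m=150) have probability 0.
P_b_given_m = {
    150: {150: 0.6,  155: 0.4},
    155: {150: 0.1,  155: 0.65, 160: 0.25},
    160: {155: 0.3,  160: 0.7},
}

# Sanity check: for each true mass, the reading probabilities sum to 1,
# since the bathscale must show *some* reading.
for mass, readings in P_b_given_m.items():
    assert abs(sum(readings.values()) - 1.0) < 1e-9

print(P_b_given_m[155].get(160, 0.0))   # P(b=160 | m=155)
```

Note that this check answers the first quiz question: rows with the same mass must sum to 1, while nothing forces the P(bathscale|mass) with the same bathscale reading to sum to anything in particular.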
y) by using a "weight factor" P(y|x) which takes into account how consistent our new knowledge (y) is with our previous belief (x).

Typically the difficulty in applying this is that we don't know P(y). The typical way to get around this is to use the fact that the P(x[i]|y) for different x[i] but only one y all have to add to 1. So, let C = 1/P(y), some constant scaling factor that we don't know yet.

Then we use our new observation (bathscale=155lb) to update our current beliefs, i.e. P(mass=any)=1/3. To save typing, I'm going to write mass as "m" and bathscale as "b".

 updated belief       consistency of new     old belief
                      observation with mass
 ----------------     --------------------   ----------
 P(m=150 | b=155) = C * P(b=155|m=150)     * P(m=150)
 P(m=155 | b=155) = C * P(b=155|m=155)     * P(m=155)
 P(m=160 | b=155) = C * P(b=155|m=160)     * P(m=160)

Putting in the numbers for the terms on the right gives

 P(m=150 | b=155) = C * 0.4  * 1/3
 P(m=155 | b=155) = C * 0.65 * 1/3
 P(m=160 | b=155) = C * 0.3  * 1/3

and since these three probabilities must add to 1,

 C = 3 / (0.4 + 0.65 + 0.3) = 2.22

which gives

 P(m=150 | b=155) = 2.22 * 0.4  * 1/3 = 0.296
 P(m=155 | b=155) = 2.22 * 0.65 * 1/3 = 0.482
 P(m=160 | b=155) = 2.22 * 0.3  * 1/3 = 0.222

We can think of these three numbers as our new beliefs: we still don't know how much we weigh, but 155lb is a bit more probable.

Suppose that by somehow resetting this not-very-helpful bathscale (perhaps by picking it up and shaking it), we can weigh ourselves again and still have the same odds of the bathscale giving the correct mass. Drop the " | b=155 " notation which described the first measurement, and just keep the current beliefs.

Belief after one measurement :

 P(mass=150lb) = 0.296
 P(mass=155lb) = 0.482
 P(mass=160lb) = 0.222

And now play the same game again. Suppose the bathscale now reads 150lb on the 2nd weighing.

Observation 2 : bathscale = 150lb.
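This update arithmetic is mechanical enough to script. A minimal sketch in Python (the function and variable names are my own), reproducing the first measurement's update:

```python
def bayes_update(prior, likelihood):
    """Bayes theorem update.  prior is {mass: P(m)}; likelihood is
    {mass: P(b=observed | m)}.  The constant C = 1/P(b) falls out of
    requiring the posterior probabilities to sum to 1."""
    unnormalized = {m: likelihood.get(m, 0.0) * p for m, p in prior.items()}
    total = sum(unnormalized.values())          # this is P(b)
    return {m: u / total for m, u in unnormalized.items()}

prior = {150: 1/3, 155: 1/3, 160: 1/3}          # initial belief
b155  = {150: 0.4, 155: 0.65, 160: 0.3}         # P(b=155 | m) from the table
belief = bayes_update(prior, b155)
for m in sorted(belief):
    print(m, round(belief[m], 3))               # roughly 0.296, 0.481, 0.222
```

(The third digit of the middle value comes out 0.481 here rather than 0.482, because the hand calculation rounded C to 2.22 before multiplying.)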
Using C2 for the new constant,

 updated belief       consistency of new     previous belief
                      observation with mass
 ----------------     --------------------   ---------------
 P(m=150 | b=150) = C2 * P(b=150|m=150)    * P(m=150)
 P(m=155 | b=150) = C2 * P(b=150|m=155)    * P(m=155)
 P(m=160 | b=150) = C2 * P(b=150|m=160)    * P(m=160)

which gives

 P(m=150 | b=150) = C2 * 0.6 * 0.296
 P(m=155 | b=150) = C2 * 0.1 * 0.482
 P(m=160 | b=150) = C2 * 0.0 * 0.222

Since these three probabilities need to sum to 1.0, the constant C2 needs to be

 C2 = 1 / (0.6 * 0.296 + 0.1 * 0.482 + 0.0 * 0.222) = 4.429

and so we have

 P(m=150 | b=150) = 4.429 * 0.6 * 0.296 = 0.787
 P(m=155 | b=150) = 4.429 * 0.1 * 0.482 = 0.213
 P(m=160 | b=150) = 4.429 * 0.0 * 0.222 = 0.000

which gives the beliefs after two measurements (dropping the "given" notation for the 2nd measurement) :

 P(m=150) = 0.787
 P(m=155) = 0.213
 P(m=160) = 0.000

Note that as we have set things up, P(m=160) is now zero forever - there isn't anything in this procedure that can change that. If we really believe that a bathscale reading of 150 implies that the mass cannot be 160, then this is the right conclusion. However, in many real world situations it's better to keep a tiny non-zero probability, so that if (say) the next dozen measurements were all b=160, we wouldn't just crash. This sort of edge case glitch is not uncommon.

It turns out that this approach can work pretty well even if the numeric factors P(bathscale|mass) are inaccurate. Remember that these are the "how consistent is that" connections between the beliefs and measurements. As long as we aren't trying to get all the way to the right answer in one step, but instead are doing multiple passes that converge on the answer, all we need is for those numeric factors to move us in the right direction. (This does however get harder for higher dimension problems, since there are so many more directions to go.)
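One ad hoc way to avoid the locked-at-zero glitch is to clamp each weight factor to a small floor before multiplying. A minimal sketch (the floor value eps=1e-6 is an arbitrary choice of mine, not anything principled):

```python
def bayes_update_floored(prior, likelihood, eps=1e-6):
    """Bayes update as before, but clamp each weight factor P(b|m) to a
    small floor eps so that no belief can lock at exactly zero."""
    unnorm = {m: max(likelihood.get(m, 0.0), eps) * p
              for m, p in prior.items()}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}

belief = {150: 1/3, 155: 1/3, 160: 1/3}
for like in ({150: 0.4, 155: 0.65, 160: 0.3},     # observation 1: b = 155
             {150: 0.6, 155: 0.1,  160: 0.0}):    # observation 2: b = 150
    belief = bayes_update_floored(belief, like)

# belief is still roughly (0.787, 0.213, 0.000), but P(m=160) is now a
# tiny positive number, so a later run of b=160 readings could revive it.
```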
There are a number of variations of this scenario that could make the situation more realistic, leading to several math models.

(1) Introduce the element of time.

Rather than thinking of this as repeated measurements of one true mass, we could be trying to track our daily mass by using the bathscale once each day. This would be a hidden markov model (HMM) : our true mass is the underlying state, which is not changed by the observations. Each day's mass is some (perhaps probabilistic) function of the previous day's mass, which we might guesstimate from some formula (i.e. we hope to be losing 1 lb per day). But all we see are the daily bathscale measurements, each of which only indicates probabilities of the true mass.

The conceptual causality picture (as in many HMM descriptions) is a diagram like this. Each arrow indicates what causes what (probabilistically or otherwise).

 mass[day i]       ->  mass[day i+1]   ->  mass[day i+2]
      |                    |                    |
      V                    V                    V
 bathscale[day i]     bathscale[i+1]       bathscale[day i+2]

We observe a sequence of numbers for bathscale[j], and from that try to infer mass[j]. We use Bayes Theorem to update P(mass|bathscale) at each time step, just as we did before.

(2) Make the variables continuous.

Here we drop the simplifying assumption that the masses are rounded off to 5lb. If we kept things discrete but increased the resolution to, say, 1/10 lb, and kept a range of 10 lb, then we would have 100 possible masses and 100 possible bathscale readings. The (causal but probabilistic) connection between the true mass and the reading on the bathscale, P(bathscale|mass), is now a matrix with 100 * 100 values (ugh), and our belief is now P(mass) with 100 values. In the continuous limit, P(m) is a 1D function, P(b|m) is a 2D function, and the formulas get messier, with integrals rather than sums.

(3) Approximate the continuous case with particles.
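Before fleshing out the particle idea, here is a quick aside making variation (1) concrete: one predict/update day of the HMM on the same 3-value grid. The overnight transition probabilities (stay put, or drop one 5lb grid step) are invented purely for illustration:

```python
masses = (150, 155, 160)

# Invented motion model: P(tomorrow's mass | today's mass).  We hope to
# be losing weight, so there is some chance of dropping one grid step.
P_trans = {150: {150: 1.0},                 # already at the grid's bottom
           155: {155: 0.8, 150: 0.2},
           160: {160: 0.8, 155: 0.2}}

def predict(belief):
    """Push today's belief through the overnight motion model."""
    new = dict.fromkeys(masses, 0.0)
    for m, p in belief.items():
        for m2, q in P_trans[m].items():
            new[m2] += p * q
    return new

def update(belief, likelihood):
    """The same Bayes theorem update as in the static problem."""
    unnorm = {m: likelihood.get(m, 0.0) * belief[m] for m in masses}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}

belief = {150: 1/3, 155: 1/3, 160: 1/3}
belief = predict(belief)                                   # overnight change
belief = update(belief, {150: 0.4, 155: 0.65, 160: 0.3})   # day's b = 155
```

Repeating that predict/update pair once per day is exactly the recursive Bayesian estimation loop from the resources list.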
One particularly nice approach which keeps the math fairly simple but still gives reasonable results for realistic problems is to use "particles" to give an approximate sample of the continuous distribution. In this case, that would mean inventing a large number of possible masses, with the probability indicated by how many of those masses fall within a given range. So rather than having a belief model like the ones above, we instead have a list of masses (particles). If the masses were discrete, a list of 8 masses might be something like

 masses = [150, 155, 155, 155, 155, 155, 160, 160]

which would mean P(150)=1/8, P(155)=5/8, P(160)=2/8.

Then given an observation, for each particle we determine how consistent that observation is with that particle's properties. In this case, that consistency is given by the same P(bathscale|mass) conditional probabilities that we used for the update step. We use that "importance" factor to update the list of particles, using a probabilistic approach: if the particle is a good fit (high importance), we (probably) add more like it (i.e. near it) to the list. If the particle is a bad fit (low importance), we (probably) remove it from the list. This approach, like the previous ones, works best if we don't try to change things too much in one step (one observation) but instead gradually converge on the right answer.

For this problem, with a list like the one above, the particle filter update could use the same P(b|m) "importance factors" (i.e. Bayesian weight factors). We want high P(b|m) (i.e. good fit) to mean more particles like that; low P(b|m) (i.e. bad fit) to mean fewer. So one ad hoc probabilistic approach could be

Given a particle with mass m and a bathscale observation b :

 # find "goodness of fit" weighting factor
 # i.e.
 # probability of getting observation b given particle m's properties
 # big weight means more particles like this; small weight means fewer
 # if normalized, 0 <= weight <= 1
 weight = Pbm[b][m]    # pull from a matrix of values in the discrete case

 x = rand()            # random number with 0 <= x <= 1
 if x < weight:        # likely if weight is big
     add one more particle to the list near m
 if x > weight:        # likely if weight is small
     remove this particle from the list

(4) Update the world model from observations (if possible).

In some situations it may be possible to update not just the markov state but also the P(b|m) probabilities themselves, using the observations as additional training data.

---------------------------------------------------------------------------
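As a closing sketch, the ad hoc particle update from (3) can be fleshed out into runnable form. This is just one possible minimal implementation, with my own choices: keep-or-drop survival replaces the add/remove steps, and the list is then resampled back to a fixed size by duplicating survivors (in this discrete version, "adding a particle near" a good one just means duplicating it).

```python
import random

# P(b | m) from the discrete problem, indexed Pbm[b][m] as in the
# pseudocode above; the zeros are the readings the table never produces.
Pbm = {150: {150: 0.6, 155: 0.1,  160: 0.0},
       155: {150: 0.4, 155: 0.65, 160: 0.3},
       160: {150: 0.0, 155: 0.25, 160: 0.7}}

def particle_update(particles, b):
    """One observation step: keep good-fit particles (probably), drop
    bad-fit ones (probably), then duplicate survivors at random so the
    particle count stays constant."""
    survivors = []
    for m in particles:
        weight = Pbm[b][m]             # goodness of fit, 0 <= weight <= 1
        if random.random() < weight:   # likely to survive if weight is big
            survivors.append(m)
    if not survivors:                  # edge case: everyone died; start over
        survivors = list(particles)
    return [random.choice(survivors) for _ in range(len(particles))]

random.seed(0)                         # repeatable run
particles = [150, 155, 155, 155, 155, 155, 160, 160]
particles = particle_update(particles, b=150)   # observe bathscale = 150
print(particles)
```

With so few particles the filter degenerates quickly (often a single surviving mass takes over the whole list), which is why real particle filters use many particles and add jitter when duplicating.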