Jim Mahoney | Feb 13 2014 | MIT License

Suppose I flip a coin. What is the probability that it is heads? I'm sure you already know the answer: 0.5.

But let's be sure we know where that comes from.

There are two possible outcomes, and each is equally likely. The probability is the fraction of the outcomes which meet our condition.

In [6]:

```
coin = ['heads', 'tails']
head = [c for c in coin if c=='heads']
len(head)/len(coin) # probability of heads
```

Out[6]:

Now suppose I flip a coin and roll a six sided die. What is the probability that I get a heads and a 1?

There is a $\frac{1}{6}$ chance of rolling a 1, since there are 6 equally likely outcomes.

The probability of both of these things happening is

$$ P(\text{heads}, 1) = \frac{1}{2} \times \frac{1}{6} = P(\text{heads}) \times P(1) $$

You can confirm this by listing all the possibilities and counting.
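Here is that check done by brute force: enumerate all twelve equally likely (coin, die) outcomes and count the ones we want.

In [ ]:

```python
from itertools import product

coin = ['heads', 'tails']
die = [1, 2, 3, 4, 5, 6]

# All equally likely (coin, die) outcomes : 2 * 6 = 12 of them.
outcomes = list(product(coin, die))

# Fraction of outcomes that are both heads and 1.
p_heads_and_1 = sum(1 for c, d in outcomes if c == 'heads' and d == 1) / len(outcomes)

print(p_heads_and_1)      # 1/12
print((1/2) * (1/6))      # same value, from the multiplication rule
```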

However, this approach of just multiplying them works *only* if the two variables are independent, i.e. only if the value of one doesn't affect the value of the other.

Say we have some dominoes, i.e. things with two values on them, which I will represent by a collection of pairs.

In [7]:

```
dominoes = [(1, 'a'),
            (1, 'b'),
            (1, 'a'),
            (2, 'a'),
            (2, 'b'),
            (2, 'b')]
```

We can think of the number and the letter as two random variables. I'll call them *number* and *letter*.

By counting up those dominoes that match each condition, find the following.

- What is P(number=1) ?
- What is P(letter='a') ?
- What is P(1, 'a') ? (i.e. number=1 *and* letter='a')

Does the multiplication rule still work?

No, because these two variables are *not independent*.
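You can verify this by counting. Here is one way, reusing the dominoes list from above.

In [ ]:

```python
# The same dominoes as above.
dominoes = [(1, 'a'), (1, 'b'), (1, 'a'),
            (2, 'a'), (2, 'b'), (2, 'b')]
n = len(dominoes)

p_1 = sum(1 for num, let in dominoes if num == 1) / n                    # 3/6
p_a = sum(1 for num, let in dominoes if let == 'a') / n                  # 3/6
p_1_and_a = sum(1 for num, let in dominoes
                if num == 1 and let == 'a') / n                          # 2/6

print(p_1, p_a, p_1_and_a)
print(p_1 * p_a)    # 1/4, which is *not* P(1, 'a') = 1/3
```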

To help understand this notion, we define a new kind of probability :

$$ P(x|y) = \text{probability of } x \text{ given that we know } y $$

To calculate this, first we get all the things where condition y is satisfied. Then, using that subset of all possible things, we find how many of those have property x.

For example, suppose we want to find P(number=1 | letter=a).

First, find those dominoes where the letter is a. There are three of them : 1a, 1a, 2a. Then of those, count how many are 1. We get

$$ P(1|a) = \frac{2}{3} $$

If the numbers and letters were independent, this would be the same as P(1) ... but it isn't.

- What is $ P(a|1) $ ?
- Can you write code to find this?
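Here is one possible sketch of such code. The helper function `conditional` is just an illustration, not something defined earlier : it filters down to the things satisfying the given condition, then counts within that subset, exactly as described above.

In [ ]:

```python
dominoes = [(1, 'a'), (1, 'b'), (1, 'a'),
            (2, 'a'), (2, 'b'), (2, 'b')]

def conditional(matches_x, matches_y, things):
    """P(x|y) : among the things satisfying y, the fraction also satisfying x."""
    given_y = [t for t in things if matches_y(t)]
    return sum(1 for t in given_y if matches_x(t)) / len(given_y)

p_1_given_a = conditional(lambda d: d[0] == 1, lambda d: d[1] == 'a', dominoes)
p_a_given_1 = conditional(lambda d: d[1] == 'a', lambda d: d[0] == 1, dominoes)

print(p_1_given_a)   # 2/3, as counted by hand above
print(p_a_given_1)   # also 2/3 : of the three 1-dominoes, two are 'a'
```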

How do we find $P(x,y)$ when x and y are not independent?

It turns out that the following is always true :

$$ P(x, y) = P(y) \cdot P(x|y) = P(x) \cdot P(y|x) $$

And from that, we can find a way to get from P(x|y) to P(y|x).

(Quick quiz : is this consistent with the independent variable case?)

Bayes' theorem follows directly from that last line :

$$ P(x | y) = \frac{P(y|x) \cdot P(x)}{P(y)} $$

Sometimes we don't know P(y) directly. If we have all the conditional probabilities and all the values that x can take, we can get P(y) from

$$ P(y) = \sum_i P(y|x_i) \cdot P(x_i) $$

In a machine learning situation, we often want to find a way to calculate P(x) for some situation that we don't know the answer to.
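Both of those formulas can be checked numerically on the dominoes, using the counts worked out above.

In [ ]:

```python
dominoes = [(1, 'a'), (1, 'b'), (1, 'a'),
            (2, 'a'), (2, 'b'), (2, 'b')]
n = len(dominoes)

p_1 = 3 / n            # P(number=1), by direct count
p_2 = 3 / n            # P(number=2)
p_a_given_1 = 2 / 3    # counted earlier : of the three 1-dominoes, two are 'a'
p_a_given_2 = 1 / 3    # of the three 2-dominoes, one is 'a'

# Total probability : P(a) = P(a|1) P(1) + P(a|2) P(2)
p_a = p_a_given_1 * p_1 + p_a_given_2 * p_2
print(p_a)                          # 0.5, matching the direct count 3/6

# Bayes' theorem : P(1|a) = P(a|1) P(1) / P(a)
print(p_a_given_1 * p_1 / p_a)      # 2/3, matching the counting argument
```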

For example, suppose we have a database of movies and user ratings for some of those movies. And suppose we know that Bob has rated 10 of the 10000 movies in the database. Can we predict whether or not he'll like one of the other movies?

We want P(bob_likes_movie_X).

We could get a very rough guess by finding P(people_like_movie_X), without using any information about the person at all. But that won't give us a very good prediction about Bob.

What we can do is use the other information and the other people in the database to find out probabilities such as P(likes_X | likes_Y) i.e. the probability that they like movie X given that they liked movie Y, and then use what we know about the movies that Bob does like to make a better guess.

In other words, we build up a model of who likes what made of conditional probabilities, and then use what we know about a given person to predict what other movies that person is likely to like.

Systems like this are called "recommendation engines". And that's only one example.

Spam filters (i.e. probability of a message being spam given some properties of the message) are another example.

Word completion (how your phone gives you guesses as to what you're trying to type) is another example.

All these ideas depend on understanding how conditional probability works ... so it's worth putting in some time to stare at the examples.

We'll look at a machine learning example later - for now we're just trying to understand the probability notions.

In the intro Python course we worked out how to count the words in Moby Dick.

- Find the probability P(word) of words in Moby Dick. How much data is that to store?
- Find the conditional probabilities P(word | preceding_word) in Moby Dick.
- How much data is that to store?

- Could you find P(word | preceding_two_words) ?
- How much data would that be?
- Are there enough words in this corpus to make the counting statistics believable?
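To get started on the first two bullets, here is a sketch of the counting, run on a tiny made-up corpus rather than the full novel (reading in the actual text of Moby Dick is left as part of the exercise).

In [ ]:

```python
from collections import Counter, defaultdict

# A tiny stand-in corpus; for the exercise, use the text of Moby Dick instead.
text = "call me ishmael some years ago never mind how long"
words = text.split()
total = len(words)

# P(word) : each word's count divided by the total number of words.
counts = Counter(words)
p_word = {w: c / total for w, c in counts.items()}

# P(word | preceding_word) : for each preceding word, a distribution
# over the words that follow it, built from adjacent pairs.
following = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    following[prev][nxt] += 1

p_word_given_prev = {prev: {w: c / sum(cnt.values()) for w, c in cnt.items()}
                     for prev, cnt in following.items()}

print(p_word['me'])                      # 1/10 in this tiny corpus
print(p_word_given_prev['call']['me'])   # 1.0 : here 'call' is always followed by 'me'
```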
