Spring 2020

# Feb 7

Look at your results and discuss the "make some plots on a dataset of your choice" assignment.

## what's coming next

Our mission for the next few weeks is to cover the material in chapters 5 (statistics), 6 (probability), and 7 (inference).

There's a lot here and it's pretty mathy. Your goal should be to at least understand the concepts in general. We'll get more practice with specific techniques as we go along, if and when we need them.

These chapters cover some of the same material that Matt's doing in his statistics course this term but without digging into as many details.

The ideas here are all aimed at trying to understand what we can conclude given a collection of numbers in a dataset.

Please read these chapters. I'll walk through some of the material in class; we'll see what time allows.

The assignment for next week will be some exercises that I'll construct, perhaps using the iris data that you saw in the last class.

# overview

## chap 5 : statistics.py

Describing one set of numbers: mean, median, quantile, variance, standard_deviation

Describing the relation between two ordered lists of numbers: covariance, correlation

Discuss what each of these means.

See for example wikipedia: correlation and dependence

# Using the textbook python library :

from scratch.statistics import *

numbers = [1, 10, 2, 20, 3]
other_numbers = [1, 100, -3, 20, 4]

m = mean(numbers)
s = standard_deviation(numbers)

c = correlation(numbers, other_numbers)
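
To see what these functions are doing, here is a rough sketch that computes the same quantities without the textbook library. It is my own minimal version, not the textbook's code. One assumption to flag: variance and covariance divide by n - 1 (the sample convention), so compare against `scratch.statistics` before relying on exact values.

```python
import math

def mean(xs):
    """Arithmetic average."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the sorted list (average of the two middle values if even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def variance(xs):
    """Average squared distance from the mean (assumption: divide by n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def standard_deviation(xs):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(xs))

def covariance(xs, ys):
    """How two equal-length lists vary together around their means."""
    assert len(xs) == len(ys), "lists must have the same length"
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance rescaled to lie between -1 and 1."""
    sx, sy = standard_deviation(xs), standard_deviation(ys)
    if sx == 0 or sy == 0:
        return 0.0  # no variation in one list, so no correlation to measure
    return covariance(xs, ys) / (sx * sy)

numbers = [1, 10, 2, 20, 3]
other_numbers = [1, 100, -3, 20, 4]

print("mean", mean(numbers))
print("median", median(numbers))
print("standard deviation", standard_deviation(numbers))
print("correlation", correlation(numbers, other_numbers))
```

Note how the median (3) sits well below the mean (7.2) here: the two large values pull the mean up but barely affect the median.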


## chap 6 : probability.py

Starting concepts: events E and F, probabilities P(E), joint probability P(E,F), conditional probability P(E|F), dependent vs independent events.
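
One way to make these concepts concrete is to enumerate every outcome of rolling two dice and count. The sketch below is my own illustration; the events E, F, G and the helper names (`prob`, `joint`, `conditional`, `independent`) are made up for this example and are not from the textbook.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event): fraction of outcomes for which event(outcome) is true."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def joint(a, b):
    """P(a, b): both events happen."""
    return prob(lambda o: a(o) and b(o))

def conditional(a, b):
    """P(a | b) = P(a, b) / P(b)."""
    return joint(a, b) / prob(b)

def independent(a, b):
    """Events are independent exactly when P(a, b) = P(a) P(b)."""
    return joint(a, b) == prob(a) * prob(b)

def E(o): return o[0] == 6            # first die shows 6
def F(o): return o[0] + o[1] >= 10    # total is at least 10
def G(o): return o[1] % 2 == 0        # second die is even

print("P(E)   =", prob(E))
print("P(E,F) =", joint(E, F))
print("P(E|F) =", conditional(E, F))
print("E, F independent?", independent(E, F))
print("E, G independent?", independent(E, G))
```

Knowing that the total is at least 10 raises the chance that the first die is a 6 from 1/6 to 1/2, so E and F are dependent. Knowing that the second die is even tells you nothing about the first die, so E and G are independent.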

Bayes' Theorem : the relation between P(E|F) and P(F|E), namely P(E|F) = P(F|E) P(E) / P(F).

Bayesian vs Frequentist interpretations of what probability means.

In machine learning : if we have an initial guess for P(x) (perhaps of an email being spam), we can use more information like which words are in it to refine that to P(x|a,b,c) and be more accurate.
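
Here is a small sketch of that refinement using Bayes' Theorem. All the numbers are invented for illustration (they do not come from real email data), and the `posterior` function is my own helper, not something from the textbook. Updating on a second word by reusing the first posterior as the new prior assumes the words are independent given spam or not spam, which is a simplification.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(x | a) from P(x), P(a | x), and P(a | not x), via Bayes' Theorem."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)  # P(a)
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only to illustrate the calculation.
p_spam = 0.2                 # initial guess: 20% of email is spam
p_word_given_spam = 0.5      # the word appears in half of spam emails
p_word_given_not_spam = 0.01 # and in 1% of legitimate emails

after_one_word = posterior(p_spam, p_word_given_spam, p_word_given_not_spam)
print(round(after_one_word, 3))  # 0.926

# A second (hypothetical) word, using the first posterior as the new prior.
after_two_words = posterior(after_one_word, 0.3, 0.05)
print(round(after_two_words, 3))
```

A 20% initial guess jumps to about 93% after seeing one strongly spam-associated word, and each additional piece of evidence moves the estimate again.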

random variables and probability distributions : discrete, continuous

interesting well-known probability distributions :

• coin flip
• dice
• binomial
• uniform
• normal
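
To get a feel for these distributions, you can draw samples from each with Python's standard `random` module and look at the sample means. This is a sketch of my own; the function names are made up for this example, and the binomial is built by hand as a count of successes in n coin-like trials.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def coin_flip(p=0.5):
    """1 for heads (probability p), 0 for tails."""
    return 1 if rng.random() < p else 0

def die_roll():
    """Integer from 1 to 6, each equally likely (discrete)."""
    return rng.randint(1, 6)

def binomial(n, p):
    """Number of successes in n independent trials, each with probability p."""
    return sum(coin_flip(p) for _ in range(n))

def uniform():
    """Real number in [0, 1), all values equally likely (continuous)."""
    return rng.random()

def normal(mu=0.0, sigma=1.0):
    """Draw from the bell curve with mean mu and standard deviation sigma."""
    return rng.gauss(mu, sigma)

def sample_mean(draw, count=10_000):
    """Average of `count` independent draws."""
    return sum(draw() for _ in range(count)) / count

print("coin     ", sample_mean(coin_flip))                  # theory: 0.5
print("die      ", sample_mean(die_roll))                   # theory: 3.5
print("binomial ", sample_mean(lambda: binomial(10, 0.5)))  # theory: 5
print("uniform  ", sample_mean(uniform))                    # theory: 0.5
print("normal   ", sample_mean(normal))                     # theory: 0
```

The printed values should land close to the theoretical means in the comments, but not exactly on them, since each is an average of random draws.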

PDF (probability density function), CDF (cumulative distribution function).

For example :

• Flip a coin 1000 times.
• What is the probability that the number of heads is between 600 and 700?

Several approaches :

• Calculate exactly using the discrete binomial distribution.
• Calculate approximately using the continuous normal distribution.
• Simulate.
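
The three approaches above can be sketched roughly as follows. This is my own version, not the textbook's, with a few assumptions: the coin is fair, the range is inclusive at both ends, and the normal approximation uses a continuity correction of 0.5. The function names are invented for this example.

```python
import math
import random

def exact_probability(n, lo, hi):
    """P(lo <= heads <= hi) for n fair flips, summing binomial coefficients."""
    return sum(math.comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for a normal distribution with the given mean and std dev."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

def normal_approximation(n, lo, hi):
    """Approximate the binomial by a normal with the same mean and std dev."""
    mu = n / 2
    sigma = math.sqrt(n) / 2
    # Widen the range by 0.5 on each side (continuity correction).
    return normal_cdf(hi + 0.5, mu, sigma) - normal_cdf(lo - 0.5, mu, sigma)

def simulate(n, lo, hi, trials=2000, seed=0):
    """Fraction of simulated experiments whose head count lands in [lo, hi]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # n random bits stand in for n fair flips; count the 1 bits as heads.
        heads = bin(rng.getrandbits(n)).count("1")
        if lo <= heads <= hi:
            hits += 1
    return hits / trials

for lo, hi in [(600, 700), (490, 510)]:
    print(f"heads in [{lo}, {hi}] out of 1000 flips")
    print("  exact     ", exact_probability(1000, lo, hi))
    print("  normal    ", normal_approximation(1000, lo, hi))
    print("  simulated ", simulate(1000, lo, hi))
```

For 600 to 700 heads the true probability is well under one in a million, so a simulation with a few thousand trials will almost certainly report 0. That is a useful lesson about the limits of simulating rare events. The 490 to 510 range is included so you can see all three approaches agree when the event is common.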

# from our textbook :

from scratch.probability import *

p1 = uniform_cdf(0.7)  # P(x < 0.7) when P(x)=1 for 0 <= x <= 1

p2 = normal_cdf(0.7)   # P(x < 0.7) when P(x) is normal ("bell curve") with mean=0, std_dev=1
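
If you want to check those two numbers without the textbook library, here is a standalone sketch. These are my own definitions, which I believe behave like the textbook's `uniform_cdf` and `normal_cdf` but have not been compared line by line. The `integrate` helper is made up for this example; it shows that the CDF is the area under the PDF up to x.

```python
import math

def uniform_cdf(x):
    """P(X < x) when X is uniform on [0, 1]."""
    if x < 0:
        return 0.0
    if x < 1:
        return x
    return 1.0

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the bell curve at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for the bell curve, written in terms of the error function."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

def integrate(f, a, b, steps=10_000):
    """Approximate the area under f from a to b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

p1 = uniform_cdf(0.7)
p2 = normal_cdf(0.7)
area = integrate(normal_pdf, -10, 0.7)  # -10 stands in for minus infinity

print(p1)              # 0.7
print(round(p2, 3))    # 0.758
print(round(area, 3))  # 0.758
```

The last two lines agree because integrating the PDF from far to the left up to 0.7 gives the CDF at 0.7.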


## chap 7 : inference.py

... coming
