Data Science

Spring 2020 course site

Feb 7

your work

Look at your results and discuss the "make some plots on a dataset of your choice" assignment.

what's coming next

Our mission for the next few weeks is to cover the material in chapters 5 (statistics), 6 (probability), and 7 (inference).

There's a lot here and it's pretty mathy. Your goal should be to at least understand the concepts in general. We'll get more practice with specific techniques as we go along, if and when we need them.

These chapters cover some of the same material that Matt's doing in his statistics course this term but without digging into as many details.

The ideas here are all aimed at trying to understand what we can conclude given a collection of numbers in a dataset.

Please read these chapters. I'll walk through some of the material in class; we'll see what time allows.

The assignment for next week will be some exercises that I'll construct, perhaps using the iris data that you saw in the last class.

overview

chap 5 : statistics.py

Describing one set of numbers: mean, median, quantile, variance, standard_deviation

Describing the relation between two ordered lists of numbers: covariance, correlation

Discuss what each of these means.

See, for example, Wikipedia: correlation and dependence

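# Using the textbook's Python library:

from scratch.statistics import *

# Two lists of the same length, paired up by position
numbers = [1, 10, 2, 20, 3]
other_numbers = [1, 100, -3, 20, 4]

m = mean(numbers)                 # the average value
s = standard_deviation(numbers)   # how spread out the values are around the mean

c = correlation(numbers, other_numbers)   # between -1 and 1; how strongly the two lists vary together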
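
To make the definitions concrete, here is a minimal sketch that computes the same three quantities by hand, without the textbook library. It assumes the sample convention (dividing by n - 1) for variance and covariance; the function bodies are illustrative stand-ins, not the textbook's code.

```python
import math

def mean(xs):
    """Arithmetic average of a list of numbers."""
    return sum(xs) / len(xs)

def variance(xs):
    """Average squared deviation from the mean (n - 1 convention, an assumption)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def standard_deviation(xs):
    """Square root of the variance, so it has the same units as the data."""
    return math.sqrt(variance(xs))

def covariance(xs, ys):
    """How two paired lists vary together around their own means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations; always between -1 and 1."""
    sx, sy = standard_deviation(xs), standard_deviation(ys)
    if sx == 0 or sy == 0:
        return 0.0  # no variation in one list, so no linear relationship to measure
    return covariance(xs, ys) / (sx * sy)

numbers = [1, 10, 2, 20, 3]
other_numbers = [1, 100, -3, 20, 4]

m = mean(numbers)
s = standard_deviation(numbers)
c = correlation(numbers, other_numbers)
print(m, s, c)
```

Because correlation divides out the units of both lists, it can be compared across datasets in a way that covariance cannot.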

chap 6 : probability.py

starting concepts: events E and F, probabilities P(E), joint probability P(E,F), conditional probability P(E|F), dependent vs independent events
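
One way to see these concepts in action is to count outcomes in a small example. The sketch below uses two fair dice, with E = "the first die is 6" and F = "the total is at least 10"; both events are my own illustrative choices, not taken from the textbook.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event = fraction of outcomes where it is true."""
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

def E(o):
    return o[0] == 6            # first die shows 6

def F(o):
    return o[0] + o[1] >= 10    # total is at least 10

p_E = prob(E)
p_F = prob(F)
p_EF = prob(lambda o: E(o) and F(o))   # joint probability P(E,F)
p_E_given_F = p_EF / p_F               # conditional probability P(E|F)

# E and F are independent exactly when P(E,F) == P(E) * P(F)
independent = (p_EF == p_E * p_F)

print(p_E, p_F, p_EF, p_E_given_F, independent)
```

Here knowing that the total is at least 10 raises the chance that the first die is a 6, so the two events are dependent.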

Bayes' Theorem: the relation between P(E|F) and P(F|E).
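
As a short worked derivation, the theorem follows from writing the joint probability in two ways. The notation ¬E for "not E" is my own addition here.

```latex
P(E,F) = P(E \mid F)\, P(F) = P(F \mid E)\, P(E)
\quad\Longrightarrow\quad
P(E \mid F) = \frac{P(F \mid E)\, P(E)}{P(F)}

% The denominator can be expanded over the two cases "E" and "not E":
P(F) = P(F \mid E)\, P(E) + P(F \mid \neg E)\, P(\neg E)
```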

Bayesian vs Frequentist interpretations of what probability means.

In machine learning : if we have an initial guess for P(x) (perhaps of an email being spam), we can use more information like which words are in it to refine that to P(x|a,b,c) and be more accurate.
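
A minimal sketch of that kind of refinement, using Bayes' theorem one word at a time. All of the numbers and both example words are made up for illustration, and applying the updates one after another assumes the words are independent given the class (the "naive Bayes" assumption).

```python
def bayes_update(prior, p_evidence_if_spam, p_evidence_if_ham):
    """Return P(spam | evidence) given P(spam) and the two likelihoods."""
    numerator = p_evidence_if_spam * prior
    denominator = numerator + p_evidence_if_ham * (1 - prior)
    return numerator / denominator

# Hypothetical initial guess: half of all email is spam
p_spam = 0.5

# Hypothetical likelihoods: "lottery" is common in spam, rare in normal mail
p_after_first = bayes_update(p_spam, p_evidence_if_spam=0.40, p_evidence_if_ham=0.01)

# Hypothetical likelihoods: "meeting" is rare in spam, common in normal mail
p_after_both = bayes_update(p_after_first, p_evidence_if_spam=0.05, p_evidence_if_ham=0.30)

print(p_after_first, p_after_both)
```

The first word pushes the estimate toward spam, and the second pulls it partway back; each piece of evidence moves the probability according to how much more likely it is under one class than the other.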

random variables and probability distributions : discrete, continuous

interesting well-known probability distributions :

PDF (probability density function), CDF (cumulative distribution function).

For example :

Several approaches :

# from our textbook :

from scratch.probability import *

p1 = uniform_cdf(0.7)  # P(x < 0.7) when P(x)=1 for 0 <= x <= 1

p2 = normal_cdf(0.7)   # P(x < 0.7) when P(x) is normal ("bell curve") with mean=0, std_dev=1
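
To show how a PDF and a CDF relate, here is a standalone sketch that needs only Python's math module. The functions are my own stand-ins with the same names as the textbook's, not its actual code, and the simple midpoint-rule integrate helper is hypothetical. The point is that the CDF at x is the area under the PDF to the left of x.

```python
import math

def uniform_cdf(x):
    """P(X < x) for a uniform distribution on [0, 1]."""
    if x < 0:
        return 0.0
    if x < 1:
        return x
    return 1.0

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Bell-curve density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for a normal distribution, written with the error function."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2

def integrate(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

p1 = uniform_cdf(0.7)
p2 = normal_cdf(0.7)

# The density is negligible below -10, so this area approximates the CDF at 0.7
p2_numeric = integrate(normal_pdf, -10, 0.7)

print(p1, p2, p2_numeric)
```

The two values for the normal distribution should agree to several decimal places, which is a quick check that the density and cumulative functions describe the same distribution.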

chap 7 : inference.py

... coming

https://cs.marlboro.college/cours/spring2020/data/notes/feb7
last modified Tue April 23 2024 1:30 pm