April 19

aside

every basketball shot of kobe bryant
game of thrones character affinity network
NY democratic primary forecast from fivethirtyeight.com

schedule for the rest of the term

If we can, I would like to cover:

projects
ANOVA test
Chi squared test
regression

... but there isn't much time left in the term. We'll see.

 Tues Apr 19   quiz review   Also chapter 5 stuff if time allows : paired data, t distribution
 Thu  Apr 21   quiz 2        Finish chap 5 discussion
 Tue  Apr 26   project check in ; ANOVA & Chi squared
 Thu  Apr 28   linear regression & term review
 Tue  May  3   share projects - writeups due Fri May 6

Final exam : take home exam, emailed to you noon Sun May 8, due noon Mon May 9.

quiz 2 topics

1. Normal distribution :

What is it, what does it look like, how do you use it.
What is the "z-score", and what are the probabilities associated with above/below z-scores.
Review chapter 3 stuff, i.e. exercises 3.1, 3.5, 3.11
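
A quick sketch of the z-score idea in R. The numbers here (mean 100, sigma 15, value 130) are made up for illustration and are not from the textbook exercises.

```r
# Hypothetical numbers, chosen only to illustrate the z-score.
mu    = 100     # assumed population mean
sigma = 15      # assumed population standard deviation
x     = 130     # one observed value

z = (x - mu)/sigma       # z-score : how many sigmas x lies above the mean
p_below = pnorm(z)       # probability of a value at or below x
p_above = 1 - pnorm(z)   # probability of a value above x
```

Here z works out to 2, so p_below is roughly 0.977 and p_above is roughly 0.023.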

2. Estimating mean of population from a sample of size N

"point estimate"
confidence interval : point estimate +- (z multiplier) * (standard error), where standard error is sd/sqrt(N) and the multiplier is e.g. 1.96 for 95%
Review chap 4 , i.e. 4.5, 4.11
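
As a sketch of the confidence interval calculation, here is a 95% interval for the same 12-point sample used in the R recipe later in these notes. The variable names (estimate, zstar, lower, upper) are just my choices. With N this small the normal multiplier is only a rough approximation.

```r
data = c(2, 3, 4, 10, 9, 7, 6, 5, 2, 3, 3, 7)  # sample
N = length(data)               # number of data points
estimate = mean(data)          # point estimate of the population mean
error = sd(data)/sqrt(N)       # standard error
zstar = qnorm(0.975)           # multiplier for 95% confidence, about 1.96
lower = estimate - zstar*error # lower end of the interval
upper = estimate + zstar*error # upper end of the interval
```

This gives an interval of roughly 3.55 to 6.62 around the point estimate 5.08.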

3. Hypothesis testing :

What is null hypothesis (H0), alternative (HA), and "significance" (alpha)
What are type-1 & type-2 errors
one-sided vs two-sided tests
when do you "reject the null hypothesis", and what does that mean?
what a "pvalue" is, what it means, and how to connect it with alpha
using the sample mean technique :
- (1) Formulate experimental design (H0, HA, alpha)
- (2) Collect N data points.
- (3) From that, find result (sample mean) and standard_error ( sd/sqrt(N))
- (4) From those, find z and pvalue
  - 4a) z = (result-H0)/error
  - 4b) pvalue = probability of result "at least that extreme" (either 1 or 2 sided)
  - 4c) compare pvalue with chosen significance to reach decision
Review problems from chap 4, i.e. 4.23, 4.29

The R functions we care about are

pnorm(z) = p, where p = probability that x <= z (for mean=0, sigma=1.0)

qnorm(p) = z, the inverse function of pnorm (i.e. the z for which x <= z has probability p)
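
A few calls to try at the R prompt, as a sketch of how the two functions fit together. The specific inputs (1.96, 0.975, 1.3, and a normal with mean 60 and sd 5) are arbitrary examples.

```r
p = pnorm(1.96)                    # about 0.975
z = qnorm(0.975)                   # about 1.96
back = qnorm(pnorm(1.3))           # round trip gives back 1.3

# Both functions also accept a mean and sd, so you can skip the z-score step :
p2 = pnorm(70, mean = 60, sd = 5)  # same as pnorm((70 - 60)/5)
```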

R recipe given some data values :

Say you are trying to see if some population mean is larger than 4 (i.e. a one sided test).

 > data = c(2, 3, 4, 10, 9, 7, 6, 5, 2, 3, 3, 7)  # sample
 > H0 = 4                                         # null hypothesis
 > alpha = 0.05              #  significance level, for a one sided test
 > result = mean(data)       #  sample mean = 5.08 = estimate of population mean
 > N = length(data)          #  number of data points (12)
 > error = sd(data)/sqrt(N)  #  standard error = 0.78 = estimate of sigma of sample mean
 > z = (result - H0)/error   #  z score, i.e. how far result is from H0 = 1.38
 > pvalue = 1 - pnorm(z)     #  one sided "greater than" pvalue = 0.0832

Here we fail to reject the null hypothesis since 0.08 is not smaller than 0.05. That is, while this result is bigger than 4, it is not unlikely enough to rule out the null hypothesis and random chance.
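
For comparison, here is a sketch of how the pvalue changes with the kind of test, using the same data and H0. The names p_greater, p_less, p_two and reject are mine.

```r
data = c(2, 3, 4, 10, 9, 7, 6, 5, 2, 3, 3, 7)  # sample
H0 = 4                                         # null hypothesis
alpha = 0.05                                   # significance level
z = (mean(data) - H0) / (sd(data)/sqrt(length(data)))

p_greater = 1 - pnorm(z)           # one sided, HA : mean > 4
p_less = pnorm(z)                  # one sided, HA : mean < 4
p_two = 2 * (1 - pnorm(abs(z)))    # two sided, HA : mean != 4

reject = p_greater < alpha         # TRUE means reject the null hypothesis
```

The two sided pvalue is twice the one sided "greater than" value, so about 0.17, and reject comes out FALSE.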

alcohol.csv

(I'm not going to put anything this complex on the exam.)

Here's the R recipe that works for me :

   data = read.csv("alcohol.csv")              # data frame
 
   yes = subset(data, data$alcohol == "yes");  # drunk people data frame
   no = subset(data, data$alcohol == "no");    # sober people data frame
 
   mem_yes = yes$memory                        # drunk memory sample values
   mem_no = no$memory                          # sober mem sample values
 
   result = mean(mem_yes) - mean(mem_no)       # observed result = -1.5
   H0 = 0.0                                    # null hypothesis
 
   error_yes = sd( mem_yes )/sqrt( length(mem_yes) )
   error_no = sd( mem_no )/sqrt( length(mem_no) )
   # combine to get overall error by formula for diff of two random variables :
   error = sqrt( error_yes**2 + error_no**2)   # result standard error
 
   z = ( result - H0 ) / error                 # = - 0.6
   pvalue = pnorm(z)                           # = 0.26

We cannot reject the null hypothesis, since the pvalue is not smaller than 0.05.
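
If you don't have alcohol.csv handy, here is a self-contained sketch of the same two-sample recipe. The memory scores below are made up by me and are NOT the values in the file. As a sanity check it also calls R's built-in t.test, which (in its default Welch form) computes the same statistic but gets its pvalue from a t distribution instead of the normal.

```r
# Made-up memory scores, for illustration only.
mem_yes = c(4, 6, 5, 7, 3, 5, 6, 4)
mem_no  = c(6, 7, 5, 8, 6, 7, 5, 8)

result = mean(mem_yes) - mean(mem_no)           # observed difference of means
H0 = 0.0                                        # null hypothesis : no difference

error_yes = sd(mem_yes)/sqrt(length(mem_yes))   # standard error, first group
error_no  = sd(mem_no)/sqrt(length(mem_no))     # standard error, second group
error = sqrt(error_yes^2 + error_no^2)          # standard error of the difference

z = (result - H0)/error
pvalue = pnorm(z)                               # one sided "less than"

# Cross-check : same statistic, t distribution used for the pvalue.
check = t.test(mem_yes, mem_no, alternative = "less")
```

check$statistic should match z, while check$p.value comes out somewhat larger than pvalue because the t distribution has fatter tails.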

chap 5 complications

If time allows :

difference of means of data1 and data2 : error = sqrt(error1**2 + error2**2)
paired data (similar but different analysis ; just subtract pairwise and treat the differences as one data set).
t distribution (if N < 20, use it in place of the normal : pt(z, df = N-1) instead of pnorm(z)).
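
A sketch of the paired data and t distribution ideas together. The before/after scores are made up, and the variable names are mine. The last line cross-checks against R's built-in paired t.test.

```r
# Made-up scores for the same 8 people, measured twice.
before = c(12, 15, 11, 14, 13, 16, 12, 15)
after  = c(14, 15, 13, 17, 13, 18, 13, 17)

d = after - before             # subtract pairwise : now just one data set
N = length(d)
result = mean(d)               # mean difference
error = sd(d)/sqrt(N)          # standard error of the mean difference
tscore = (result - 0)/error    # H0 : mean difference is 0

pvalue = 1 - pt(tscore, df = N-1)   # one sided "greater than", t distribution

# Cross-check with the built-in test.
check = t.test(after, before, paired = TRUE, alternative = "greater")
```

The built-in test should report the same statistic and pvalue as the hand calculation.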

http://cs.marlboro.edu/ courses/ spring2016/statistics/ notes/ April_19
last modified Tuesday April 19 2016 8:38 am EDT
