April 19
aside
schedule for the rest of the term
If we can, I would like to cover:
- projects
- ANOVA test
- Chi squared test
- regression
... but there isn't much time left in the term. We'll see.
Tue Apr 19 : quiz review ; also chapter 5 stuff if time allows (paired data, t distribution)
Thu Apr 21 : quiz 2 ; finish chap 5 discussion
Tue Apr 26 : project check in ; ANOVA & Chi squared
Thu Apr 28 : linear regression & term review
Tue May 3 : share projects ; writeups due Fri May 6
Final exam : take home exam, emailed to you noon Sun May 8, due noon Mon May 9.
quiz 2 topics
1. Normal distribution :
- What is it, what does it look like, how do you use it.
- What is the "z-score", and what are the probabilities associated with above/below z-scores.
- Review chapter 3 stuff, i.e. exercises 3.1, 3.5, 3.11
2. Estimating mean of population from a sample of size N
- "point estimate"
- confidence interval : standard error is sd/sqrt(N), +- ranges
- Review chap 4 , i.e. 4.5, 4.11
3. Hypothesis testing :
- What is null hypothesis (H0), alternative (HA), and "significance" (alpha)
- What are type-1 & type-2 errors
- one-sided vs two-sided tests
- when do and what does it mean to "reject the null hypothesis"?
- what a "pvalue" is, what it means, and how to connect with alpha
- using sample population mean technique :
- (1) Formulate experimental design (H0, HA, alpha)
- (2) Collect N data points.
- (3) From that, find result (sample mean) and standard_error ( sd/sqrt(N))
- (4) From those, find z and pvalue
- 4a) z = (result-H0)/error
- 4b) pvalue = probability of result "at least that extreme" (either 1 or 2 sided)
- 4c) compare pvalue with chosen significance to reach decision
- Review problems from chap 4, i.e. 4.23, 4.29
The R functions we care about are
- pnorm(z) = p, where p = probability that x <= z (for mean=0, sigma=1.0)
- qnorm(p) = z, inverse function from pnorm (which z has prob p of x <= z )
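A quick sketch of how the two functions relate, using only base R. The particular z values (1.0 and 1.96) are just illustrative choices, not anything special to the quiz.

```r
# pnorm gives the area to the left of z under the standard normal curve
below  = pnorm(1.0)                    # P(x <= 1.0), about 0.84
above  = 1 - pnorm(1.0)                # P(x > 1.0), about 0.16
middle = pnorm(1.96) - pnorm(-1.96)    # P(-1.96 <= x <= 1.96), about 0.95

# qnorm undoes pnorm : it returns the z that has the given area to its left
z_back = qnorm(below)                  # recovers 1.0
z_95   = qnorm(0.975)                  # about 1.96, the z behind a 95% interval
```

The "above" probability is always 1 minus the "below" one, which is why the one sided "greater than" pvalue is written as 1 - pnorm(z).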
R recipe given some data values :
Say you are trying to see if some population mean
is larger than 4 (i.e. a one sided test).
> data = c(2, 3, 4, 10, 9, 7, 6, 5, 2, 3, 3, 7) # sample
> H0 = 4 # null hypothesis
> alpha = 0.05 # significance level for the one sided test
> result = mean(data) # sample mean = 5.08 = estimate of population mean
> N = length(data) # number of data points (12)
> error = sd(data)/sqrt(N) # standard error = 0.78 = estimate of sigma of sample mean
> z = (result - H0)/error # z score, i.e. how far result is from H0 = 1.38
> pvalue = 1 - pnorm(z) # one sided "greater than" pvalue = 0.0832
Here we fail to reject the null hypothesis since 0.08 is not smaller than 0.05.
That is, while this result is bigger than 4, this result
is not unlikely enough to rule out the null hypothesis and random chance.
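The same sample can also be used for the other quiz topics, namely a two sided pvalue and a confidence interval. This is a sketch that reuses the recipe's variable names; pvalue_two, zstar and ci are names made up here, and the 95% level is an assumed choice.

```r
data = c(2, 3, 4, 10, 9, 7, 6, 5, 2, 3, 3, 7)   # same sample as in the recipe
H0 = 4                                          # null hypothesis
result = mean(data)                             # sample mean
error = sd(data)/sqrt(length(data))             # standard error
z = (result - H0)/error                         # z score

# two sided pvalue : chance of a result at least this far from H0 in either direction
pvalue_two = 2 * (1 - pnorm(abs(z)))            # about 0.17

# 95% confidence interval for the population mean : result +- zstar * error
zstar = qnorm(0.975)                            # about 1.96
ci = c(result - zstar*error, result + zstar*error)   # about 3.55 to 6.62
```

The interval contains 4, which agrees with the two sided test not rejecting H0 at the 0.05 level.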
alcohol.csv
(I'm not going to put anything this complex on the exam.)
Here's the R recipe that works for me :
data = read.csv("alcohol.csv") # data frame
yes = subset(data, data$alcohol == "yes"); # drunk people data frame
no = subset(data, data$alcohol == "no"); # sober people data frame
mem_yes = yes$memory # drunk memory sample values
mem_no = no$memory # sober mem sample values
result = mean(mem_yes) - mean(mem_no) # observed result = -1.5
H0 = 0.0 # null hypothesis
error_yes = sd( mem_yes )/sqrt( length(mem_yes) )
error_no = sd( mem_no )/sqrt( length(mem_no) )
# combine to get overall error by formula for diff of two random variables :
error = sqrt( error_yes**2 + error_no**2) # result standard error
z = ( result - H0 ) / error # = - 0.6
pvalue = pnorm(z) # = 0.26
We cannot reject the null hypothesis, since we don't have pvalue < 0.05
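For anyone who wants to try this without the file, here is a sketch of the same steps wrapped in a function. The numbers are made up for illustration and are NOT the values in alcohol.csv; the function name two_sample_z is also made up here.

```r
# hypothetical memory scores, NOT the values in alcohol.csv
mem_yes = c(5, 7, 4, 6, 3, 8, 5, 4)
mem_no  = c(6, 9, 5, 7, 8, 6, 7)

# z score for the difference of two sample means
two_sample_z = function(a, b, H0 = 0.0) {
  result  = mean(a) - mean(b)               # observed difference
  error_a = sd(a)/sqrt(length(a))           # standard error of each mean
  error_b = sd(b)/sqrt(length(b))
  error   = sqrt(error_a**2 + error_b**2)   # combined standard error
  (result - H0)/error
}

z = two_sample_z(mem_yes, mem_no)
pvalue = pnorm(z)                           # one sided "less than" pvalue

# R's built in Welch test computes the same statistic, but looks up
# the pvalue in a t distribution instead of the normal
welch = t.test(mem_yes, mem_no, alternative = "less")
```

Comparing z with welch$statistic is a handy check that the by-hand arithmetic was typed in correctly.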
chap 5 complications
If time allows :
- difference of means of data1 and data2 : error = sqrt(error1**2 + error2**2)
- paired data (similar but different analysis ; just subtract pairwise and use one data set).
- t distribution (if N < 20, just use a different distribution : pt(z, df = N-1)).
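A sketch of the last two items together : paired data handled by subtracting pairwise, then a t distribution because N is small. The before/after numbers are made up for illustration, and the names diffs, tscore, pvalue_t, pvalue_z and check are hypothetical.

```r
# hypothetical before / after scores for the same 8 people
before = c(12, 15, 11, 14, 13, 16, 12, 15)
after  = c(14, 15, 13, 17, 13, 18, 15, 16)

diffs  = after - before              # paired data : one data set of differences
N      = length(diffs)
result = mean(diffs)                 # mean difference
error  = sd(diffs)/sqrt(N)           # standard error
tscore = (result - 0)/error          # same formula as z, with H0 = 0

# N is small, so use the t distribution with N-1 degrees of freedom
pvalue_t = 1 - pt(tscore, df = N-1)  # one sided "greater than"
pvalue_z = 1 - pnorm(tscore)         # what the normal would give (smaller)

# built in equivalent of the steps above
check = t.test(after, before, paired = TRUE, alternative = "greater")
```

The t distribution has fatter tails than the normal, so for the same score it gives a larger pvalue; check$p.value should match pvalue_t.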