April 28
for the good of the order
Any questions about what's due and when it's due?
cramming in a bunch of new stuff at the very end
... we'll see how this goes.
STATISTICAL TESTS
1 : one or two means , from a sample
This is the stuff we've (mostly) already done.
1) H0: mean = ___ : sample size of N, standard error, normal ...
1.1) paired values : subtract & do the same
1.2) difference of two means: error = sqrt(error1**2 + error2**2)
1.3) N is small: use t-distribution with df=(N-1)
also see : https://en.wikipedia.org/wiki/Student's_t-test
(in R, the t-test collapses several of the steps for you)
2 : ANOVA
"Analysis of Variance"
The situation is that you have a more than 2 means to compare :
type_a type_b type_c ...
------ ------ ------
i=1 value value value
i=2 value value value
i=3 value value value
requirements
- standard deviation across types is similar
- each type is approximately normally
- all independent - no pairing, no built-in correlations
hypothesis framework
- H0 : means (of each column) are same
- HA : at least one mean is different
result :
- "F" statistic, whose distribution gives you a P value (Pr).
- may followup to see which mean is different:
In R, the data form is the "melted" version with all numeric values in one column.
R also needs a "model" of which factors are influencing the values.
# example : is one of several treatments (a,b,c) changing plant growth ?
# (numbers from wikipedia one-way anova)
> plantsframe = data.frame(a=c(6,8,4,5,3,4), b=c(8,12,9,11,6,8), c=c(13,9,11,8,7,12))
> plantsframe
a b c
1 6 8 13
2 8 12 9
3 4 9 11
4 5 11 8
5 3 6 7
6 4 8 12
> library(reshape)
> plants = melt(plantsframe)
> plants
variable value
1 a 6
2 a 8
3 a 4
4 a 5
5 a 3
6 a 4
7 b 8
8 b 12
9 b 9
10 b 11
11 b 6
12 b 8
13 c 13
14 c 9
15 c 11
16 c 8
17 c 7
18 c 12
> model = value ~ variable # one-factor (numeric ~ factor)
> plants_anova = aov(model, plants)
> summary(plants_anova)
Df Sum_Sq Mean_Sq F_value Pr(>F)
variable 2 84 42.00 9.265 0.0024 **
Residuals 15 68 4.53
Conclusion: Reject null. The numbers don't look like three samples from one population.
(probability of this result if that is true is 0.0024)
Typically the "factors" or "types" are not numeric but labeling fields,
while the values are all of the same form, i.e. "height of plant in inches"
There are variations depending on the model of whether the factors interact,
e.g. "two-way ANOVA"
And there are various "follow-up tests" once you see "yes, something is going on".
3 : Chi Squared
This one also has many variations
version 0 - sum of z scores
The underlying statistic is a sum of z scores,
testing to see if some set of observations
fit a given model.
model = (mean1, mean2, mean3) and (sd1, sd2, sd3)
observe : (value1, value2, value3)
where as before, each z = (value - mean)/sd
Then
chisq = sum( z1**2 + z2**2 + ...)
= sum of all the squared z scores
This is a well understood distribution which depends on df (degrees of freedom)
where df = N - 1 = n_model_parameters - 1
Again we get an overall pvalue to see how probable these
observations are given the null hypothesis.
(Here the null hypothesis is model is correct description of observed)
From my physics days, I would do this with given means & sds.
I'd find z explicitly, and then use
R's qchisq, pchisq functions - just like qnorm and pnorm
version 1 : numeric values are frequencies (counting something)
BUT more commonly this test is applied when the numbers are
frequency counts.
In that case, the standard deviation can
be guessed to be sqrt(n) if the model value is n.
(We haven't done this, but it can be shown from
the binomial distribution, i.e. flipping a probability p coin.)
And since this is the most common use, R has chisq.test()
which assumes frequencies and does most of the work for you.
example : does a jury fit (known) population racial percentages ?
(from textbook, section 6.3)
white black hispanic other
Population : 0.72 0.07 0.12 0.09
Jurors : 205 26 25 19
> chisq.test( c(205, 26, 25, 19), p=c(0.72, 0.07, 0.12, 0.09) )
Chi-squared test for given probabilities
data: c(205, 26, 25, 19)
X-squared = 5.8896, df = 3, p-value = 0.1171
(Note again that style of test this only works if the numbers are *counts* ,
and that therefore the standard deviations can be estimated automatically.)
If the probabilities aren't given, then they're assumed to be equal.
=== version 2 : testing independence ===
* https://en.wikipedia.org/wiki/Pearson's_chi-squared_test#Test_of_independence
It turns out that a minor variation of this technique can be used
to see if two factors are independent of each other in the sense of
probability variables being independent.
In this case, the model (averages and standard deviations) is built by
assuming that you can get probabilities by summing the rows and columns
to get p_columns and p_rows. The null hypothesis is that nothing else
(no correlation) is needed to explain the numbers.
<code>
> smoke = data.frame( smoke=c(7,87,12,9), nonsmoke=c(4,102,7,8))
> row.names(smoke) = c("heavy_exercise", "no_exercise", "some_exercise", "regular_exercise")
> smoke
smoke nonsmoke
heavy_exercise 7 4
no_exercise 87 102
some_exercise 12 7
regular_exercise 9 8
> chisq.test(smoke)
Pearson's Chi-squared test
data: smoke
X-squared = 3.2328, df = 3, p-value = 0.3571 ... => null hypothesis (independent) is OK.
4 : linear regression
This one is fairly straightforward : is there a trend
relating two lists of numbers x and y ? In other words,
on a plot is there a random smudge or a slanted blur?
# http://www.cyclismo.org/tutorial/R/linearLeastSquares.html
> year = c(2000 , 2001 , 2002 , 2003 , 2004)
> rate = c(9.34 , 8.50 , 7.62 , 6.93 , 6.60)
> cor(year, rate)
-0.9888 # correlation coefficient
> plot(year, rate) # see it
> fit = lm( rate ~ year) # linear model
> summary(fit)
Call:
lm(formula = rate ~ year)
Residuals:
1 2 3 4 5
0.132 -0.003 -0.178 -0.163 0.212
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1419.20800 126.94957 11.18 0.00153 **
year -0.70500 0.06341 -11.12 0.00156 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2005 on 3 degrees of freedom
Multiple R-squared: 0.9763, Adjusted R-squared: 0.9684
F-statistic: 123.6 on 1 and 3 DF, p-value: 0.001559
But do I need to know this stuff ?
Not until you need it - then you go look it up, eh?
The good news is that ideas we've been covering (pvalues, null hypothesis, etc)
carry over into these tests.
The bad news is that there are many of these sorts of
tests, and it's hard to know which of the questions
you might be able to ask of your data fit easily
into a commonly used test.
In my experience, it is not uncommon for people doing this stuff
for real to end up inventing their own test ... often by building
up a computer simulation and then trying to see if the observations
fit that model.
course evaluations
... handed out & done in class