Statistics

Spring 2016
course
navigation

April 28

for the good of the order

Any questions about what's due and when it's due?

cramming in a bunch of new stuff at the very end

... we'll see how this goes.
STATISTICAL TESTS

1 : one or two means , from a sample

This is the stuff we've (mostly) already done.
1) H0: mean = ___ : sample size of N, standard error, normal ... 1.1) paired values : subtract & do the same 1.2) difference of two means: error = sqrt(error1**2 + error2**2) 1.3) N is small: use t-distribution with df=(N-1) also see : https://en.wikipedia.org/wiki/Student's_t-test (in R, the t-test collapses several of the steps for you)

2 : ANOVA

"Analysis of Variance"
The situation is that you have a more than 2 means to compare :
type_a type_b type_c ... ------ ------ ------ i=1 value value value i=2 value value value i=3 value value value
requirements
hypothesis framework
result :
In R, the data form is the "melted" version with all numeric values in one column. R also needs a "model" of which factors are influencing the values.
# example : is one of several treatments (a,b,c) changing plant growth ? # (numbers from wikipedia one-way anova) > plantsframe = data.frame(a=c(6,8,4,5,3,4), b=c(8,12,9,11,6,8), c=c(13,9,11,8,7,12)) > plantsframe a b c 1 6 8 13 2 8 12 9 3 4 9 11 4 5 11 8 5 3 6 7 6 4 8 12 > library(reshape) > plants = melt(plantsframe) > plants variable value 1 a 6 2 a 8 3 a 4 4 a 5 5 a 3 6 a 4 7 b 8 8 b 12 9 b 9 10 b 11 11 b 6 12 b 8 13 c 13 14 c 9 15 c 11 16 c 8 17 c 7 18 c 12 > model = value ~ variable # one-factor (numeric ~ factor) > plants_anova = aov(model, plants) > summary(plants_anova) Df Sum_Sq Mean_Sq F_value Pr(>F) variable 2 84 42.00 9.265 0.0024 ** Residuals 15 68 4.53
Conclusion: Reject null. The numbers don't look like three samples from one population. (probability of this result if that is true is 0.0024)
Typically the "factors" or "types" are not numeric but labeling fields, while the values are all of the same form, i.e. "height of plant in inches"
There are variations depending on the model of whether the factors interact, e.g. "two-way ANOVA"
And there are various "follow-up tests" once you see "yes, something is going on".

3 : Chi Squared

This one also has many variations

version 0 - sum of z scores

The underlying statistic is a sum of z scores, testing to see if some set of observations fit a given model.
model = (mean1, mean2, mean3) and (sd1, sd2, sd3) observe : (value1, value2, value3)
where as before, each z = (value - mean)/sd
Then
chisq = sum( z1**2 + z2**2 + ...) = sum of all the squared z scores
This is a well understood distribution which depends on df (degrees of freedom) where df = N - 1 = n_model_parameters - 1
Again we get an overall pvalue to see how probable these observations are given the null hypothesis. (Here the null hypothesis is model is correct description of observed)
From my physics days, I would do this with given means & sds. I'd find z explicitly, and then use R's qchisq, pchisq functions - just like qnorm and pnorm

version 1 : numeric values are frequencies (counting something)

BUT more commonly this test is applied when the numbers are frequency counts.
In that case, the standard deviation can be guessed to be sqrt(n) if the model value is n. (We haven't done this, but it can be shown from the binomial distribution, i.e. flipping a probability p coin.) And since this is the most common use, R has chisq.test() which assumes frequencies and does most of the work for you.
example : does a jury fit (known) population racial percentages ? (from textbook, section 6.3) white black hispanic other Population : 0.72 0.07 0.12 0.09 Jurors : 205 26 25 19 > chisq.test( c(205, 26, 25, 19), p=c(0.72, 0.07, 0.12, 0.09) ) Chi-squared test for given probabilities data: c(205, 26, 25, 19) X-squared = 5.8896, df = 3, p-value = 0.1171 (Note again that style of test this only works if the numbers are *counts* , and that therefore the standard deviations can be estimated automatically.) If the probabilities aren't given, then they're assumed to be equal. === version 2 : testing independence === * https://en.wikipedia.org/wiki/Pearson's_chi-squared_test#Test_of_independence It turns out that a minor variation of this technique can be used to see if two factors are independent of each other in the sense of probability variables being independent. In this case, the model (averages and standard deviations) is built by assuming that you can get probabilities by summing the rows and columns to get p_columns and p_rows. The null hypothesis is that nothing else (no correlation) is needed to explain the numbers. <code> > smoke = data.frame( smoke=c(7,87,12,9), nonsmoke=c(4,102,7,8)) > row.names(smoke) = c("heavy_exercise", "no_exercise", "some_exercise", "regular_exercise") > smoke smoke nonsmoke heavy_exercise 7 4 no_exercise 87 102 some_exercise 12 7 regular_exercise 9 8 > chisq.test(smoke) Pearson's Chi-squared test data: smoke X-squared = 3.2328, df = 3, p-value = 0.3571 ... => null hypothesis (independent) is OK.

4 : linear regression

This one is fairly straightforward : is there a trend relating two lists of numbers x and y ? In other words, on a plot is there a random smudge or a slanted blur?
# http://www.cyclismo.org/tutorial/R/linearLeastSquares.html > year = c(2000 , 2001 , 2002 , 2003 , 2004) > rate = c(9.34 , 8.50 , 7.62 , 6.93 , 6.60) > cor(year, rate) -0.9888 # correlation coefficient > plot(year, rate) # see it > fit = lm( rate ~ year) # linear model > summary(fit) Call: lm(formula = rate ~ year) Residuals: 1 2 3 4 5 0.132 -0.003 -0.178 -0.163 0.212 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1419.20800 126.94957 11.18 0.00153 ** year -0.70500 0.06341 -11.12 0.00156 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.2005 on 3 degrees of freedom Multiple R-squared: 0.9763, Adjusted R-squared: 0.9684 F-statistic: 123.6 on 1 and 3 DF, p-value: 0.001559

But do I need to know this stuff ?

Not until you need it - then you go look it up, eh?
The good news is that ideas we've been covering (pvalues, null hypothesis, etc) carry over into these tests.
The bad news is that there are many of these sorts of tests, and it's hard to know which of the questions you might be able to ask of your data fit easily into a commonly used test.
In my experience, it is not uncommon for people doing this stuff for real to end up inventing their own test ... often by building up a computer simulation and then trying to see if the observations fit that model.

course evaluations

... handed out & done in class
http://cs.marlboro.edu/ courses/ spring2016/statistics/ notes/ April_28
last modified Thursday April 28 2016 3:42 am EDT