April 28

for the good of the order

Any questions about what's due and when it's due?

cramming in a bunch of new stuff at the very end

... we'll see how this goes.

STATISTICAL TESTS

1 : one or two means , from a sample

This is the stuff we've (mostly) already done.


 1) H0: mean = ___   : sample size of N, standard error, normal ...

 1.1)  paired values : subtract & do the same
 1.2)  difference of two means: error = sqrt(error1**2 + error2**2)
 1.3)  N is small: use t-distribution with df=(N-1)
       also see : https://en.wikipedia.org/wiki/Student's_t-test
       (in R, the t-test collapses several of the steps for you)

2 : ANOVA

"Analysis of Variance"

https://en.wikipedia.org/wiki/Analysis_of_variance
https://en.wikipedia.org/wiki/F-test
http://personality-project.org/r/r.anova.html
section 5.5 in text

The situation is that you have a more than 2 means to compare :

           type_a     type_b    type_c  ...
	   ------     ------    ------
     i=1     value    value     value
     i=2     value    value     value
     i=3     value    value     value

requirements

standard deviation across types is similar
each type is approximately normally
all independent - no pairing, no built-in correlations

hypothesis framework

H0 : means (of each column) are same
HA : at least one mean is different

result :

"F" statistic, whose distribution gives you a P value (Pr).
may followup to see which mean is different:
- post-hoc (i.e. more test after first result) : Tukey
- https://en.wikipedia.org/wiki/Post_hoc_analysis

In R, the data form is the "melted" version with all numeric values in one column. R also needs a "model" of which factors are influencing the values.


# example : is one of several treatments (a,b,c) changing plant growth ?
# (numbers from wikipedia one-way anova)

> plantsframe = data.frame(a=c(6,8,4,5,3,4), b=c(8,12,9,11,6,8), c=c(13,9,11,8,7,12))
> plantsframe
      a   b  c
  1   6   8  13
  2   8  12   9
  3   4   9  11
  4   5  11   8
  5   3   6   7
  6   4   8  12
> library(reshape)
> plants = melt(plantsframe)
> plants
   variable value
1         a     6
2         a     8
3         a     4
4         a     5
5         a     3
6         a     4
7         b     8
8         b    12
9         b     9
10        b    11
11        b     6
12        b     8
13        c    13
14        c     9
15        c    11
16        c     8
17        c     7
18        c    12
> model = value ~ variable          # one-factor (numeric ~ factor)
> plants_anova = aov(model, plants) 
> summary(plants_anova)
            Df  Sum_Sq     Mean_Sq  F_value   Pr(>F)   
variable     2      84       42.00    9.265   0.0024  **
Residuals   15      68        4.53

Conclusion: Reject null. The numbers don't look like three samples from one population. (probability of this result if that is true is 0.0024)

Typically the "factors" or "types" are not numeric but labeling fields, while the values are all of the same form, i.e. "height of plant in inches"

There are variations depending on the model of whether the factors interact, e.g. "two-way ANOVA"

And there are various "follow-up tests" once you see "yes, something is going on".

3 : Chi Squared

This one also has many variations

version 0 - sum of z scores

The underlying statistic is a sum of z scores, testing to see if some set of observations fit a given model.

 model = (mean1, mean2, mean3) and (sd1, sd2, sd3)
 observe : (value1, value2, value3)

where as before, each z = (value - mean)/sd

Then

  chisq = sum(  z1**2 + z2**2 + ...)
        = sum of all the squared z scores

This is a well understood distribution which depends on df (degrees of freedom) where df = N - 1 = n_model_parameters - 1

Again we get an overall pvalue to see how probable these observations are given the null hypothesis. (Here the null hypothesis is model is correct description of observed)

From my physics days, I would do this with given means & sds. I'd find z explicitly, and then use R's qchisq, pchisq functions - just like qnorm and pnorm

version 1 : numeric values are frequencies (counting something)

BUT more commonly this test is applied when the numbers are frequency counts.

In that case, the standard deviation can be guessed to be sqrt(n) if the model value is n. (We haven't done this, but it can be shown from the binomial distribution, i.e. flipping a probability p coin.) And since this is the most common use, R has chisq.test() which assumes frequencies and does most of the work for you.


 example : does a jury fit (known) population racial percentages ?
 (from textbook, section 6.3)

                white    black   hispanic  other
 Population :   0.72     0.07    0.12      0.09
 Jurors     :   205      26      25        19

 > chisq.test( c(205, 26, 25, 19), p=c(0.72, 0.07, 0.12, 0.09) )

   Chi-squared test for given probabilities
   data:  c(205, 26, 25, 19)
   X-squared = 5.8896, df = 3, p-value = 0.1171

(Note again that style of test this only works if the numbers are *counts* ,
and that therefore the standard deviations can be estimated automatically.)

If the probabilities aren't given, then they're assumed to be equal.

=== version 2 : testing independence ===

* https://en.wikipedia.org/wiki/Pearson's_chi-squared_test#Test_of_independence 

It turns out that a minor variation of this technique can be used
to see if two factors are independent of each other in the sense of 
probability variables being independent.

In this case, the model (averages and standard deviations) is built by
assuming that you can get probabilities by summing the rows and columns
to get p_columns and p_rows. The null hypothesis is that nothing else
(no correlation) is needed to explain the numbers.

<code>
 > smoke = data.frame( smoke=c(7,87,12,9), nonsmoke=c(4,102,7,8))
 > row.names(smoke) = c("heavy_exercise", "no_exercise", "some_exercise", "regular_exercise")
 > smoke
                  smoke nonsmoke
 heavy_exercise       7        4
 no_exercise         87      102
 some_exercise       12        7
 regular_exercise     9        8
 > chisq.test(smoke)
	Pearson's Chi-squared test
 data:  smoke
 X-squared = 3.2328, df = 3, p-value = 0.3571   ... => null hypothesis (independent) is OK.

4 : linear regression

https://en.wikipedia.org/wiki/Regression_analysis

This one is fairly straightforward : is there a trend relating two lists of numbers x and y ? In other words, on a plot is there a random smudge or a slanted blur?


 # http://www.cyclismo.org/tutorial/R/linearLeastSquares.html
 > year = c(2000 ,   2001  ,  2002  ,  2003 ,   2004)
 > rate = c(9.34 ,   8.50  ,  7.62  ,  6.93  ,  6.60)
 > cor(year, rate)
 -0.9888             # correlation coefficient
 > plot(year, rate)  # see it
 > fit = lm( rate ~ year)     # linear model
 > summary(fit)
 
 Call:
 lm(formula = rate ~ year)

 Residuals:
      1      2      3      4      5 
  0.132 -0.003 -0.178 -0.163  0.212 

 Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
 (Intercept) 1419.20800  126.94957   11.18  0.00153 **
 year          -0.70500    0.06341  -11.12  0.00156 **
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
 Residual standard error: 0.2005 on 3 degrees of freedom
 Multiple R-squared:  0.9763,	Adjusted R-squared:  0.9684 
 F-statistic: 123.6 on 1 and 3 DF,  p-value: 0.001559

But do I need to know this stuff ?

Not until you need it - then you go look it up, eh?

The good news is that ideas we've been covering (pvalues, null hypothesis, etc) carry over into these tests.

The bad news is that there are many of these sorts of tests, and it's hard to know which of the questions you might be able to ask of your data fit easily into a commonly used test.

In my experience, it is not uncommon for people doing this stuff for real to end up inventing their own test ... often by building up a computer simulation and then trying to see if the observations fit that model.

https://en.wikipedia.org/wiki/Bootstrapping_%28statistics%29

course evaluations

... handed out & done in class

http://cs.marlboro.edu/ courses/ spring2016/statistics/ notes/ April_28
last modified Thursday April 28 2016 3:42 am EDT

Statistics

course

navigation