-- Nov 11 notes --
* go over homework
* projects
* Overview of where we've been :
* statistics describing populations and samples of populations
mean
standard deviation
* binomial
a theoretical probability distribution
defined by p (prob of success), N (number of trials)
has mean=Np and sigma=sqrt(Npq) where q=1-p
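A quick numeric check of the binomial mean and sigma (the N=100, p=0.3 here are invented illustration numbers):

```python
# Check mean = N*p and sigma = sqrt(N*p*q) for a binomial distribution.
# N = 100, p = 0.3 are made-up illustration values.
from scipy.stats import binom

N, p = 100, 0.3
q = 1 - p
dist = binom(N, p)

print(dist.mean())   # N*p = 30.0
print(dist.std())    # sqrt(N*p*q) = sqrt(21), about 4.58
```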
* normal (or Gaussian)
another theoretical distribution
defined by mean (mu), standard deviation (sigma)
describes average of nearly anything in the long run
can be used to approximate binomial if Np>5 and Nq>5
can change scale to z=(x-mean)/sigma, then z has mean=0, sigma=1
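A sketch of the z rescaling; mu=100, sigma=15, x=130 are invented numbers:

```python
# Rescaling x to z = (x - mean)/sigma leaves probabilities unchanged:
# the standard normal (mean 0, sigma 1) gives the same answer.
# mu = 100, sigma = 15, x = 130 are invented illustration numbers.
from scipy.stats import norm

mu, sigma = 100.0, 15.0
x = 130.0
z = (x - mu) / sigma                      # z = 2.0

print(norm.cdf(x, loc=mu, scale=sigma))   # about 0.977
print(norm.cdf(z))                        # same probability on the z scale
```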
* Hypothesis tests
H0 = null hypothesis
Ha = alternative hypothesis (one tail or two tail)
various tests:
difference of percentages
difference of means
comparing mean with given value
alpha = chosen significance level (0.05 is typical)
p-value = measured probability of result, or
cutoff = critical value of statistic
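A sketch of a difference-of-means test showing both decision rules from the list above (p-value vs alpha, and statistic vs cutoff); every sample summary below is invented:

```python
# Two-tailed large-sample z test for a difference of means.
# Sample summaries (mean, std dev, size) are invented numbers.
from scipy.stats import norm

m1, s1, n1 = 5.2, 1.1, 40
m2, s2, n2 = 4.7, 1.3, 50

se = (s1**2 / n1 + s2**2 / n2) ** 0.5   # standard error of the difference
z = (m1 - m2) / se

alpha = 0.05
p_value = 2 * norm.sf(abs(z))       # two-tailed p-value
cutoff = norm.ppf(1 - alpha / 2)    # critical value, about 1.96

# The two decision rules always agree:
print(p_value < alpha, abs(z) > cutoff)
```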
* Student's t-Distribution
different distribution for each "degrees of freedom" = N-1
use instead of normal when sample N is small
compensates for poor estimate of sigma
practically the same as normal when N is big ( > 20 or so )
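A small-N t-test sketch; the eight data values are invented:

```python
# Small-N test using Student's t instead of the normal.
# The eight data values are invented illustration numbers.
from scipy import stats

data = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7]   # N = 8, so df = 7
t_stat, p = stats.ttest_1samp(data, popmean=5.0)  # H0: population mean = 5.0
print(t_stat, p)

# t's cutoff is wider than the normal's for small df, nearly equal for big df:
print(stats.t.ppf(0.975, df=7))     # about 2.36
print(stats.t.ppf(0.975, df=100))   # about 1.98 (normal gives 1.96)
```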
* New stuff
* Chi Square distribution
ChiSq = Z1**2 + Z2**2 + Z3**2 + ... + Zk**2
where each is normal with mean=0, sigma=1
with k degrees of freedom
always positive
expected value of ChiSq is k; much higher rejects null hypothesis
always one tailed
can see if model fits data
(i.e. expect y1,y2,y3,... but found x1,x2,x3,...)
used for a variety of purposes
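A small goodness-of-fit sketch of the expect-vs-observe idea, with invented die-roll counts:

```python
# ChiSq = sum( (observed - expected)**2 / expected ) for a fair-die model.
# The observed counts are invented illustration numbers.
from scipy.stats import chisquare

observed = [18, 22, 16, 25, 20, 19]       # 120 rolls, counts per face
expected = [20, 20, 20, 20, 20, 20]       # fair die predicts 120/6 each

chi2, p = chisquare(observed, f_exp=expected)
print(chi2, p)   # chi2 = 2.5 with k = 5 df; near its expected value, keep H0
```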
* Chi Square test of independence
common in various surveys and biology
Example:
                           democrat  republican  indep. | TOTALS
            -------------------------------------------+-------
            straight men       |          |         |   |   10
            gay men            |          |         |   |   72
            straight women     |          |         |   |   65
            gay women          |          |         |   |   13
            -------------------------------------------+-------
            TOTALS:           80         50        30   |  160
H0: no relation between rows and columns
Method:
(a) calculate expected values for each table entry from percentages
(b) measure observed values for each
(c) calculate ChiSq = sum( (observe-expect)**2 / expect )
(d) k = degrees of freedom = (rows-1)*(columns-1)
(e) compare result with critical value from appendix
Note that for this formula, numbers must be frequencies (counts).
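A runnable sketch of steps (a)-(e) on a filled-in table of counts; the counts below are invented and are not the ones from the survey above:

```python
# Chi Square test of independence on a 2x3 table of frequencies (counts).
# The observed counts are invented illustration numbers.
from scipy.stats import chi2_contingency

observed = [[30, 12, 8],     # e.g. row 1: counts in each column
            [20, 28, 22]]    # row 2

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)
print(dof)         # (rows-1)*(columns-1) = (2-1)*(3-1) = 2
print(expected)    # step (a): expected counts built from the margins
```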
* F-distribution
F = ( ChiSq1/k1 ) / ( ChiSq2/k2 )
= ratio of Chi Square values
always positive
defined by a pair of degrees of freedom (k1, k2), one for each Chi Square
expected value is around 1
one tail (typically k big rejects H0) or
two tail (k big or k~0 rejects H0)
usually used to compare variances between populations
if sample one has (s1, N1) and sample two has (s2, N2)
then F = s1**2/s2**2 with (N1-1, N2-1) degrees of freedom
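A variance-ratio sketch; the sample standard deviations and sizes are invented:

```python
# Variance-ratio F test: F = s1**2 / s2**2, where the numerator Chi Square
# carries N1-1 degrees of freedom and the denominator N2-1.
# Sample values (std dev, size) are invented illustration numbers.
from scipy.stats import f

s1, N1 = 2.3, 21
s2, N2 = 1.4, 16

F = s1**2 / s2**2
p = f.sf(F, N1 - 1, N2 - 1)   # one tail: is population 1 more variable?
print(F, p)
```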
* ANOVA - Analysis of Variance
similar to the t-test comparison of means, but handles more than two groups
several variations, but underlying idea is similar :
uses variance to measure segmented population differences
very popular hypothesis test
"robust" : this works pretty well even if *not* normal populations!
why not means t-test?
If Ngroups=5, then number of pairs = 5*4/2 = 10,
and so even at alpha=5% (1 in 20 chance per test),
you have about a 40% chance (1 - 0.95**10) of finding a "significant" difference by luck alone.
The ANOVA, on the other hand, finds one *single* statistic.
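The multiple-comparisons arithmetic above, checked numerically:

```python
# 10 independent pairwise tests at alpha = 0.05 give roughly a 40% chance
# of at least one false "significant" result.
n_groups = 5
n_pairs = n_groups * (n_groups - 1) // 2     # 5*4/2 = 10
alpha = 0.05
p_at_least_one = 1 - (1 - alpha) ** n_pairs
print(n_pairs, p_at_least_one)               # 10, about 0.401
```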
Example: one-way (or one-factor) ANOVA
drug A B C D E
----------------------------
group 1 | | N = 30 people
2 | | 25
3 | | 32
4 | | 10
----------------------------
means: mA mB mC mD mE
variances: sA**2 sB**2 sC**2 sD**2 sE**2
H0: mean of each column is the same
method:
        (a) estimate variance within treatments
            sW**2 = pooled average of the group variances
                  = sum( (Ni-1)*si**2 ) / kW
            kW = degrees of freedom = ( 30+25+32+10 - 4 )
        (b) estimate variance among treatments
            sAmong**2 = n * variance(mA,mB,mC,mD,mE)
                where n = number of people per treatment group (equal-size case)
            kA = degrees of freedom = 5-1 = 4
                where 5 = number of drug treatments
        (c) calculate F = sAmong**2 / sW**2 with (kA, kW) degrees of freedom
and use F-table to compare with critical value
to see if we can reject the null hypothesis
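A runnable one-way ANOVA sketch; since the drug table above is left blank, the three treatment groups below use invented measurements:

```python
# One-way ANOVA: one single F statistic for all groups at once.
# The measurements in each treatment group are invented numbers.
from scipy import stats

a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [4.6, 4.8, 4.5, 4.9, 4.7]
c = [5.4, 5.6, 5.2, 5.5, 5.3]

F, p = stats.f_oneway(a, b, c)
print(F, p)
# degrees of freedom: among = 3-1 = 2, within = 15-3 = 12
```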
Other variations: two-way (or two-factor)
* check out my online Resources for more places to read about this stuff,
especially Rice Virtual Lab