-- Nov 11 notes --

* go over homework
* projects

* Overview of where we've been:

  * statistics describing populations and samples of populations
      mean, standard deviation

  * binomial
      a theoretical probability distribution
      defined by p (prob of success), N (number of trials)
      has mean = Np and variance = Npq (so sigma = sqrt(Npq)), where q = 1-p

  * normal (or Gaussian)
      another theoretical distribution
      defined by mean (mu), standard deviation (sigma)
      describes the average of nearly anything in the long run
      can be used to approximate the binomial if Np > 5 and Npq > 5
      can change scale to z = (x - mean)/sigma; then z has mean = 0, sigma = 1

  * Hypothesis tests
      H0 = null hypothesis
      Ha = alternative (motivated) hypothesis -- one tail or two tail
      various tests:
        difference of percentages
        difference of means
        comparing a mean with a given value
      alpha   = chosen significance level (0.05 is typical)
      p-value = measured probability of the result, or
      cutoff  = critical value of the statistic

  * Student's t-distribution
      a different distribution for each "degrees of freedom" = N - 1
      use instead of the normal when the sample N is small
      compensates for a poor estimate of sigma
      same as the normal when N is big (> 20 or so)

* New stuff

  * Chi Square distribution
      ChiSq = Z1**2 + Z2**2 + Z3**2 + ... + Zk**2
        where each Z is normal with mean = 0, sigma = 1
      has k degrees of freedom
      always positive
      expected value of ChiSq is k; much higher rejects the null hypothesis
      always one tailed
      can see if a model fits data
        (i.e. expected y1, y2, y3, ... but found x1, x2, x3, ...)
      used for a variety of purposes

  * Chi Square test of independence
      common in various surveys and biology
      Example:
                          democrat  republican  indep.   TOTALS
          -----------------------------------------------------
          straight men   |        |           |        |   10
          gay men        |        |           |        |   72
          straight women |        |           |        |   65
          gay women      |        |           |        |   13
          -----------------------------------------------------
          TOTALS:            80        50        30       160

      H0: no relation between rows and columns
      Method:
        (a) calculate expected values for each table entry from percentages
        (b) measure observed values for each
        (c) calculate ChiSq = sum( (observed - expected)**2 / expected )
        (d) k = degrees of freedom = (rows - 1) * (columns - 1)
        (e) compare the result with the critical value from the appendix
      Note that for this formula the numbers must be frequencies (counts).

  * F-distribution
      F = ( ChiSq1/k1 ) / ( ChiSq2/k2 ) = ratio of Chi Square values
      always positive
      has two degrees-of-freedom parameters: k1 (numerator), k2 (denominator)
      expected value is around 1
      one tail (typically F big rejects H0) or
        two tail (F big or F ~ 0 rejects H0)
      usually used to compare variances between populations:
        if sample one has (s1, N1) and sample two has (s2, N2),
        then F = s1**2 / s2**2 with (N1 - 1, N2 - 1) degrees of freedom

  * ANOVA - Analysis of Variance
      similar to the t-test comparison of means, but for more types of data
      several variations, but the underlying idea is similar:
        uses variance to measure differences between segments of a population
      a very popular hypothesis test
      "robust": works pretty well even if the populations are *not* normal!
      why not a means t-test?  If Ngroups = 5, then the number of pairs
        is 5*4/2 = 10, so even at alpha = 5% (a 1 in 20 chance) you have
        roughly a 40% chance (1 - 0.95**10) of finding at least one
        spurious "significant" difference.
      The ANOVA, on the other hand, finds one *single* statistic.
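The test-of-independence recipe (steps a-e above) can be sketched in Python. This is a minimal illustration, assuming numpy and scipy are available; the observed counts here are made-up numbers, since the survey table in these notes only gives its margins:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts (made-up data, not the survey above)
observed = np.array([[30, 15,  5],
                     [20, 25, 25]])

# (a) expected counts from the margins: E_ij = row_i * col_j / grand total
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()

# (c) ChiSq = sum( (observed - expected)**2 / expected )
chisq = ((observed - expected) ** 2 / expected).sum()

# (d) degrees of freedom = (rows - 1) * (columns - 1)
k = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# (e) compare with the critical value at alpha = 0.05 (the "appendix" lookup)
critical = chi2.ppf(0.95, df=k)
print(chisq, k, critical, chisq > critical)

# scipy's one-call version agrees (correction=False matches the hand formula)
stat, pval, dof, exp = chi2_contingency(observed, correction=False)
```

With these counts the statistic lands well above the 5% critical value, so H0 (no relation between rows and columns) would be rejected.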
      Example: one-way (or one-factor) ANOVA

                  drug A     B     C     D     E
          ---------------------------------------------
          group 1  |     |     |     |     |    N = 30 people
                2  |     |     |     |     |        25
                3  |     |     |     |     |        32
                4  |     |     |     |     |        10
          ---------------------------------------------
          means:      mA     mB     mC     mD     mE
          variances:  sA**2  sB**2  sC**2  sD**2  sE**2

      H0: the mean of each column is the same
      method:
        (a) estimate the variance within treatments
              sW**2 = ( sA**2 + sB**2 + sC**2 + sD**2 + sE**2 ) / 5
                (for equal group sizes; otherwise weight each variance
                 by its degrees of freedom)
              kW = degrees of freedom
                 = total N - number of treatments = (30+25+32+10) - 5
        (b) estimate the variance among treatments
              sAmong**2 = n * variance(mA, mB, mC, mD, mE)
                where n = number of people per treatment
                (n is the group size, not the number of treatments)
              kAmong = degrees of freedom = 5 - 1 = 4
                where 5 = number of drug treatments
        (c) calculate F = sAmong**2 / sW**2 with (kAmong, kW) degrees of
            freedom, and use the F-table to compare with the critical value
            to see if we can reject the null hypothesis

      Other variations: two-way (or two-factor)

* check out my online Resources for more places to read about this stuff,
  especially the Rice Virtual Lab
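The one-way ANOVA steps (a)-(c) can be sketched the same way. Again a minimal Python illustration assuming numpy and scipy; the measurements are made-up (the drug table in the notes has no cell values), and the sums-of-squares version of the formulas is used so unequal group sizes are handled:

```python
import numpy as np
from scipy.stats import f, f_oneway

rng = np.random.default_rng(0)
# Hypothetical measurements for 3 treatments (made-up data)
groups = [rng.normal(10.0, 2.0, size=12),
          rng.normal(11.5, 2.0, size=15),
          rng.normal(10.2, 2.0, size=10)]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

# (a) variance within treatments (pooled), kW = total N - number of treatments
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
kW = n_total - k
s_within = ss_within / kW

# (b) variance among treatments (group means weighted by group size), kA = k - 1
ss_among = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
kA = k - 1
s_among = ss_among / kA

# (c) F = among/within, compared against the F-table with (kA, kW) dof
F = s_among / s_within
critical = f.ppf(0.95, kA, kW)
print(F, (kA, kW), critical, F > critical)

# scipy's f_oneway computes the same single statistic in one call
stat, pval = f_oneway(*groups)
```

The point from the notes shows up here: however many treatments there are, the whole comparison collapses into one F value with one critical-value lookup.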