Entry for April 22, 2013
So for this week, I've been trying to wrap up the material I feel I haven't gotten through yet. They are time series, methods of exploring data in R, some tangents into exploring probability theory with R, and various other ways of exploring what the Marlboro data set has to offer (namely nonparametric methods and statistical learning). Some of the R code is posted below, as well as the data set I did most of it with (export2.csv).
- I also have a new course description.
I'll go through each topic one by one.
Statistical Learning
In this section, I revisit the Elements of Statistical Learning book, and eventually stray away to other sources because its more dense than seemingly.
The book makes the distinction between supervised and unsupervised learning, in that supervised the goal is prediction, whereas in the latter, description of patterns is the only goal. While non-parametric prediction would be cool, both are interesting.
Let's begin with some supervised learning techniques.
Revisiting Best Subset Selection
One of the sections I visited was the discussion of subset selection, since I had some difficulty figuring out how to run a best subsets procedure in R. The book mentions backward-stepwise selection as one of the best subset selection models, but that it can only be used if the number of observations, \(N > \alpha\) is greater than the number of parameters. With our data set, we should have no problems there, but due to that restriction the book hardly mentions it.
Therefore, I had to look it up myself.
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Of course, I have yet to use this on the Marlboro data set, but I'm saving this for the final write-up.
Non-linear Regression
In looking through the best subsets R code, I found an interesting looking paper by John Fox on Nonlinear Regression and Nonlinear Least Squares, so I figured I'D go through it.
And I'm stuck. I don't know what nls does.
Non-Parametric Regression
Most of what I've done can be found in the exploratory.R script.
For Next Time
After this, I'd like to get started on two things, the Marlboro data set writeup and final exam, in addition to playing around a bit more in R, especially with the statistical learning stuff.
Course Description
With the widespread adoption of the internet, the sheer amount of data accessible to researchers, companies, advertisers, and regular people has increased exponentially, creating a so called era of big data. This newfound increase in the quantity of information known about everything from consumer preferences to global trends has lead statisticians to consider how best to analyze and use enormous data sets to answer questions.
Building off of prior course work in Statistics, this course will cover selected topics from Probability Theory and Advanced Data Analysis, including multiple regression, time-series analysis, random variables, and stochastic processes.
In addition, the course will briefly look into techniques employed in data mining, and some non-parametric statistical learning. The course and subsequent projects will emphasize the use of the R statistical computing package and its various add-on packages to modern big data statistical analyses, building and improving on prior R experience. This theory will be applied to various questions that can be explored at Marlboro College as well as how big data and use of statistics freely available on the internet may be revolutionizing the way economic research is conducted. For students taking this class for five credits, some more time will be devoted to the use of data visualizations and info-graphics to present and display findings obtained from these data sets.
Entry for Spring Break
- For the first section, I've used export.csv (attached below). For the second section, I've used export2.csv (also attached).
- Before beginning, it's worth revisiting what question I said I would investigate for my independent project. Earlier, I said that the "question I'd like to investigate is exploring the link between admissions metrics (H.S. GPA, etc.) and success at Marlboro" (see earlier entry). I think that's pretty close to what I've been looking at over the break. Therefore, I've been working on building models to linking various admissions metrics to success at Marlboro. While I intended originally for success at Marlboro to be a holistic measure (GPA, improvement from H.S. GPA, Plan GPA), for now I've just been using Marlboro GPA and Plan GPA's as the dependent or response variable(this will change in time). Therefore, I've been trying to look at which explanatory variables influence GPA here the most. Then, before beginning, I'll summarize a bit of what the Regression Analysis book and R Cookbook have to say on building effective models.
Determining Statistical Significance
First, in order to determine whether our model is statistically significant, we must be able to reject the null hypothesis that \(\beta_1=\beta_2=...=\beta_n=0\), that all of our n parameters add nothing to our model and there is no relationship between our x's and y's. Therefore we look at the tabulated F-statistic to determine whether we can reject the null hypothesis, given by \(k\) (number of parameters) degrees of freedom in the numerator and \(n\) (number of observations) \(-(k+1)\) degrees of freedom in the denominator.
Before building some master multiple regression model that will tell us exactly who will be the most successful at Marlboro, let's look at a simple linear regression model and test its utility. This will provide a foundation for determining the best multiple regression model. I plotted Plan GPA against Marlboro GPA for a simple model, presuming that there would be a strong relationship between the two. To show what R outputs about the simple model:
>pmodel <- lm(Plan.GPA ~ Marlboro.GPA)
> summary(pmodel)
Coefficients: Residuals:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.27445 0.10396 2.64 0.0085 ** Min 1Q Median 3Q Max
Marlboro.GPA 0.96002 0.03054 31.44 <2e-16 *** -1.0969 -0.1469 -0.0125 0.1457 1.0339
Residual standard error: 0.2582 on 617 degrees of freedom
(635 observations deleted due to missingness) #entries without Plan GPA's, current and withdrawn students
Multiple R-squared: 0.6157, Adjusted R-squared: 0.615
F-statistic: 988.4 on 1 and 617 DF, p-value: < 2.2e-16
Then to reject our null hypothesis for this model, the critical F-value is given by 1 df in the numerator and 617 in the denominator, as the R-output shows. Then the tabulated F-value at 95% confidence is \(\cong 3.86\). Since our F-statistic is 988.4, since \(988.4 > 3.86\), this leads us to be able to reject our null hypothesis. Similarly, looking at our p-value and 95% confidence level, since \(0.05>2.2e-16\), we have reason to believe that our model is significant. This states that there is only a \(2.2(10^{-16})\) probability that our model is insignificant, which is good.
Now since the model has been shown to be significant, we can turn to the coefficient of determination, or \(r^2\) and coefficient of variation (although we'll primarily be looking at R-squared). Let's look first at the Residual standard error of 0.2582 on the output. This is the sample standard deviation of our error \(\epsilon\). Then we should expect most of our observed y-values (Plan GPA's) to fall within 2(0.2582) of the their predicted values given by the regression line (y=0.96x+0.27, in this case). Therefore, having a small estimated standard deviation of \(\epsilon\) is a good thing. To measure just how large this number is, we use the coefficient of variation, or the ratio between the estimated standard deviation of the error (given by s) and the sample mean (given by \(\bar y\), given as a percentage by the formula \( C.V. = 100(s / \bar y)\). Generally, we'd like for this to be smaller than 10%. Plugging in with a sample mean of 3.526, we get a C.V. of \( \cong 7.32\), which is lower than 10%. However, generally, most look at the r-squared value instead. Since we have a linear model, we can interpret our r-squared value of 0.6157 as that about 61.6% of the variation in the Plan GPA can be attributed to using Marlboro GPA to predict the final Plan score. Therefore, Marlboro GPA is fairly closely correlated to Plan GPA and serves as a pretty valuable predictor.
Moving onto Multiple Regression Models and Determining their Significance
In order to build a first-order multiple regression model (one with quantitative independent variables) in R, one can use the formula:
> model <- lm(y ~ u + v + w)
As before, R uses the least squares method to fit the regression line (lm) to the data. As before, we need a global test to check if the model is statistically significant. Analogously, we find this out through our F-Test and rejection p-value. However, since we now have multiple parameters, it is worth finding some way to determine which ones are significant and which are not.
What we are looking for is the likelihood that the true coefficient of the parameter is zero, or \(\beta=0\), and adds nothing to our model. There are two methods of evaluating this in order to try to reject this null hypothesis. They are t statistics and our p value, noted by the R output as t value and Pr(>|t|) respectively. Let's look at a baseline multiple regression model I built to compare some parameters against GPA at Marlboro.
> model <- lm(Plan.GPA ~ Marlboro.GPA + Median.Income + Population)
> summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.632e-01 1.091e-01 2.412 0.0162 *
Marlboro.GPA 9.611e-01 3.064e-02 31.367 <2e-16 ***
Median.Income 1.662e-07 3.272e-07 0.508 0.6118
Population -2.318e-07 6.760e-07 -0.343 0.7317
----
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, this model looks at how significant the Population Density of a student's neighborhood, median income of said neighborhood, and Marlboro GPA are in terms of predicting the score one should get on Plan, assuming that whatever relationship we might have, it is linear and that our x's are probabilistically independent and don't interact (of which, both certainly could be wrong). We already looked at the significance of Marlboro GPA on predicting this, and of course learned, it is pretty significant. So it's worth noting that the p-value for the entire linear model last time is equivalent to the p-value for simply Marlboro GPA as a parameter. It's also worth noting that the t-value, is again, the same. It's worth mentioning what our t-value is based on. It is given by \(\frac{\bar \beta_i}{s_{\bar \beta_t}}\) or the sample mean of the parameter over its sample standard deviation. Then Pr(>|t|), or the p-value, is analogous to what the book states about the rejection region for a parameter, as given by \( |t| > t_{\alpha/2}\), where \(t_{\alpha/2}\) is based upon n-(k+1) degrees of freedom. The p-value measures the likelihood that t is actually in this rejection region.
So then, let's look at the p-value for these parameters. The output states that while Marlboro GPA is significant, since both .6118 and .7317 > 0.05 (or whatever confidence level you wish to use), it seems pretty unlikely that the Population density or median income of one's area influence your plan score much. R helps you figure out what's interesting and what's not pretty quickly since, as shown by the signif. codes section, it will put a * (or .) next to the ones that exceed the given confidence level.
Let's examine the output for the \(R^2 and R_a^2\) for this model before moving on.
Multiple R-squared: 0.6159, Adjusted R-squared: 0.614
This suggests that even though Population and Median Income are insignificant, the Marlboro GPA continues to make this model look significant. Therefore, looking at individual parameters is an important step. Although the book only recommends evaluating the interaction or squared \(\beta\)'s, since R conducts the test on every variable it's still worth looking at the individual parameters, but obviously put the major emphasis on the p-value of the most important \(\beta\)'s.
Improving Multiple Regression Models
Then how can we improve this model? The first idea is to hypothesize whether the x's interact (are related to each other) and whether there is some curvature to the model. The second is to add some parameters and see whether or not the \(R^2_a\) and 2s terms increase. However, this can come at the expense of being parsimonious. The book describes a F-test for comparing nested models, but without summarizing the test, the result of the findings is to suggest to the analyst to cherry pick the most significant parameters through use of the f-statistic (and corresponding p-value). The third is to include qualitative factors like gender or race in our spreadsheet. However, the book presents a pretty cogent example for why we shouldn't just add everything into our model- we will quickly have a lot of \(\beta\) terms and therefore require a lot of observations to prove any sort of significance. These are all good ideas, but still, what exactly should we put into our model?
Therefore, there are two (well, there's more than that.. but) ways we can go about determining what should go into our model. The first way is to simply theorize which parameters will be significant and which won't. This may sound odd... given that I've been leading up to how we should find that out, but its helpful to keep in mind that although R can find correlation between two variables, its unclear whether that is a signal we're looking for or just noise. Although, for instance, inquiry students from a certain college list might, on average have higher GPA's at Marlboro, we still need to use a bit of intuition to determine whether or not this tells us anything about the actual predictive ability of that parameter. The book writes that the most helpful way (other than intuition and theory, which are often more important) to determine whether one's model is predictive and looks at causation, and not just correlation, is to add new data into the model, and see whether it is still significant at determining the dependent variable. In absence of new data, it also works to segment the data randomly and see whether the model works roughly the same for each data set. Fortunately, for admissions data, as the next year of students comes in, it is easy to re-evaluate one's model.
The other way is the variety of variable screening methods which statistical software provides us. The first way to build a model is to have R run through the various options for first order models and find which ones seem to be the most significant. The other is to have R run through all possibilities for models, including higher order terms and interaction terms, and look at which model has the highest \(R^2_a\) and \(2s\) score. As stated earlier, for some of these models we will need to use our intuition to select which more model is more predictive (and perhaps parsimonious), not just the one with the highest "on-paper" significance, one of the dangers of using R to build models. Other concerns include, as the book posits, the program will perform many tests on various models, and therefore the probability of a Type 1 or 2 error is high. Additionally, having the program run through all the possibilities of higher order models and interaction terms may result in nonsensical models. Therefore, as stressed earlier, common sense is necessary to make any predictive formula.
It looks like in R, this code will let me look at the various options for first order models:
> subset <- regsubsets(\(y ~ x_1 + x_2 + ... + x_n\))
Where nvmax = n defines the maximum size model. Then, we can call
> plot(subset, scale="")
Where the value for scale is either "Cp", "adjr2", "r2", or "bic". The "\(C_p\)" term is new, and it focuses on minimizing total mean square error and regression bias. It is measured as \(C_p \frac{SSE_p}{MSE_k}+2(p+1)-n\), where p is the number of \(\beta\)'s in our model, and n is sample size, and k is the total number of \(\beta\)'s. These help us determine which model is the most statistically significant. Then,
> summary(subset)
lets us know the specifics of each model, since plot() doesn't specify them.
Example of Model Building based on R Best Subsets Procedures
So, looking at the export2.csv data set, we have alot of data. And a lot of variables. We could theorize about the relationships some of them have on Marlboro GPA (or lack thereof), but let's use the best subsets procedure to quickly process all of them. So let's build this model using:
> subsets <-regsubsets(Marlboro.GPA ~ Gender + Taken.ACT. + Taken.AP.s. + Taken.SAT., data=stuff)
> plot(subsets) #see attached for image (titled plot.jpg)
> summary(subsets)
Subset selection object
Call: regsubsets.formula(Marlboro.GPA ~ Gender + Taken.ACT. + Taken.AP.s. + Taken.SAT., data = stuff)
5 Variables (and intercept)
Forced in Forced out
GenderM FALSE FALSE
GenderU FALSE FALSE
Taken.ACT.Y FALSE FALSE
Taken.AP.s.Y FALSE FALSE
Taken.SAT.Y FALSE FALSE
1 subsets of each size up to 5
Selection Algorithm: exhaustive
GenderM GenderU Taken.ACT.Y Taken.AP.s.Y Taken.SAT.Y
1 ( 1 ) " " " " " " " " "*"
2 ( 1 ) " " " " " " "*" "*"
3 ( 1 ) "*" " " " " "*" "*"
4 ( 1 ) "*" "*" " " "*" "*"
5 ( 1 ) "*" "*" "*" "*" "*"
However, if we want more descriptive statistics, we then need to build the model of our choosing individually and ask R to summarize it for us. So if we had wanted the fifth model, we would do:
> model ← lm(Marlboro.GPA ~ Gender + Taken.ACT. + Taken.AP.s. + Taken.SAT., data=stuff)
> summary(model)
Which gives us:
Residuals:
Min 1Q Median 3Q Max
-3.1570 -0.2048 0.1935 0.5030 1.3019
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.15696 0.03141 100.500 < 2e-16 ***
GenderM -0.11081 0.04401 -2.518 0.01191 *
GenderU 0.62304 0.79987 0.779 0.43616
Taken.ACT.Y -0.06300 0.10782 -0.584 0.55907
Taken.AP.s.Y 0.40542 0.12328 3.289 0.00103 **
Taken.SAT.Y -0.34800 0.06115 -5.691 1.56e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7993 on 1326 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.03292, Adjusted R-squared: 0.02927
F-statistic: 9.027 on 5 and 1326 DF, p-value: 1.892e-08
Other Considerations
Let's check now for multicollinearity in this model. Multicollinearity is said to exist when several of our predictor variables are correlated. In our case, I purposefully threw in a couple that are clearly related (Mean.Income and Median.Income) so we could perform some tests to determine multicollinearity. Signs of multicollinearity are significant global F-tests and insignificant t-tests for most of the parameters, and a variance inflation factor greater than 10, where VIF is given by \((VIF)_i = \frac{1}{1-R^2_i}\), and \(R^2_i\) is the multiple coefficient of determination. Then, basically, multicollinearity can exist when \(R^2_i \leq .1\). Then, for this particular model, it seems unlikely that the model expresses multicollinearity.
In Progress
Additional Notes
I got through the book up to the chapter on Time Series modeling and forecasting (10), since chapter 9 is mostly just describing what to do in various circumstances.
Durbin-Watson d statistic - residual correlation
\( d = \frac{\sum^n_{t=2}(\hat \epsilon_t = \hat \epsilon_{t-1})^2}{\sum^n_{t=1} \hat \epsilon^2_t}\)
Where \( 0 \leq d \leq 4\), and \(d \approx 2\) if the residuals are uncorrelated, d <2 if positively correlated, d>2 if negatively correlated, with 0 and 4 being the strongest cases.
Loess Regression
I looked into loess regression in the non-parametric statistics book, but it merely mentions the formula for the loess regression and other regressions. I probably need to look for specific