Applied
-
In this exercise, we will generate simulated data, and will then use this data to perform forward and backward stepwise selection.
-
(a) Create a random number generator and use its normal() method to generate a predictor X of length n = 100, as well as a noise vector ϵ of length n = 100.
-
(b) Generate a response vector Y of length n = 100 according to the model
Y = β0 + β1 X + β2 X^2 + β3 X^3 + ϵ,
where β0, β1, β2, and β3 are constants of your choice.
-
(c) Use forward stepwise selection in order to select a model containing the predictors X, X^2, ..., X^10. What is the model obtained according to C_p? Report the coefficients of the model obtained.
-
(d) Repeat (c), using backwards stepwise selection. How does your answer compare to the results in (c)?
-
(e) Now fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.
-
(f) Now generate a response vector Y according to the model
Y = β0 + β7 X^7 + ϵ,
and perform forward stepwise selection and the lasso. Discuss the results obtained.
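
As a starting point for parts (a) to (c), the following is a minimal sketch rather than a reference solution. It assumes NumPy only, a response that is cubic in X plus noise, and the criterion C_p = (RSS + 2 d σ̂^2) / n, with σ̂^2 estimated from the full ten-predictor fit. The seed, the coefficient values, and the helper name rss_and_coef are illustrative choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen arbitrarily
n = 100
x = rng.normal(size=n)              # (a) predictor X
eps = rng.normal(size=n)            # (a) noise vector

# (b) Hypothetical constants; the exercise leaves the choice to the reader.
beta0, beta1, beta2, beta3 = 1.0, 2.0, -3.0, 0.5
y = beta0 + beta1 * x + beta2 * x**2 + beta3 * x**3 + eps

# Candidate predictors X, X^2, ..., X^10 (column j holds X^(j+1)).
X = np.column_stack([x**k for k in range(1, 11)])


def rss_and_coef(cols):
    """Least squares fit on an intercept plus the given columns of X."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid), coef


# Noise variance estimated from the full model (10 predictors + intercept).
sigma2 = rss_and_coef(list(range(10)))[0] / (n - 11)

# (c) Forward stepwise: start from the null model and, at each step,
# add the predictor that gives the largest drop in RSS.
selected, remaining = [], list(range(10))
r0, c0 = rss_and_coef(selected)
path = [(r0 / n, [], c0)]           # (C_p, columns, coefficients)
while remaining:
    best = min(remaining, key=lambda j: rss_and_coef(selected + [j])[0])
    selected.append(best)
    remaining.remove(best)
    r, coef = rss_and_coef(selected)
    cp = (r + 2 * len(selected) * sigma2) / n
    path.append((cp, list(selected), coef))

# Pick the model along the path with the smallest C_p.
best_cp, best_cols, best_coef = min(path, key=lambda t: t[0])
print("powers of X selected:", [j + 1 for j in best_cols])
print("coefficients (intercept first):", best_coef)
```

Backward stepwise selection for part (d) can reuse rss_and_coef by starting from all ten columns and removing, at each step, the column whose removal increases the RSS the least.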
-
In this exercise, we will predict the number of applications received using the other variables in the College data set.
-
(a) Split the data set into a training set and a test set.
-
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
-
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
-
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
-
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
-
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
-
- (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
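
The workflow in parts (a) to (c) can be sketched as follows. This is a hedged illustration only: it uses a synthetic stand-in for the College data (the real predictors and the applications response would replace X and y), NumPy only, a 50/50 split, five-fold cross-validation, and an arbitrary grid of λ values. The helper names fit_ridge, mse, and cv_error are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)      # seed chosen arbitrarily

# Synthetic stand-in (an assumption): swap in the College predictors
# and the number of applications as the response.
n, p = 400, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# (a) Random split into training and test halves.
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]
Xtr, ytr, Xte, yte = X[train], y[train], X[test], y[test]


def fit_ridge(X, y, lam):
    """Ridge on standardized predictors with an unpenalized intercept.

    With lam = 0 this reduces to ordinary least squares.
    """
    mu, sd, ybar = X.mean(axis=0), X.std(axis=0), y.mean()
    Z = (X - mu) / sd
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]),
                           Z.T @ (y - ybar))
    return lambda Xnew: ybar + ((Xnew - mu) / sd) @ beta


def mse(model, X, y):
    """Mean squared prediction error of a fitted model."""
    return float(np.mean((y - model(X)) ** 2))


def cv_error(X, y, lam, k=5):
    """k-fold cross-validated MSE for a given lam."""
    rows = np.arange(len(y))
    errs = []
    for hold in np.array_split(rows, k):
        keep = np.setdiff1d(rows, hold)
        errs.append(mse(fit_ridge(X[keep], y[keep], lam), X[hold], y[hold]))
    return float(np.mean(errs))


# (b) Least squares test error.
ols_test_mse = mse(fit_ridge(Xtr, ytr, 0.0), Xte, yte)

# (c) Ridge with lam chosen by cross-validation on the training set only.
lambdas = 10.0 ** np.linspace(-3, 3, 25)
cv = [cv_error(Xtr, ytr, lam) for lam in lambdas]
best_lam = float(lambdas[int(np.argmin(cv))])
ridge_test_mse = mse(fit_ridge(Xtr, ytr, best_lam), Xte, yte)

print("OLS test MSE:", ols_test_mse)
print("ridge test MSE:", ridge_test_mse, "at lambda =", best_lam)
```

The test set is touched only once per model, after λ has been fixed, so the reported test errors are not contaminated by the tuning step. The same pattern (tune on the training set, evaluate once on the test set) carries over to the lasso, PCR, and PLS parts.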
- We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
- (a) Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model