Applied

In this exercise, we will generate simulated data, and will then use this data to perform forward and backward stepwise selection.
- (a) Create a random number generator and use its normal() method to generate a predictor X of length n = 100, as well as a noise vector ϵ of length n = 100.
- (b) Generate a response vector Y of length n = 100 according to the model

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon\]

where β₀, β₁, β₂, and β₃ are constants of your choice.
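As a starting point for parts (a) and (b), the simulation can be sketched as follows. This is a minimal sketch assuming NumPy's `default_rng` generator; the seed and the four coefficient values are arbitrary illustrative choices, not values prescribed by the exercise.

```python
import numpy as np

# Seed chosen arbitrarily so that the run is reproducible.
rng = np.random.default_rng(1)

n = 100
X = rng.normal(size=n)    # predictor of length n
eps = rng.normal(size=n)  # noise vector of length n

# Illustrative constants (an assumption; the exercise leaves them to you).
beta0, beta1, beta2, beta3 = 2.0, 1.5, -1.0, 0.5

# Response generated from the cubic model in part (b).
Y = beta0 + beta1 * X + beta2 * X**2 + beta3 * X**3 + eps
```

Changing the seed or the constants gives a different data set, which is a useful way to check how stable the later selection results are.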

(c) Use forward stepwise selection to select a model containing the predictors X, X², ..., X¹⁰. What is the model obtained according to C_p? Report the coefficients of the model obtained.
(d) Repeat (c), using backwards stepwise selection. How does your answer compare to the results in (c)?
(e) Now fit a lasso model to the simulated data, again using X, X², ..., X¹⁰ as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.
(f) Now generate a response vector Y according to the model

\[Y = \beta_0 + \beta_7 X^7 + \epsilon\]

and perform forward stepwise selection and the lasso. Discuss the results obtained.
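One possible way to approach the forward stepwise part of (c) is sketched below using only NumPy least squares. It assumes the form C_p = (RSS + 2dσ̂²)/n, with σ̂² estimated from the full ten-predictor model; the helper `rss`, the seed, and the coefficients are hypothetical choices made for illustration, and a dedicated model-selection library could be used instead.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 100
X = rng.normal(size=n)
eps = rng.normal(size=n)
# Illustrative coefficients (an assumption, not given in the exercise).
Y = 2.0 + 1.5 * X - 1.0 * X**2 + 0.5 * X**3 + eps

# Candidate predictors: columns X, X^2, ..., X^10.
P = np.column_stack([X**k for k in range(1, 11)])

def rss(cols):
    """Least-squares fit on an intercept plus the columns in `cols`.

    Returns the residual sum of squares and the fitted coefficients.
    """
    A = np.column_stack([np.ones(n)] + [P[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    return float(resid @ resid), coef

# Noise variance estimated from the full model (intercept + 10 predictors).
full_rss, _ = rss(list(range(10)))
sigma2 = full_rss / (n - 11)

selected, remaining, path = [], list(range(10)), []
while remaining:
    # Greedy step: add the predictor that lowers RSS the most.
    best = min(remaining, key=lambda j: rss(selected + [j])[0])
    selected.append(best)
    remaining.remove(best)
    r, coef = rss(selected)
    cp = (r + 2 * len(selected) * sigma2) / n
    path.append((cp, list(selected), coef))

# Pick the model size with the smallest C_p along the forward path.
best_cp, best_cols, best_coef = min(path, key=lambda t: t[0])
print("selected powers of X:", sorted(j + 1 for j in best_cols))
print("coefficients (intercept first, in order of entry):", np.round(best_coef, 3))
```

Backward stepwise selection follows the same pattern in reverse: start from all ten predictors and repeatedly drop the one whose removal increases RSS the least.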


(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
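The claim about training and test error can be illustrated with a small simulation. The sketch below assumes a linear model with a few coefficients set exactly to zero, a 100/900 train/test split, and nested models built from the first k columns; all of these are assumptions made for illustration rather than details taken from the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n, p = 1000, 20
X = rng.normal(size=(n, p))

# Assumed data-generating model: linear, with some coefficients exactly zero.
beta = rng.normal(size=p)
beta[[2, 5, 9, 13, 17]] = 0.0
y = X @ beta + rng.normal(size=n)

# Assumed split: first 100 observations for training, the rest for testing.
train = np.arange(n) < 100
test = ~train

def fit(k):
    """Least-squares coefficients (intercept first) using the first k features."""
    A = np.column_stack([np.ones(train.sum()), X[train, :k]])
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    return coef

def mse(coef, mask):
    """Mean squared error of a fitted model on the rows selected by `mask`."""
    k = coef.size - 1
    A = np.column_stack([np.ones(mask.sum()), X[mask, :k]])
    return float(np.mean((y[mask] - A @ coef) ** 2))

train_mse, test_mse = [], []
for k in range(p + 1):
    coef = fit(k)
    train_mse.append(mse(coef, train))
    test_mse.append(mse(coef, test))

for k in range(p + 1):
    print(f"k={k:2d}  train MSE={train_mse[k]:.3f}  test MSE={test_mse[k]:.3f}")
```

Because each model is nested in the next, the training MSE cannot increase as k grows, whereas the test MSE is free to rise once the added features carry no signal.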

(a) Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model
