Conceptual

  1. We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers:

    • (a) Which of the three models with k predictors has the smallest training RSS?

    • (b) Which of the three models with k predictors has the smallest test RSS?

    • (c) True or False:

      • i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

      • ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

      • iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.

      • iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.

      • v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.
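Forward stepwise selection is easy to sketch directly, which may help with part (c). The following is a minimal illustration on synthetic data (the data-generating choices and variable names are mine, not from the text); each step adds the remaining predictor that most reduces training RSS, so the selected models are nested by construction.

```python
# A minimal sketch of forward stepwise selection on synthetic data
# (illustrative only). Each step greedily adds the remaining predictor
# that most reduces training RSS.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)

def rss(cols):
    """Training RSS of least squares (with intercept) on the given columns."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

selected, remaining, path = [], list(range(p)), []
for k in range(p):
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    path.append(list(selected))

# By construction the k-variable model is nested in the (k+1)-variable model.
assert all(set(path[k]) <= set(path[k + 1]) for k in range(p - 1))
print(path)
```

Best subset selection, by contrast, refits from scratch at every model size k, so nothing forces its models to be nested.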

  2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

    • (a) The lasso, relative to least squares, is:

      • i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

      • ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

      • iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

      • iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

    • (b) Repeat (a) for ridge regression relative to least squares.

    • (c) Repeat (a) for non-linear methods relative to least squares.

284 6. Linear Model Selection and Regularization

  3. Suppose we estimate the regression coefficients in a linear regression model by minimizing
\[\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s\]

for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

  • (a) As we increase s from 0, the training RSS will:

    • i. Increase initially, and then eventually start decreasing in an inverted U shape.

    • ii. Decrease initially, and then eventually start increasing in a U shape.

    • iii. Steadily increase.

    • iv. Steadily decrease.

    • v. Remain constant.

  • (b) Repeat (a) for test RSS.

  • (c) Repeat (a) for variance.

  • (d) Repeat (a) for (squared) bias.

  • (e) Repeat (a) for the irreducible error.
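One way to build intuition for part (a) is a quick numerical check. The sketch below is my own illustration using the closed-form ridge solution; shrinking the penalty λ toward 0 corresponds to growing the budget s in the constrained form above.

```python
# Numerical check (illustrative only): training RSS as the ridge
# constraint is relaxed. Decreasing lambda plays the role of increasing s.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

def ridge_beta(lam):
    """Closed-form ridge coefficients (intercept omitted for simplicity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

train_rss = []
for lam in [100.0, 10.0, 1.0, 0.1, 0.0]:   # decreasing lambda = growing s
    r = y - X @ ridge_beta(lam)
    train_rss.append(float(r @ r))

# As the constraint is relaxed, the model fits the training data ever better.
assert all(a >= b for a, b in zip(train_rss, train_rss[1:]))
print([round(v, 2) for v in train_rss])
```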

  4. Suppose we estimate the regression coefficients in a linear regression model by minimizing
\[\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2\]

for a particular value of λ. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.

  • (a) As we increase λ from 0, the training RSS will:

    • i. Increase initially, and then eventually start decreasing in an inverted U shape.

    • ii. Decrease initially, and then eventually start increasing in a U shape.

    • iii. Steadily increase.

    • iv. Steadily decrease.

    • v. Remain constant.

  • (b) Repeat (a) for test RSS.

  • (c) Repeat (a) for variance.

  • (d) Repeat (a) for (squared) bias.

  • (e) Repeat (a) for the irreducible error.

6.6 Exercises 285

  5. It is well known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.

Suppose that \(n = 2\), \(p = 2\), \(x_{11} = x_{12}\), \(x_{21} = x_{22}\). Furthermore, suppose that \(y_1 + y_2 = 0\) and \(x_{11} + x_{21} = 0\) and \(x_{12} + x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\hat{\beta}_0 = 0\).

  • (a) Write out the ridge regression optimization problem in this setting.

  • (b) Argue that, in this setting, the ridge coefficient estimates satisfy \(\hat{\beta}_1 = \hat{\beta}_2\).

  • (c) Write out the lasso optimization problem in this setting.

  • (d) Argue that in this setting, the lasso coefficients \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are not unique; in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.
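A concrete numerical instance of this setting may help; the particular x, y, and λ values below are my choices, not from the text.

```python
# One instance of this exercise's setting: x11 = x12 = 1, x21 = x22 = -1,
# y1 = 2, y2 = -2, so the intercept estimate is 0 as stated.
import numpy as np

X = np.array([[1.0, 1.0],
              [-1.0, -1.0]])    # the two predictors are identical
y = np.array([2.0, -2.0])
lam = 1.0                       # arbitrary positive penalty

# Ridge closed form: (X'X + lam*I)^{-1} X'y gives equal coefficients.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
assert np.isclose(beta_ridge[0], beta_ridge[1])
print(beta_ridge)               # both coefficients equal 4 / (4 + lam)

def lasso_obj(b1, b2):
    """Lasso criterion: RSS plus lam * (|b1| + |b2|)."""
    r = y - X @ np.array([b1, b2])
    return float(r @ r) + lam * (abs(b1) + abs(b2))

# Because only b1 + b2 enters the fit, any nonnegative split of the same
# total gives the same objective value, so the lasso solution is not unique.
assert np.isclose(lasso_obj(1.75, 0.0), lasso_obj(0.875, 0.875))
```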

  6. We will now explore (6.12) and (6.13) further.

    • (a) Consider (6.12) with \(p = 1\). For some choice of \(y_1\) and \(\lambda > 0\), plot (6.12) as a function of \(\beta_1\). Your plot should confirm that (6.12) is solved by (6.14).

    • (b) Consider (6.13) with \(p = 1\). For some choice of \(y_1\) and \(\lambda > 0\), plot (6.13) as a function of \(\beta_1\). Your plot should confirm that (6.13) is solved by (6.15).
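Equations (6.12)–(6.15) are not reproduced here. Assuming their standard one-dimensional forms, the ridge objective \((y_1 - \beta_1)^2 + \lambda\beta_1^2\) minimized at \(y_1/(1+\lambda)\), and the lasso objective \((y_1 - \beta_1)^2 + \lambda|\beta_1|\) minimized by soft-thresholding \(y_1\) at \(\lambda/2\), a grid search can stand in for the plots:

```python
# Grid-search stand-in for the plots (assumes the standard 1-D objectives
# described above; y1 and lam are arbitrary choices with lam > 0).
import numpy as np

y1, lam = 3.0, 2.0
grid = np.linspace(-5.0, 5.0, 100001)    # fine grid of candidate beta_1

ridge_obj = (y1 - grid) ** 2 + lam * grid ** 2
lasso_obj = (y1 - grid) ** 2 + lam * np.abs(grid)

beta_ridge = grid[np.argmin(ridge_obj)]
beta_lasso = grid[np.argmin(lasso_obj)]

# Ridge minimizer: y1 / (1 + lam); lasso minimizer: soft-threshold at lam/2.
assert np.isclose(beta_ridge, y1 / (1 + lam), atol=1e-3)
soft = np.sign(y1) * max(abs(y1) - lam / 2, 0.0)
assert np.isclose(beta_lasso, soft, atol=1e-3)
print(beta_ridge, beta_lasso)
```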

  7. We will now derive the Bayesian connection to the lasso and ridge regression discussed in Section 6.2.2.

    • (a) Suppose that \(y_i = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i\), where \(\epsilon_1, \ldots, \epsilon_n\) are independent and identically distributed from a \(N(0, \sigma^2)\) distribution. Write out the likelihood for the data.

    • (b) Assume the following prior for \(\beta\): \(\beta_1, \ldots, \beta_p\) are independent and identically distributed according to a double-exponential distribution with mean 0 and common scale parameter \(b\): i.e. \(p(\beta) = \frac{1}{2b}\exp(-|\beta|/b)\). Write out the posterior for \(\beta\) in this setting.

    • (c) Argue that the lasso estimate is the mode for \(\beta\) under this posterior distribution.

    • (d) Now assume the following prior for \(\beta\): \(\beta_1, \ldots, \beta_p\) are independent and identically distributed according to a normal distribution with mean zero and variance \(c\). Write out the posterior for \(\beta\) in this setting.

    • (e) Argue that the ridge regression estimate is both the mode and the mean for β under this posterior distribution.
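A numerical sanity check of the lasso-MAP correspondence in parts (b) and (c), for the simplest case p = 1 with known σ (the specific values are my choices): the posterior mode under a double-exponential prior with scale b coincides with the lasso solution with λ = 2σ²/b, since the negative log posterior is proportional to the lasso criterion.

```python
# Grid-search check (illustrative only) for p = 1 with x_i = 1, so that
# y_i = beta + eps_i with known sigma. The negative log posterior under a
# double-exponential prior with scale b is proportional to the lasso
# objective with lambda = 2 * sigma^2 / b, so the minimizers coincide.
import numpy as np

rng = np.random.default_rng(2)
sigma, b = 1.0, 0.5
lam = 2 * sigma ** 2 / b                 # lambda implied by the prior scale
y = 0.3 + sigma * rng.normal(size=20)    # data with true beta = 0.3

grid = np.linspace(-2.0, 2.0, 40001)
rss = ((y[:, None] - grid) ** 2).sum(axis=0)
lasso_obj = rss + lam * np.abs(grid)                       # lasso criterion
neg_log_post = rss / (2 * sigma ** 2) + np.abs(grid) / b   # up to a constant

beta_lasso = grid[np.argmin(lasso_obj)]
beta_map = grid[np.argmin(neg_log_post)]
assert np.isclose(beta_lasso, beta_map)   # same minimizer: lasso = MAP
print(beta_lasso)
```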

