6.2.2 The Lasso
Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty \(\lambda \sum_j \beta_j^2\) in (6.5) will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless λ = ∞). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large. For example, in the Credit data set, it appears that the most important variables are income, limit, rating, and student. So we might wish to build a model including just these predictors. However, ridge regression will always generate a model involving all ten predictors. Increasing the value of λ will tend to reduce the magnitudes of the coefficients, but will not result in exclusion of any of the variables.
The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat{\beta}^L_\lambda\), minimize the quantity
\[\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| = \text{RSS} + \lambda \sum_{j=1}^p |\beta_j| \quad (6.7)\]
Comparing (6.7) to (6.5), we see that the lasso and ridge regression have similar formulations. The only difference is that the \(\beta_j^2\) term in the ridge regression penalty (6.5) has been replaced by \(|\beta_j|\) in the lasso penalty (6.7). In statistical parlance, the lasso uses an ℓ1 (pronounced “ell 1”) penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by \(\|\beta\|_1 = \sum_j |\beta_j|\).
As with ridge regression, the lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret than those produced by ridge regression. We say that the lasso yields sparse models—that is, models that involve only a subset of the variables. As in ridge regression, selecting a good value of λ for the lasso is critical; we defer this discussion to Section 6.2.3, where we use cross-validation.
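The variable-selection behavior described above is easy to see numerically. The following is a minimal sketch on synthetic data, assuming scikit-learn is available; its `Lasso` estimator's `alpha` parameter plays the role of λ (up to scikit-learn's internal scaling of the RSS term). Only three of the ten predictors truly affect the response, and with a sufficiently large penalty the irrelevant coefficients come back exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]        # only the first 3 predictors matter
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)      # alpha plays the role of lambda
selected = np.flatnonzero(lasso.coef_)  # indices of exactly-nonzero estimates
print("nonzero coefficients:", selected)
```

A ridge fit on the same data would shrink all ten coefficients but leave every one of them nonzero, which is precisely the contrast drawn in the text.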

FIGURE 6.6. The standardized lasso coefficients on the Credit data set are shown as a function of λ and \(\|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1\).
As an example, consider the coefficient plots in Figure 6.6, which are generated from applying the lasso to the Credit data set. When λ = 0, then the lasso simply gives the least squares fit, and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero. However, in between these two extremes, the ridge regression and lasso models are quite different from each other. Moving from left to right in the right-hand panel of Figure 6.6, we observe that at first the lasso results in a model that contains only the rating predictor. Then student and limit enter the model almost simultaneously, shortly followed by income . Eventually, the remaining variables enter the model. Hence, depending on the value of λ , the lasso can produce a model involving any number of variables. In contrast, ridge regression will always include all of the variables in the model, although the magnitude of the coefficient estimates will depend on λ .
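The entry of predictors one at a time along the path, as in Figure 6.6, can be reproduced on synthetic data. The sketch below assumes scikit-learn's `lasso_path` helper; the Credit data itself is not bundled here, so the predictors are stand-ins with coefficients of decreasing magnitude, which enter the model roughly in order of importance as the penalty shrinks.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
# coefficients of decreasing size; the last two predictors are irrelevant
y = X @ np.array([4.0, 2.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

# alphas are returned from largest (null model) to smallest (near least squares)
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
n_active = (np.abs(coefs) > 1e-10).sum(axis=0)   # model size at each alpha
print(list(zip(np.round(alphas, 3), n_active))[:5])
```

Moving down the alpha grid mirrors moving left to right in the figure: the active set starts empty at the largest penalty and grows toward the full least squares model.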
Another Formulation for Ridge Regression and the Lasso
One can show that the lasso and ridge regression coefficient estimates solve the problems
\[\begin{align*} \underset{\beta}{\text{minimize}} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \quad &\text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s \quad (6.8) \\ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \quad &\text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s \quad (6.9) \end{align*}\]
respectively. In other words, for every value of λ, there is some s such that Equations (6.7) and (6.8) will give the same lasso coefficient estimates. Similarly, for every value of λ there is a corresponding s such that Equations (6.5) and (6.9) will give the same ridge regression coefficient estimates.
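The λ-to-s correspondence can be traced numerically: for each penalty level, the ℓ1 norm of the fitted lasso coefficients is the implied budget s in the constrained form (6.8), and it shrinks as λ grows. A small sketch, again assuming scikit-learn (whose `alpha` corresponds to λ up to a scaling of the RSS term):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)

norms = []
for alpha in [0.01, 0.1, 0.5, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    norms.append(np.abs(coef).sum())     # the implied budget s for this lambda
    print(f"lambda={alpha:5}  s={norms[-1]:.3f}")
```

Larger λ maps to a smaller budget s, which is exactly the monotone correspondence the equivalence of (6.7) and (6.8) asserts.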
When p = 2, then (6.8) indicates that the lasso coefficient estimates have the smallest RSS out of all points that lie within the diamond defined by \(|\beta_1| + |\beta_2| \le s\). Similarly, the ridge regression estimates have the smallest RSS out of all points that lie within the circle defined by \(\beta_1^2 + \beta_2^2 \le s\).
We can think of (6.8) as follows. When we perform the lasso we are trying to find the set of coefficient estimates that lead to the smallest RSS, subject to the constraint that there is a budget s for how large \(\sum_{j=1}^p |\beta_j|\) can be. When s is extremely large, then this budget is not very restrictive, and so the coefficient estimates can be large. In fact, if s is large enough that the least squares solution falls within the budget, then (6.8) will simply yield the least squares solution. In contrast, if s is small, then \(\sum_{j=1}^p |\beta_j|\) must be small in order to avoid violating the budget. Similarly, (6.9) indicates that when we perform ridge regression, we seek a set of coefficient estimates such that the RSS is as small as possible, subject to the requirement that \(\sum_{j=1}^p \beta_j^2\) not exceed the budget s.

The formulations (6.8) and (6.9) reveal a close connection between the lasso, ridge regression, and best subset selection. Consider the problem
\[\underset{\beta}{\text{minimize}} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p I(\beta_j \neq 0) \le s \quad (6.10)\]
Here I(βj ≠ 0) is an indicator variable: it takes on a value of 1 if βj ≠ 0, and equals zero otherwise. Then (6.10) amounts to finding a set of coefficient estimates such that RSS is as small as possible, subject to the constraint that no more than s coefficients can be nonzero. The problem (6.10) is equivalent to best subset selection. Unfortunately, solving (6.10) is computationally infeasible when p is large, since it requires considering all \(\binom{p}{s}\) models containing s predictors. Therefore, we can interpret ridge regression and the lasso as computationally feasible alternatives to best subset selection that replace the intractable form of the budget in (6.10) with forms that are much easier to solve. Of course, the lasso is much more closely related to best subset selection, since the lasso performs feature selection for s sufficiently small in (6.8), while ridge regression does not.
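For small p, problem (6.10) can be solved by brute force, which makes the combinatorial cost concrete. A toy sketch using only numpy and the standard library: we enumerate every subset of size s, fit least squares on each, and keep the smallest RSS; `math.comb(p, s)` counts the models examined, and that count is what explodes for large p.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(3)
n, p, s = 60, 6, 2
X = rng.normal(size=(n, p))
# only predictors 0 and 3 truly affect the response
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=n)

best_rss, best_subset = np.inf, None
for subset in itertools.combinations(range(p), s):   # all C(p, s) candidate models
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)    # least squares on this subset
    rss = np.sum((y - Xs @ beta) ** 2)
    if rss < best_rss:
        best_rss, best_subset = rss, subset

print("best subset of size", s, ":", best_subset)
print("models examined:", math.comb(p, s))
```

Here only 15 models are examined, but `math.comb(50, 10)` already exceeds ten billion, which is why the lasso's convex relaxation of the budget is so valuable.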
Sub-Chapters
The Variable Selection Property of the Lasso
We examine the mathematical mechanism by which the lasso drives coefficients exactly to zero to produce sparse models, visualizing the geometry through the diamond-shaped constraint region.
Comparing the Lasso and Ridge Regression
We lay out the general pattern that the lasso is likely to be relatively advantageous when only a few variables have a meaningful effect on the response, while ridge regression is likely to do better when many variables each contribute small, diffuse effects.
A Simple Special Case for Ridge Regression and the Lasso
Assuming a design matrix whose columns are perfectly mutually orthogonal, we show in simplified form how the two methods cut down the original OLS coefficient estimates—soft-thresholding for the lasso versus proportional shrinkage for ridge.