6.3.2 Partial Least Squares
The PCR approach that we just described involves identifying linear combinations, or directions, that best represent the predictors X_1, …, X_p. These directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions. That is, the response does not supervise the identification of the principal components. Consequently, PCR suffers from a drawback: there is no guarantee
[8] More details can be found in Section 3.5 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
FIGURE 6.21. For the advertising data, the first PLS direction (solid line) and first PCR direction (dotted line) are shown.
that the directions that best explain the predictors will also be the best directions to use for predicting the response. Unsupervised methods are discussed further in Chapter 12.
We now present partial least squares (PLS), a supervised alternative to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new set of features Z_1, …, Z_M that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. But unlike PCR, PLS identifies these new features in a supervised way—that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response. Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
We now describe how the first PLS direction is computed. After standardizing the p predictors, PLS computes the first direction Z_1 by setting each φ_{j1} in (6.16) equal to the coefficient from the simple linear regression of Y onto X_j. One can show that this coefficient is proportional to the correlation between Y and X_j. Hence, in computing $Z_1 = \sum_{j=1}^{p} \phi_{j1} X_j$, PLS places the highest weight on the variables that are most strongly related to the response.
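This first step can be sketched in a few lines of code. The sketch below assumes standardized predictors and a centered response; the data and variable names are illustrative, not from the text. With standardized X_j, the simple-regression coefficient of Y onto X_j reduces to the inner product X_j'Y / n, which is proportional to the correlation between Y and X_j:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.5, 0.0]) + rng.normal(size=n)

# Standardize the predictors (mean 0, variance 1) and center the response.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

# phi_{j1}: coefficient from the simple regression of y onto X_j.
# With standardized X_j this is X_j'y / n, proportional to cor(y, X_j).
phi1 = X.T @ y / n

# First PLS direction: Z_1 = sum_j phi_{j1} X_j.
Z1 = X @ phi1
```

Here the predictor most strongly correlated with the response receives the largest weight in `phi1`, exactly as the text describes.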
Figure 6.21 displays an example of PLS on a synthetic dataset with Sales in each of 100 regions as the response, and two predictors: Population Size and Advertising Spending. The solid green line indicates the first PLS direction, while the dotted line shows the first principal component direction. PLS has chosen a direction that has less change in the ad dimension per unit change in the pop dimension, relative to PCA. This suggests that pop is more highly correlated with the response than is ad. The PLS direction does not fit the predictors as closely as does PCA, but it does a better job explaining the response.
To identify the second PLS direction we first adjust each of the variables for Z_1, by regressing each variable on Z_1 and taking residuals. These residuals can be interpreted as the remaining information that has not been explained by the first PLS direction. We then compute Z_2 using this orthogonalized data in exactly the same fashion as Z_1 was computed based on the original data. This iterative approach can be repeated M times to identify multiple PLS components Z_1, …, Z_M. Finally, at the end of this procedure, we use least squares to fit a linear model to predict Y using Z_1, …, Z_M in exactly the same fashion as for PCR.
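The full iterative procedure described above can be sketched as follows. This is an illustrative implementation under the text's recipe (standardize, compute weights, orthogonalize, repeat, then fit least squares), not an optimized or production algorithm; the function name and data are assumptions:

```python
import numpy as np

def pls_directions(X, y, M):
    """Compute M PLS directions by repeatedly orthogonalizing the
    predictors against the directions found so far (a sketch)."""
    n, p = X.shape
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
    y = y - y.mean()                            # center response
    Xr = X.copy()                               # residualized predictors
    Z = np.empty((n, M))
    for m in range(M):
        phi = Xr.T @ y / n          # simple-regression coefficients
        Z[:, m] = Xr @ phi          # m-th PLS direction
        # Regress each predictor on Z_m and keep the residuals, i.e.
        # the information not yet explained by this direction.
        z = Z[:, m]
        gamma = Xr.T @ z / (z @ z)
        Xr = Xr - np.outer(z, gamma)
    return Z

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=200)

Z = pls_directions(X, y, M=2)
# Final step: fit Y on Z_1, ..., Z_M by least squares, as for PCR.
coef, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
```

Because Z_2 is built from predictors that have been residualized against Z_1, the resulting directions are mutually orthogonal, which is why the final least squares fit on Z_1, …, Z_M is straightforward.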
As with PCR, the number M of partial least squares directions used in PLS is a tuning parameter that is typically chosen by cross-validation. We generally standardize the predictors and response before performing PLS.
PLS is popular in the field of chemometrics, where many variables arise from digitized spectrometry signals. In practice it often performs no better than ridge regression or PCR. While the supervised dimension reduction of PLS can reduce bias, it also has the potential to increase variance, so that the overall benefit of PLS relative to PCR is a wash.