12.2.1 What Are Principal Components?

Suppose that we wish to visualize n observations with measurements on a set of p features, X 1 , X 2 , . . . , Xp , as part of an exploratory data analysis. We could do this by examining two-dimensional scatterplots of the data, each of which contains the n observations’ measurements on two of the features. However, there are $\binom{p}{2} = p(p-1)/2$ such scatterplots; for example, with p = 10 there are 45 plots! If p is large, then it will certainly not be possible to look at all of them; moreover, most likely none of them will be informative since they each contain just a small fraction of the total information present in the data set. Clearly, a better method is required to visualize the n observations when p is large. In particular, we would like to find a low-dimensional representation of the data that captures as much of the information as possible. For instance, if we can obtain a two-dimensional representation of the data that captures most of the information, then we can plot the observations in this low-dimensional space.

PCA provides a tool to do just this. It finds a low-dimensional representation of a data set that contains as much as possible of the variation. The idea is that each of the n observations lives in p -dimensional space, but not all of these dimensions are equally interesting. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension. Each of the dimensions found by PCA is a linear combination of the p features. We now explain the manner in which these dimensions, or principal components , are found.

The first principal component of a set of features X 1 , X 2 , . . . , Xp is the normalized linear combination of the features

\[Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p \quad (12.1)\]

that has the largest variance. By normalized, we mean that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$. We refer to the elements φ 11 , . . . , φp 1 as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\phi_1 = (\phi_{11}\ \phi_{21}\ \cdots\ \phi_{p1})^T$. We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

Given an n × p data set X , how do we compute the first principal component? Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero). We then look for the linear combination of the sample feature values of the form

\[z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \dots + \phi_{p1} x_{ip} \quad (12.2)\]


that has largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$. In other words, the first principal component loading vector solves the optimization problem

\[\begin{align*} \underset{\phi_{11}, \dots, \phi_{p1}}{\text{maximize}} & \left\{ \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \right\} \quad (12.3) \\ \text{subject to } & \sum_{j=1}^p \phi_{j1}^2 = 1 \end{align*}\]

From (12.2) we can write the objective in (12.3) as $\frac{1}{n} \sum_{i=1}^{n} z_{i1}^2$. Since $\frac{1}{n} \sum_{i=1}^{n} x_{ij} = 0$, the average of the $z_{11}, \dots, z_{n1}$ will be zero as well. Hence the objective that we are maximizing in (12.3) is just the sample variance of the n values of $z_{i1}$. We refer to $z_{11}, \dots, z_{n1}$ as the scores of the first principal component. Problem (12.3) can be solved via an eigen decomposition, a standard technique in linear algebra, but the details are outside of the scope of this book.[1]
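The eigen decomposition route can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data (the data matrix and its dimensions are assumptions, not from the text): the maximizer of (12.3) is the top eigenvector of the sample covariance matrix, and its eigenvalue is the maximized variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic illustration: n = 100 observations on p = 3 correlated features.
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
X = X - X.mean(axis=0)                # center each column, as the text assumes

# The maximizer of (12.3) is the top eigenvector of the sample covariance matrix.
S = (X.T @ X) / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)  # eigh: eigenvalues in ascending order
phi1 = eigvecs[:, -1]                 # first principal component loading vector
z1 = X @ phi1                         # scores z_{i1} from (12.2)

print(np.sum(phi1**2))                # the constraint: equals 1
print(np.var(z1))                     # maximized sample variance = top eigenvalue
```

Note that `np.var` divides by n, matching the 1/n factor in (12.3), so the variance of the scores equals the largest eigenvalue exactly.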

There is a nice geometric interpretation of the first principal component. The loading vector φ 1 with elements φ 11 , φ 21 , . . . , φp 1 defines a direction in feature space along which the data vary the most. If we project the n data points x 1 , . . . , xn onto this direction, the projected values are the principal component scores z 11 , . . . , zn 1 themselves. For instance, Figure 6.14 on page 254 displays the first principal component loading vector (green solid line) on an advertising data set. In these data, there are only two features, and so the observations as well as the first principal component loading vector can be easily displayed. As can be seen from (6.19), in that data set φ 11 = 0.839 and φ 21 = 0.544.

After the first principal component Z 1 of the features has been determined, we can find the second principal component Z 2. The second principal component is the linear combination of X 1 , . . . , Xp that has maximal variance out of all linear combinations that are uncorrelated with Z 1. The second principal component scores z 12 , z 22 , . . . , zn 2 take the form

\[z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \dots + \phi_{p2} x_{ip} \quad (12.4)\]

where φ 2 is the second principal component loading vector, with elements φ 12 , φ 22 , . . . , φp 2. It turns out that constraining Z 2 to be uncorrelated with Z 1 is equivalent to constraining the direction φ 2 to be orthogonal (perpendicular) to the direction φ 1. In the example in Figure 6.14, the observations lie in two-dimensional space (since p = 2), and so once we have found φ 1, there is only one possibility for φ 2, which is shown as a blue dashed line. (From Section 6.3.1, we know that φ 12 = 0.544 and φ 22 = 0.839.) But in a larger data set with p > 2 variables, there are multiple distinct principal components, and they are defined in a similar manner. To find φ 2, we solve a problem similar to (12.3) with φ 2 replacing φ 1, and with the additional constraint that φ 2 is orthogonal to φ 1.[2]
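The equivalence between uncorrelated scores and orthogonal directions can be checked numerically. In this sketch (again on synthetic, centered data), the first two eigenvectors of the sample covariance matrix serve as φ 1 and φ 2; both the dot product of the directions and the sample covariance of the scores come out as zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)                # centered data

S = (X.T @ X) / X.shape[0]            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
phi1 = eigvecs[:, -1]                 # first principal component direction
phi2 = eigvecs[:, -2]                 # second principal component direction

z1, z2 = X @ phi1, X @ phi2           # the two score vectors

print(abs(phi1 @ phi2))               # orthogonal directions: ~0
print(abs(np.mean(z1 * z2)))          # uncorrelated scores (both have mean zero): ~0
```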

1As an alternative to the eigen decomposition, a related technique called the singular value decomposition can be used. This will be explored in the lab at the end of this chapter.

2On a technical note, the principal component directions φ 1, φ 2, φ 3 , . . . are given by the ordered sequence of eigenvectors of the matrix $X^T X$, and the variances of the components are the eigenvalues. There are at most min( n − 1 , p ) principal components.



FIGURE 12.1. The first two principal components for the USArrests data. The blue state names represent the scores for the first two principal components. The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Rape on the first component is 0.54, and its loading on the second principal component is 0.17 (the word Rape is centered at the point (0.54, 0.17)). This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.

Once we have computed the principal components, we can plot them against each other in order to produce low-dimensional views of the data. For instance, we can plot the score vector Z 1 against Z 2, Z 1 against Z 3, Z 2 against Z 3, and so forth. Geometrically, this amounts to projecting the original data down onto the subspace spanned by φ 1, φ 2, and φ 3, and plotting the projected points.
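The projection step described above can be written compactly. The following sketch (on synthetic data) uses the singular value decomposition mentioned in footnote 1: the rows of Vt are the loading vectors, and multiplying X by the first q of them projects the observations onto the subspace spanned by φ 1, . . . , φq.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
X = X - X.mean(axis=0)                 # center the columns

# SVD of the centered X: rows of Vt are the loading vectors phi_1, phi_2, ...
U, s, Vt = np.linalg.svd(X, full_matrices=False)

q = 3                                  # keep the first three components
Z = X @ Vt[:q].T                       # n x q score matrix: projection onto phi_1..phi_q
# Plotting Z[:, 0] against Z[:, 1], Z[:, 0] against Z[:, 2], and so forth
# gives the low-dimensional views described above.

print(Z.shape)
```

Because the rows of Vt are orthonormal, the projected scores also equal U scaled column-wise by the singular values, which is how the SVD delivers all score vectors at once.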

We illustrate the use of PCA on the USArrests data set. For each of the 50 states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault , Murder , and Rape . We also record UrbanPop (the percent of the population in each state living in urban areas). The principal component score vectors have length n = 50, and the principal component loading vectors have length p = 4. PCA was performed after standardizing each variable to have mean zero and standard deviation one. Figure 12.1 plots the first two principal components of these data. The figure represents both the principal component scores and the loading vectors in a single biplot display. The loadings are also given in Table 12.1.

                 PC1         PC2
Murder     0.5358995  −0.4181809
Assault    0.5831836  −0.1879856
UrbanPop   0.2781909   0.8728062
Rape       0.5434321   0.1673186

TABLE 12.1. The principal component loading vectors, φ 1 and φ 2 , for the USArrests data. These are also displayed in Figure 12.1.

In Figure 12.1, we see that the first loading vector places approximately equal weight on Assault , Murder , and Rape , but with much less weight on UrbanPop . Hence this component roughly corresponds to a measure of overall rates of serious crimes. The second loading vector places most of its weight on UrbanPop and much less weight on the other three features. Hence, this component roughly corresponds to the level of urbanization of the state. Overall, we see that the crime-related variables ( Murder , Assault , and Rape ) are located close to each other, and that the UrbanPop variable is far from the other three. This indicates that the crime-related variables are correlated with each other—states with high murder rates tend to have high assault and rape rates—and that the UrbanPop variable is less correlated with the other three.

We can examine differences between the states via the two principal component score vectors shown in Figure 12.1. Our discussion of the loading vectors suggests that states with large positive scores on the first component, such as California, Nevada and Florida, have high crime rates, while states like North Dakota, with negative scores on the first component, have low crime rates. California also has a high score on the second component, indicating a high level of urbanization, while the opposite is true for states like Mississippi. States close to zero on both components, such as Indiana, have approximately average levels of both crime and urbanization.
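The standardize-then-decompose workflow used for USArrests can be sketched as follows. The real data set ships with R, so this example uses synthetic stand-in data of the same shape (50 observations on 4 variables); the means and scales below are illustrative assumptions, not the real values, so the resulting loadings will not match Table 12.1.

```python
import numpy as np

# Stand-in for USArrests: 50 "states" x 4 variables, with made-up means/scales.
rng = np.random.default_rng(2)
X = rng.normal(loc=[8.0, 170.0, 65.0, 21.0],
               scale=[4.0, 80.0, 15.0, 9.0], size=(50, 4))

# Standardize each variable: mean zero, standard deviation one.
X = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh((X.T @ X) / X.shape[0])
phi = eigvecs[:, ::-1]        # loading vectors, ordered by decreasing variance
Z = X @ phi                   # score vectors; Z[:, 0] vs Z[:, 1] mirrors Figure 12.1

print(Z.shape)                # at most min(n - 1, p) = 4 components here
```

The variance of each score vector equals the corresponding eigenvalue, which is what orders the components from first to last.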
