2.2.3 The Classification Setting

Thus far, our discussion of model accuracy has focused on the regression setting. But many of the concepts that we have encountered, such as the bias-variance trade-off, transfer over to the classification setting with only some modifications due to the fact that $y_i$ is no longer quantitative. Suppose that we seek to estimate $f$ on the basis of training observations $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where now $y_1, \ldots, y_n$ are qualitative. The most common approach for quantifying the accuracy of our estimate $\hat{f}$ is the training error rate, the proportion of mistakes that are made if we apply our estimate $\hat{f}$ to the training observations:

\[\frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat{y}_i) \tag{2.8}\]

Here $\hat{y}_i$ is the predicted class label for the $i$th observation using $\hat{f}$, and $I(y_i \neq \hat{y}_i)$ is an indicator variable that equals 1 if $y_i \neq \hat{y}_i$ and zero if $y_i = \hat{y}_i$. If $I(y_i \neq \hat{y}_i) = 0$ then the $i$th observation was classified correctly by our classification method; otherwise it was misclassified. Hence Equation 2.8 computes the fraction of incorrect classifications.
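Equation 2.8 is easy to compute directly. The following sketch evaluates it on a small hypothetical set of labels (the labels themselves are illustrative, not from the text):

```python
import numpy as np

# Hypothetical labels: y_true are the observed classes, y_hat are the
# classifier's fitted labels on the same training observations.
y_true = np.array(["blue", "blue", "orange", "orange", "blue"])
y_hat = np.array(["blue", "orange", "orange", "orange", "blue"])

# Equation (2.8): the average of the indicator I(y_i != yhat_i),
# i.e. the proportion of training observations that are misclassified.
train_error = np.mean(y_true != y_hat)
print(train_error)  # 0.2 (one mistake out of five)
```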

Equation 2.8 is referred to as the training error rate because it is computed based on the data that was used to train our classifier. As in the regression setting, we are most interested in the error rates that result from applying our classifier to test observations that were not used in training. The test error rate associated with a set of test observations of the form $(x_0, y_0)$ is given by

\[\text{Ave}(I(y_0 \neq \hat{y}_0)) \tag{2.9}\]

where $\hat{y}_0$ is the predicted class label that results from applying the classifier to the test observation with predictor $x_0$. A good classifier is one for which the test error (2.9) is smallest.
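The test error rate (2.9) has the same form as (2.8), but the average runs over observations the classifier never saw during training. A minimal sketch, using a simulated test set and an assumed thresholding classifier (both hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated test data: one predictor, two classes; the class is more
# likely to be 1 when x > 0.  The classifier (fitted elsewhere on
# training data) simply thresholds the predictor at zero.
x_test = rng.normal(size=1000)
y_test = (rng.random(1000) < np.where(x_test > 0, 0.8, 0.2)).astype(int)

y_hat = (x_test > 0).astype(int)

# Equation (2.9): average the indicator I(y0 != yhat0) over test points.
test_error = np.mean(y_test != y_hat)
print(test_error)  # close to 0.2, the misclassification rate of this rule
```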

The Bayes Classifier

It is possible to show (though the proof is outside of the scope of this book) that the test error rate given in (2.9) is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector $x_0$ to the class $j$ for which

\[Pr(Y = j|X = x_0) \tag{2.10}\]
is largest. Note that (2.10) is a conditional probability: it is the probability that $Y = j$, given the observed predictor vector $x_0$. This very simple classifier is called the Bayes classifier. In a two-class problem where there are only two possible response values, say class 1 or class 2, the Bayes classifier corresponds to predicting class one if $\Pr(Y = 1 \mid X = x_0) > 0.5$, and class two otherwise.
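In the two-class case the Bayes decision rule reduces to a single comparison. A sketch of that rule, assuming the conditional probability $\Pr(Y = 1 \mid X = x_0)$ is known exactly (which it rarely is in practice):

```python
# The Bayes classifier's decision rule for a two-class problem:
# assign class 1 when Pr(Y = 1 | X = x0) > 0.5, else class 2.
def bayes_classify(p_class1: float) -> int:
    return 1 if p_class1 > 0.5 else 2

print(bayes_classify(0.7))  # 1
print(bayes_classify(0.3))  # 2
```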
Figure 2.13 provides an example using a simulated data set in a two-dimensional space consisting of predictors $X_1$ and $X_2$. The orange and blue circles correspond to training observations that belong to two different classes. For each value of $X_1$ and $X_2$, there is a different probability of the response being orange or blue. Since this is simulated data, we know how the data were generated and we can calculate the conditional probabilities for each value of $X_1$ and $X_2$. The orange shaded region reflects the set of points for which $\Pr(Y = \text{orange} \mid X)$ is greater than 50%, while the blue shaded region indicates the set of points for which the probability is below 50%. The purple dashed line represents the points where the probability is exactly 50%. This is called the Bayes decision boundary. The Bayes classifier's prediction is determined by the Bayes decision boundary; an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.

FIGURE 2.13. A simulated data set consisting of 100 observations in each of two groups, indicated in blue and in orange. The purple dashed line represents the Bayes decision boundary. The orange background grid indicates the region in which a test observation will be assigned to the orange class, and the blue background grid indicates the region in which a test observation will be assigned to the blue class.

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. Since the Bayes classifier will always choose the class for which (2.10) is largest, the error rate will be $1 - \max_j \Pr(Y = j \mid X = x_0)$ at $X = x_0$. In general, the overall Bayes error rate is given by
\[1 - E(\max_j Pr(Y=j|X)) \tag{2.11}\]
where the expectation averages the probability over all possible values of $X$. For our simulated data, the Bayes error rate is 0.133. It is greater than zero, because the classes overlap in the true population, which implies that $\max_j \Pr(Y = j \mid X = x_0) < 1$ for some values of $x_0$. The Bayes error rate is analogous to the irreducible error, discussed earlier.
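When the data-generating process is known, expression (2.11) can be approximated by Monte Carlo: draw many values of $X$ and average $1 - \max_j \Pr(Y = j \mid X)$. A sketch under an assumed two-class model (the logistic conditional probability here is hypothetical, not the one used for Figure 2.13):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed two-class setup: X ~ N(0, 1) and Pr(Y = orange | X = x)
# follows a logistic curve, so the classes overlap and the conditional
# probability never reaches 0 or 1.
x = rng.normal(size=100_000)
p_orange = 1.0 / (1.0 + np.exp(-2.0 * x))

# Equation (2.11): average 1 - max_j Pr(Y = j | X) over draws of X.
p_max = np.maximum(p_orange, 1.0 - p_orange)
bayes_error = np.mean(1.0 - p_max)
print(bayes_error)  # strictly positive, because the classes overlap
```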

Sub-Chapters

K-Nearest Neighbors

In the nonparametric classification setting, we study the K-nearest neighbors (KNN) method, the most intuitive algorithm that serves as a practical implementation of this theory. We examine how the decision boundary changes as the value of K varies, and how the bias-variance trade-off emerges in the process.
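The core of KNN is simple enough to sketch in a few lines: classify a test point by majority vote among the K closest training observations. A minimal NumPy-only version (the data and function names are illustrative, not from the text):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=3):
    """Classify x0 by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                # indices of k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # majority class

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["blue", "blue", "orange", "orange"])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # blue
```

Small K yields a flexible, jagged decision boundary (low bias, high variance); large K smooths the boundary (higher bias, lower variance), which is exactly the bias-variance trade-off described above.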
