4.4.4 Naive Bayes
In previous sections, we used Bayes’ theorem (4.15) to develop the LDA and QDA classifiers. Here, we use Bayes’ theorem to motivate the popular naive Bayes classifier.
Recall that Bayes' theorem (4.15) provides an expression for the posterior probability pk(x) = Pr(Y = k | X = x) in terms of π1, . . . , πK and f1(x), . . . , fK(x). To use (4.15) in practice, we need estimates for π1, . . . , πK and f1(x), . . . , fK(x). As we saw in previous sections, estimating the prior probabilities π1, . . . , πK is typically straightforward: for instance, we can estimate π̂k as the proportion of training observations belonging to the kth class, for k = 1, . . . , K.
However, estimating f1(x), . . . , fK(x) is more subtle. Recall that fk(x) is the p-dimensional density function for an observation in the kth class, for k = 1, . . . , K. In general, estimating a p-dimensional density function is challenging. In LDA, we make a very strong assumption that greatly simplifies the task: we assume that fk is the density function for a multivariate normal random variable with class-specific mean µk, and shared covariance matrix Σ. By contrast, in QDA, we assume that fk is the density function for a multivariate normal random variable with class-specific mean µk, and class-specific covariance matrix Σk. By making these very strong assumptions, we are able to replace the very challenging problem of estimating K p-dimensional density functions with the much simpler problem of estimating K p-dimensional mean vectors and one (in the case of LDA) or K (in the case of QDA) (p × p)-dimensional covariance matrices.
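To make this parameter-counting argument concrete, the sketch below tallies the number of mean and covariance parameters estimated by LDA and QDA (priors excluded); the values of K and p in the usage lines are hypothetical:

```python
def n_params_lda(K, p):
    # K class means (p values each) plus one shared symmetric covariance
    # matrix, which has p * (p + 1) / 2 free entries
    return K * p + p * (p + 1) // 2

def n_params_qda(K, p):
    # K class means plus K class-specific symmetric covariance matrices
    return K * p + K * p * (p + 1) // 2

# For, say, K = 2 classes and p = 50 predictors:
print(n_params_lda(2, 50))  # 1375
print(n_params_qda(2, 50))  # 2650
```

The gap between the two counts grows with both K and p, which is why QDA demands much more data than LDA when p is large.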
The naive Bayes classifier takes a different tack for estimating f 1( x ) , . . . , fK ( x ). Instead of assuming that these functions belong to a particular family of distributions (e.g. multivariate normal), we instead make a single assumption:
Within the kth class, the p predictors are independent.
Stated mathematically, this assumption means that for k = 1, . . . , K,

fk(x) = fk1(x1) × fk2(x2) × · · · × fkp(xp),   (4.29)

where fkj is the density function of the jth predictor among observations in the kth class.
Why is this assumption so powerful? Essentially, estimating a p-dimensional density function is challenging because we must consider not only the marginal distribution of each predictor (that is, the distribution of each predictor on its own) but also the joint distribution of the predictors (that is, the association between the different predictors). In the case of a multivariate normal distribution, the association between the different predictors is summarized by the off-diagonal elements of the covariance matrix. However, in general, this association can be very hard to characterize, and exceedingly challenging to estimate. But by assuming that the p covariates are independent within each class, we completely eliminate the need to worry about the association between the p predictors, because we have simply assumed that there is no association between the predictors!
Do we really believe the naive Bayes assumption that the p covariates are independent within each class? In most settings, we do not. But even though this modeling assumption is made for convenience, it often leads to pretty decent results, especially in settings where n is not large enough relative to p for us to effectively estimate the joint distribution of the predictors within each class. In fact, since estimating a joint distribution requires such a huge amount of data, naive Bayes is a good choice in a wide range of settings. Essentially, the naive Bayes assumption introduces some bias, but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off.
Once we have made the naive Bayes assumption, we can plug (4.29) into (4.15) to obtain an expression for the posterior probability,

Pr(Y = k | X = x) = ( πk × fk1(x1) × fk2(x2) × · · · × fkp(xp) ) / ( Σ_{l=1}^{K} πl × fl1(x1) × fl2(x2) × · · · × flp(xp) ),   (4.30)

for k = 1, . . . , K.
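To see (4.30) in action, here is a minimal from-scratch sketch that evaluates the naive Bayes posterior when each per-feature density is Gaussian; the class priors, means, and standard deviations below are hypothetical, chosen only for illustration:

```python
import math

# Hypothetical parameters for K = 2 classes and p = 2 quantitative predictors
priors = [0.5, 0.5]                  # pi_k
means = [[0.0, 1.0], [2.0, -1.0]]    # means[k][j] = mu_jk
sds = [[1.0, 0.5], [1.5, 1.0]]       # sds[k][j] = sigma_jk

def normal_pdf(x, mu, sigma):
    # Univariate Gaussian density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x):
    # Numerator of (4.30): pi_k times the product of the per-feature densities
    nums = []
    for k in range(len(priors)):
        prod = priors[k]
        for j, xj in enumerate(x):
            prod *= normal_pdf(xj, means[k][j], sds[k][j])
        nums.append(prod)
    total = sum(nums)                # denominator sums over the K classes
    return [v / total for v in nums]

# An observation near class 1's means gets a high class-1 posterior
print(posterior([0.1, 0.9]))
```

Note that the product over j in the numerator is exactly the factorization (4.29); no covariance terms appear anywhere.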
To estimate the one-dimensional density function fkj using training data x1j, . . . , xnj, we have a few options.
- If Xj is quantitative, then we can assume that Xj | Y = k ∼ N(µjk, σ²jk). In other words, we assume that within each class, the jth predictor is drawn from a (univariate) normal distribution. While this may sound a bit like QDA, there is one key difference, in that here we are assuming that the predictors are independent; this amounts to QDA with an additional assumption that the class-specific covariance matrix is diagonal.

- If Xj is quantitative, then another option is to use a non-parametric estimate for fkj. A very simple way to do this is by making a histogram for the observations of the jth predictor within each class. Then we can estimate fkj(xj) as the fraction of the training observations in the kth class that belong to the same histogram bin as xj. Alternatively, we can use a kernel density estimator, which is essentially a smoothed version of a histogram.

- If Xj is qualitative, then we can simply count the proportion of training observations for the jth predictor corresponding to each class. For instance, suppose that Xj ∈ {1, 2, 3}, and we have 100 observations in the kth class. Suppose that the jth predictor takes on values of 1, 2, and 3 in 32, 55, and 13 of those observations, respectively. Then we can estimate fkj as

  f̂kj(xj) = 0.32 if xj = 1, 0.55 if xj = 2, and 0.13 if xj = 3.
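Each of the three estimation strategies above can be sketched in a few lines. The data below are simulated stand-ins for the class-k observations of a single predictor, not the book's toy example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)  # quantitative predictor j, class k

# Option 1: Gaussian assumption -- estimate mu_jk and sigma_jk from the data
mu, sigma = x.mean(), x.std()
def f_gauss(x0):
    return np.exp(-0.5 * ((x0 - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Option 2: histogram estimate -- density of the bin containing x0
# (density=True normalizes the bin counts so the histogram integrates to one)
counts, edges = np.histogram(x, bins=20, density=True)
def f_hist(x0):
    i = np.clip(np.searchsorted(edges, x0, side="right") - 1, 0, len(counts) - 1)
    return counts[i]

# Option 3: qualitative predictor -- class-specific category proportions
xq = rng.choice([1, 2, 3], p=[0.32, 0.55, 0.13], size=100)
f_cat = {level: np.mean(xq == level) for level in (1, 2, 3)}
```

A kernel density estimate would replace `f_hist` with a smoothed version of the same idea, for example via `scipy.stats.gaussian_kde`.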
We now consider the naive Bayes classifier in a toy example with p = 3 predictors and K = 2 classes. The first two predictors are quantitative, and the third predictor is qualitative with three levels. Suppose further that π̂1 = π̂2 = 0.5. The estimated density functions f̂kj for k = 1, 2 and j = 1, 2, 3 are displayed in Figure 4.10. Now suppose that we wish to classify a new observation, x* = (0.4, 1.5, 1)^T. It turns out that in this example, f̂11(0.4) = 0.368, f̂12(1.5) = 0.484, f̂13(1) = 0.226, and f̂21(0.4) = 0.030, f̂22(1.5) = 0.130, f̂23(1) = 0.616. Plugging these estimates into (4.30) results in posterior probability estimates of Pr(Y = 1 | X = x*) = 0.944 and Pr(Y = 2 | X = x*) = 0.056.

FIGURE 4.10. In the toy example in Section 4.4.4, we generate data with p = 3 predictors and K = 2 classes. The first two predictors are quantitative, and the third predictor is qualitative with three levels. In each class, the estimated density for each of the three predictors is displayed. If the prior probabilities for the two classes are equal, then the observation x* = (0.4, 1.5, 1)^T has a 94.4% posterior probability of belonging to the first class.

|                          | True No | True Yes | Total |
|--------------------------|---------|----------|-------|
| Predicted No             | 9621    | 244      | 9865  |
| Predicted Yes            | 46      | 89       | 135   |
| Total                    | 9667    | 333      | 10000 |

TABLE 4.8. Comparison of the naive Bayes predictions to the true default status for the 10,000 training observations in the Default data set, when we predict default for any observation for which P(Y = default | X = x) > 0.5.
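We can verify the 94.4% figure directly by plugging the six density estimates and the equal priors into (4.30):

```python
prior = 0.5                       # pi_1 = pi_2 = 0.5
f1 = [0.368, 0.484, 0.226]        # fhat_1j(x*_j) for j = 1, 2, 3
f2 = [0.030, 0.130, 0.616]        # fhat_2j(x*_j) for j = 1, 2, 3

# Numerators of (4.30) for each class
num1 = prior * f1[0] * f1[1] * f1[2]
num2 = prior * f2[0] * f2[1] * f2[2]

post1 = num1 / (num1 + num2)      # Pr(Y = 1 | X = x*)
print(round(post1, 3))  # 0.944
```

Since the priors are equal, they cancel in the ratio; the classification is driven entirely by the product of the per-predictor density estimates.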
Table 4.8 provides the confusion matrix resulting from applying the naive Bayes classifier to the Default data set, where we predict a default if the posterior probability of a default, that is, P(Y = default | X = x), exceeds 0.5. Comparing this to the results for LDA in Table 4.4, our findings are mixed. While LDA has a slightly lower overall error rate, naive Bayes correctly predicts a higher fraction of the true defaulters. In this implementation of naive Bayes, we have assumed that each quantitative predictor is drawn from a Gaussian distribution (and, of course, that within each class, each predictor is independent).

|                          | True No | True Yes | Total |
|--------------------------|---------|----------|-------|
| Predicted No             | 9339    | 130      | 9469  |
| Predicted Yes            | 328     | 203      | 531   |
| Total                    | 9667    | 333      | 10000 |

TABLE 4.9. Comparison of the naive Bayes predictions to the true default status for the 10,000 training observations in the Default data set, when we predict default for any observation for which P(Y = default | X = x) > 0.2.
Just as with LDA, we can easily adjust the probability threshold for predicting a default. For example, Table 4.9 provides the confusion matrix resulting from predicting a default if P(Y = default | X = x) > 0.2. Again, the results are mixed relative to LDA with the same threshold (Table 4.5). Naive Bayes has a higher error rate, but correctly predicts almost two-thirds of the true defaults.
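This analysis can be sketched with scikit-learn's GaussianNB. Since the Default data set is not reproduced here, the code below generates a synthetic stand-in with a similar size and class imbalance, so its confusion matrices will not match Tables 4.8 and 4.9:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for Default: n = 10,000 observations, two quantitative
# predictors, roughly 3% positives (defaulters)
rng = np.random.default_rng(0)
n = 10_000
y = (rng.random(n) < 0.03).astype(int)
X = np.where(y[:, None] == 1,
             rng.normal(loc=[1.8, 0.6], scale=0.5, size=(n, 2)),
             rng.normal(loc=[0.8, 0.4], scale=0.5, size=(n, 2)))

nb = GaussianNB().fit(X, y)

# Default threshold of 0.5, as in the Table 4.8 analysis
print(confusion_matrix(y, nb.predict(X)))

# Lowered threshold of 0.2, as in the Table 4.9 analysis: flags more
# observations as defaulters, trading false positives for fewer misses
proba = nb.predict_proba(X)[:, 1]   # estimated P(Y = default | X = x)
print(confusion_matrix(y, (proba > 0.2).astype(int)))
```

Lowering the threshold can only increase the number of predicted defaults, which is exactly the pattern seen in moving from Table 4.8 to Table 4.9.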
In this example, it should not be too surprising that naive Bayes does not convincingly outperform LDA: this data set has n = 10 , 000 and p = 2, and so the reduction in variance resulting from the naive Bayes assumption is not necessarily worthwhile. We expect to see a greater pay-off to using naive Bayes relative to LDA or QDA in instances where p is larger or n is smaller, so that reducing the variance is very important.