8.1.1 Regression Trees
In order to motivate regression trees, we begin with a simple example.
© Springer Nature Switzerland AG 2023. G. James et al., An Introduction to Statistical Learning, Springer Texts in Statistics, https://doi.org/10.1007/978-3-031-38747-0_8
8. Tree-Based Methods

FIGURE 8.1. For the Hitters data, a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year. At a given internal node, the label (of the form $X_j < t_k$) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to $X_j \geq t_k$. For instance, the split at the top of the tree results in two large branches. The left-hand branch corresponds to Years<4.5, and the right-hand branch corresponds to Years>=4.5. The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.
Predicting Baseball Players’ Salaries Using Regression Trees
We use the Hitters data set to predict a baseball player’s Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)
Figure 8.1 shows a regression tree fit to this data. It consists of a series of splitting rules, starting at the top of the tree. The top split assigns observations having Years<4.5 to the left branch.¹ The predicted salary for these players is given by the mean response value for the players in the data set with Years<4.5. For such players, the mean log salary is 5.107, and so we make a prediction of $e^{5.107}$ thousands of dollars, i.e. $165,174, for these players. Players with Years>=4.5 are assigned to the right branch, and then that group is further subdivided by Hits. Overall, the tree stratifies or segments the players into three regions of predictor space: players who have played for four or fewer years, players who have played for five or more years and who made fewer than 118 hits last year, and players who have played for five or more years and who made at least 118 hits last year. These three regions can be written as $R_1$ = {X | Years<4.5}, $R_2$ = {X | Years>=4.5, Hits<117.5}, and $R_3$ = {X | Years>=4.5, Hits>=117.5}. Figure 8.2 illustrates
¹Both Years and Hits are integers in these data; the function used to fit this tree labels the splits at the midpoint between two adjacent values.
8.1 The Basics of Decision Trees

FIGURE 8.2. The three-region partition for the Hitters data set from the regression tree illustrated in Figure 8.1.
the regions as a function of Years and Hits. The predicted salaries for these three groups are $\$1{,}000 \times e^{5.107} = \$165{,}174$, $\$1{,}000 \times e^{5.999} = \$402{,}834$, and $\$1{,}000 \times e^{6.740} = \$845{,}346$, respectively.
In keeping with the tree analogy, the regions $R_1$, $R_2$, and $R_3$ are known as terminal nodes or leaves of the tree. As is the case for Figure 8.1, decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree. The points along the tree where the predictor space is split are referred to as internal nodes. In Figure 8.1, the two internal nodes are indicated by the text Years<4.5 and Hits<117.5. We refer to the segments of the trees that connect the nodes as branches.
We might interpret the regression tree displayed in Figure 8.1 as follows: Years is the most important factor in determining Salary , and players with less experience earn lower salaries than more experienced players. Given that a player is less experienced, the number of hits that he made in the previous year seems to play little role in his salary. But among players who have been in the major leagues for five or more years, the number of hits made in the previous year does affect salary, and players who made more hits last year tend to have higher salaries. The regression tree shown in Figure 8.1 is likely an over-simplification of the true relationship between Hits , Years , and Salary . However, it has advantages over other types of regression models (such as those seen in Chapters 3 and 6): it is easier to interpret, and has a nice graphical representation.
Prediction via Stratification of the Feature Space
We now discuss the process of building a regression tree. Roughly speaking, there are two steps.
- We divide the predictor space — that is, the set of possible values for $X_1, X_2, \ldots, X_p$ — into $J$ distinct and non-overlapping regions, $R_1, R_2, \ldots, R_J$.
- For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.
For instance, suppose that in Step 1 we obtain two regions, $R_1$ and $R_2$, and that the response mean of the training observations in the first region is 10, while the response mean of the training observations in the second region is 20. Then for a given observation $X = x$, if $x \in R_1$ we will predict a value of 10, and if $x \in R_2$ we will predict a value of 20.
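These two steps can be sketched in a few lines of Python. This is only an illustrative sketch reproducing the two-region example above; the helper names region_means and predict are hypothetical, and Step 1 (finding the regions) is assumed to have been carried out already.

```python
import numpy as np

def region_means(regions, y):
    # Mean training response per region; regions[i] is the region index
    # of training observation i. These means are the fitted values of Step 2.
    y = np.asarray(y, dtype=float)
    return {r: float(y[regions == r].mean()) for r in np.unique(regions)}

def predict(region_of_x, means):
    # Predict for a test point by looking up the mean of its region.
    return means[region_of_x]

# Two regions, as in the example: responses in R1 average 10, in R2 average 20.
regions = np.array([1, 1, 1, 2, 2, 2])
y = np.array([8.0, 10.0, 12.0, 18.0, 20.0, 22.0])
means = region_means(regions, y)
print(predict(1, means), predict(2, means))  # 10.0 20.0
```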
We now elaborate on Step 1 above. How do we construct the regions $R_1, \ldots, R_J$? In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes $R_1, \ldots, R_J$ that minimize the RSS, given by
\[\sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \quad (8.1)\]

where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into $J$ boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting. The approach is top-down because it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
In order to perform recursive binary splitting, we first select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions $\{X \mid X_j < s\}$ and $\{X \mid X_j \geq s\}$ leads to the greatest possible reduction in RSS. (The notation $\{X \mid X_j < s\}$ means the region of predictor space in which $X_j$ takes on a value less than $s$.) That is, we consider all predictors $X_1, \ldots, X_p$, and all possible values of the cutpoint $s$ for each of the predictors, and then choose the predictor and cutpoint such that the resulting tree has the lowest RSS. In greater detail, for any $j$ and $s$, we define the pair of half-planes

\[R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \geq s\}, \quad (8.2)\]

and we seek the value of $j$ and $s$ that minimize the equation
\[\sum_{i: x_i \in R_1(j, s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j, s)} (y_i - \hat{y}_{R_2})^2 \quad (8.3)\]

where $\hat{y}_{R_1}$ is the mean response for the training observations in $R_1(j, s)$, and $\hat{y}_{R_2}$ is the mean response for the training observations in $R_2(j, s)$. Finding the values of $j$ and $s$ that minimize (8.3) can be done quite quickly, especially when the number of features $p$ is not too large.
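The search over $j$ and $s$ can be written out directly. The sketch below assumes numeric predictors stored in a NumPy array; the names rss and best_split are illustrative rather than taken from any library. Following the footnote about the Hitters splits, cutpoints are placed at midpoints between adjacent observed values.

```python
import numpy as np

def rss(y):
    # RSS of a region when it is predicted by its mean.
    return float(np.sum((y - y.mean()) ** 2))

def best_split(X, y):
    # Exhaustive search over predictors j and cutpoints s minimizing (8.3).
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:  # midpoints between adjacent values
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            total = rss(left) + rss(right)
            if total < best_rss:
                best_j, best_s, best_rss = j, s, total
    return best_j, best_s, best_rss

# Toy data: only the second predictor (column 1) separates the responses.
X = np.array([[1.0, 1.0], [2.0, 5.0], [3.0, 1.0], [4.0, 5.0]])
y = np.array([10.0, 20.0, 10.0, 20.0])
j, s, total = best_split(X, y)
print(j, s, total)  # 1 3.0 0.0
```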
Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within

FIGURE 8.3. Top Left: A partition of two-dimensional feature space that could not result from recursive binary splitting. Top Right: The output of recursive binary splitting on a two-dimensional example. Bottom Left: A tree corresponding to the partition in the top right panel. Bottom Right: A perspective plot of the prediction surface corresponding to that tree.
each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.
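Putting the pieces together, the full recursive procedure with the five-observation stopping rule can be sketched as below. This is an illustrative implementation on simulated data, not the code used to fit the trees in the figures; the tree is stored as nested dictionaries, with leaves holding the mean response of their region.

```python
import numpy as np

def grow(X, y, min_obs=5):
    # Recursively apply binary splitting; stop once a region contains
    # no more than min_obs observations. Leaves store the mean response.
    if len(y) <= min_obs:
        return {"pred": float(y.mean())}
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:  # midpoint cutpoints
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            total = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if total < best_rss:
                best_j, best_s, best_rss = j, s, total
    if best_j is None:  # all predictor values identical; cannot split further
        return {"pred": float(y.mean())}
    mask = X[:, best_j] < best_s
    return {"j": best_j, "s": float(best_s),
            "left": grow(X[mask], y[mask], min_obs),
            "right": grow(X[~mask], y[~mask], min_obs)}

def predict_one(tree, x):
    # Follow the splits down to a leaf, then return its stored mean.
    while "pred" not in tree:
        tree = tree["left"] if x[tree["j"]] < tree["s"] else tree["right"]
    return tree["pred"]

# Simulated data with a step in the first predictor at 5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 2))
y = np.where(X[:, 0] < 5, 10.0, 20.0) + rng.normal(0, 0.1, size=40)
tree = grow(X, y)
print(round(float(predict_one(tree, np.array([2.0, 3.0])))))  # 10
```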
Once the regions $R_1, \ldots, R_J$ have been created, we predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.
A five-region example of this approach is shown in Figure 8.3.
Tree Pruning
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. This is because the resulting tree might be too complex. A smaller tree
with fewer splits (that is, fewer regions $R_1, \ldots, R_J$) might lead to lower variance and better interpretation at the cost of a little bias. One possible alternative to the process described above is to build the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold. This strategy will result in smaller trees, but is too short-sighted since a seemingly worthless split early on in the tree might be followed by a very good split—that is, a split that leads to a large reduction in RSS later on.
Therefore, a better strategy is to grow a very large tree $T_0$, and then prune it back in order to obtain a subtree. How do we determine the best way to prune the tree? Intuitively, our goal is to select a subtree that leads to the lowest test error rate. Given a subtree, we can estimate its test error using cross-validation or the validation set approach. However, estimating the cross-validation error for every possible subtree would be too cumbersome, since there is an extremely large number of possible subtrees. Instead, we need a way to select a small set of subtrees for consideration.
Cost complexity pruning — also known as weakest link pruning — gives us a way to do just this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$. For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that
\[\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T| \quad (8.4)\]
is as small as possible. Here $|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the predicted response associated with $R_m$—that is, the mean of the training observations in $R_m$. The tuning parameter $\alpha$ controls a trade-off between the subtree's complexity and its fit to the training data. When $\alpha = 0$, then the subtree $T$ will simply equal $T_0$, because then (8.4) just measures the training error. However, as $\alpha$ increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for a smaller subtree. Equation 8.4 is reminiscent of the lasso (6.7) from Chapter 6, in which a similar formulation was used in order to control the complexity of a linear model.
It turns out that as we increase α from zero in (8.4), branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of α is easy. We can select a value of α using a validation set or using cross-validation. We then return to the full data set and obtain the subtree corresponding to α . This process is summarized in Algorithm 8.1.
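In scikit-learn, this nested sequence of subtrees is available directly: DecisionTreeRegressor.cost_complexity_pruning_path returns the increasing sequence of effective $\alpha$ values (called ccp_alpha there) together with the corresponding total leaf impurities. A brief sketch on simulated data, since the Hitters set is not loaded here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Simulated regression data with a step in the first predictor.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] < 5, 10.0, 20.0) + rng.normal(0, 1.0, size=200)

# Grow a large tree T0, then compute the pruning path: each ccp_alpha
# corresponds to one subtree in the nested sequence minimizing (8.4).
big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
path = big_tree.cost_complexity_pruning_path(X, y)

print(len(path.ccp_alphas))  # number of subtrees in the sequence
print(path.ccp_alphas[0])    # 0.0: alpha = 0 recovers the full tree
```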
Figures 8.4 and 8.5 display the results of fitting and pruning a regression tree on the Hitters data, using nine of the features. First, we randomly divided the data set in half, yielding 132 observations in the training set and 131 observations in the test set. We then built a large regression tree on the training data and varied $\alpha$ in (8.4) in order to create subtrees with different numbers of terminal nodes. Finally, we performed six-fold cross-validation in order to estimate the cross-validated MSE of the trees as
Algorithm 8.1 Building a Regression Tree
1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.

2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of $\alpha$.

3. Use $K$-fold cross-validation to choose $\alpha$. That is, divide the training observations into $K$ folds. For each $k = 1, \ldots, K$:

   (a) Repeat Steps 1 and 2 on all but the $k$th fold of the training data.

   (b) Evaluate the mean squared prediction error on the data in the left-out $k$th fold, as a function of $\alpha$.

   Average the results for each value of $\alpha$, and pick $\alpha$ to minimize the average error.

4. Return the subtree from Step 2 that corresponds to the chosen value of $\alpha$.
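Algorithm 8.1 can be approximated with scikit-learn's GridSearchCV. One simplification relative to Step 3(a): the grid of candidate $\alpha$ values below is computed once from a tree grown on all of the training data, rather than recomputed within each fold. The data are synthetic, not the Hitters set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Simulated training data standing in for the Hitters training set.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] < 5, 10.0, 20.0) + rng.normal(0, 1.0, size=200)

# Step 2: candidate alpha values from the pruning path of a tree grown on
# the full training data (clipped to guard against tiny negative values
# arising from floating-point error).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)

# Steps 3-4: K-fold CV over alpha; GridSearchCV then refits on the full
# training set with the chosen value, as in the final step of the algorithm.
cv = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"ccp_alpha": alphas},
    cv=KFold(n_splits=6, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
cv.fit(X, y)
print(cv.best_params_["ccp_alpha"], cv.best_estimator_.get_n_leaves())
```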
a function of $\alpha$. (We chose to perform six-fold cross-validation because 132 is an exact multiple of six.) The unpruned regression tree is shown in Figure 8.4. The green curve in Figure 8.5 shows the CV error as a function of the number of leaves,² while the orange curve indicates the test error. Also shown are standard error bars around the estimated errors. For reference, the training error curve is shown in black. The CV error is a reasonable approximation of the test error: the CV error takes on its minimum for a three-node tree, while the test error also dips down at the three-node tree (though it takes on its lowest value at the ten-node tree). The pruned tree containing three terminal nodes is shown in Figure 8.1.