7.5.1 An Overview of Smoothing Splines
In fitting a smooth curve to a set of data, what we really want to do is find some function, say $g(x)$, that fits the observed data well: that is, we want $\mathrm{RSS} = \sum_{i=1}^n (y_i - g(x_i))^2$ to be small. However, there is a problem with this approach. If we don't put any constraints on $g(x_i)$, then we can always make RSS zero simply by choosing $g$ such that it interpolates all of the $y_i$. Such a function would woefully overfit the data; it would be far too flexible. What we really want is a function $g$ that makes RSS small, but that is also smooth.
How might we ensure that g is smooth? There are a number of ways to do this. A natural approach is to find the function g that minimizes
\[\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt \quad (7.11)\]
where $\lambda$ is a nonnegative tuning parameter. The function $g$ that minimizes (7.11) is known as a smoothing spline.
What does (7.11) mean? Equation 7.11 takes the "Loss+Penalty" formulation that we encounter in the context of ridge regression and the lasso in Chapter 6. The term $\sum_{i=1}^n (y_i - g(x_i))^2$ is a loss function that encourages $g$ to fit the data well, and the term $\lambda \int g''(t)^2 \, dt$ is a penalty term that penalizes the variability in $g$. The notation $g''(t)$ indicates the second derivative of the function $g$. The first derivative $g'(t)$ measures the slope
of a function at $t$, and the second derivative corresponds to the amount by which the slope is changing. Hence, broadly speaking, the second derivative of a function is a measure of its roughness: it is large in absolute value if $g(t)$ is very wiggly near $t$, and it is close to zero otherwise. (The second derivative of a straight line is zero; note that a line is perfectly smooth.) The $\int$ notation is an integral, which we can think of as a summation over the range of $t$. In other words, $\int g''(t)^2 \, dt$ is simply a measure of the total change in the function $g'(t)$, over its entire range. If $g$ is very smooth, then $g'(t)$ will be close to constant and $\int g''(t)^2 \, dt$ will take on a small value. Conversely, if $g$ is jumpy and variable, then $g'(t)$ will vary significantly and $\int g''(t)^2 \, dt$ will take on a large value. Therefore, in (7.11), $\lambda \int g''(t)^2 \, dt$ encourages $g$ to be smooth. The larger the value of $\lambda$, the smoother $g$ will be.

When $\lambda = 0$, the penalty term in (7.11) has no effect, and so the function $g$ will be very jumpy and will exactly interpolate the training observations. When $\lambda \to \infty$, $g$ will be perfectly smooth: it will just be a straight line that passes as closely as possible to the training points. In fact, in this case, $g$ will be the linear least squares line, since the loss function in (7.11) amounts to minimizing the residual sum of squares. For an intermediate value of $\lambda$, $g$ will approximate the training observations but will be somewhat smooth. We see that $\lambda$ controls the bias-variance trade-off of the smoothing spline.
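To make the criterion (7.11) concrete, here is a minimal numerical sketch in Python. The helper `criterion` and the two candidate functions `line` and `wiggle` are illustrative names, not from the text; the penalty integral is approximated by a finite-difference second derivative on a fine grid.

```python
import numpy as np

def criterion(g, x, y, lam, n_grid=2000):
    """Approximate RSS + lam * integral of g''(t)^2 dt for a candidate function g."""
    rss = np.sum((y - g(x)) ** 2)
    grid = np.linspace(x.min(), x.max(), n_grid)
    # Finite-difference second derivative of g on the grid.
    g2 = np.gradient(np.gradient(g(grid), grid), grid)
    # Trapezoidal approximation of the roughness penalty integral.
    penalty = np.sum(0.5 * (g2[:-1] ** 2 + g2[1:] ** 2) * np.diff(grid))
    return rss + lam * penalty

x = np.linspace(0.0, 1.0, 20)
y = x + 0.1 * np.sin(20 * x)                  # data generated by a wiggly function

line = lambda t: t                            # smooth: zero penalty, positive RSS
wiggle = lambda t: t + 0.1 * np.sin(20 * t)   # interpolates: zero RSS, large penalty
```

With $\lambda = 0$ the wiggly interpolant wins, since its RSS is zero; but as $\lambda$ grows, the roughness penalty, which is essentially zero for the straight line, makes the line the better candidate.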
The function $g(x)$ that minimizes (7.11) can be shown to have some special properties: it is a piecewise cubic polynomial with knots at the unique values of $x_1, \ldots, x_n$, and it has continuous first and second derivatives at each knot. Furthermore, it is linear in the region outside of the extreme knots. In other words, the function $g(x)$ that minimizes (7.11) is a natural cubic spline with knots at $x_1, \ldots, x_n$! However, it is not the same natural cubic spline that one would get if one applied the basis function approach described in Section 7.4.3 with knots at $x_1, \ldots, x_n$; rather, it is a shrunken version of such a natural cubic spline, where the value of the tuning parameter $\lambda$ in (7.11) controls the level of shrinkage.
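The two limiting behaviors of $\lambda$ can be sketched in code. The sketch below assumes SciPy 1.10 or later, whose `scipy.interpolate.make_smoothing_spline` minimizes exactly this penalized criterion with penalty weight `lam`; the data and the specific `lam` values are illustrative.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Tiny lambda: the penalty is negligible, so the spline nearly interpolates.
wiggly = make_smoothing_spline(x, y, lam=1e-8)
# Huge lambda: the penalty dominates, so the fit approaches the least squares
# line, with its second derivative driven toward zero everywhere.
smooth = make_smoothing_spline(x, y, lam=1e6)

rss_wiggly = np.sum((y - wiggly(x)) ** 2)
rss_smooth = np.sum((y - smooth(x)) ** 2)
```

The fitted object is a `BSpline` with knots at the observed $x$ values; evaluating `smooth.derivative(2)` confirms that the heavily penalized fit has essentially zero curvature, consistent with the straight-line limit described above.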