11.3 The Kaplan–Meier Survival Curve

The survival curve , or survival function , is defined as

\[S(t) = \Pr(T > t)\]

survival curve survival function

This decreasing function quantifies the probability of surviving past time t . For example, suppose that a company is interested in modeling customer churn. Let T represent the time that a customer cancels a subscription to the company’s service. Then S ( t ) represents the probability that a customer cancels later than time t . The larger the value of S ( t ), the less likely that the customer will cancel before time t .

In this section, we will consider the task of estimating the survival curve. Our investigation is motivated by the BrainCancer dataset, which contains the survival times for patients with primary brain tumors undergoing treatment with stereotactic radiation methods.[2] The predictors are gtv (gross tumor volume, in cubic centimeters); sex (male or female); diagnosis (meningioma, LG glioma, HG glioma, or other); loc (the tumor location: either infratentorial or supratentorial); ki (Karnofsky index); and stereo (stereotactic method: either stereotactic radiosurgery or fractionated stereotactic radiotherapy, abbreviated as SRS and SRT, respectively). Only 53 of the 88 patients were still alive at the end of the study.

Now, we consider the task of estimating the survival curve (11.2) for these data. To estimate S (20) = Pr( T > 20), the probability that a patient survives for at least t = 20 months, it is tempting to simply compute the proportion of patients who are known to have survived past 20 months, i.e. the proportion of patients for whom Y > 20. This turns out to be 48 / 88, or approximately 55%. However, this does not seem quite right, since Y and T represent different quantities. In particular, 17 of the 40 patients who did not survive to 20 months were actually censored, and this analysis implicitly assumes that T < 20 for all of those censored patients; of course, we do not know whether that is true.

Alternatively, to estimate S (20), we could consider computing the proportion of patients for whom Y > 20, out of the 71 patients who were not censored by time t = 20; this comes out to 48 / 71, or approximately 68%. However, this is not quite right either, since it amounts to completely ignoring the patients who were censored before time t = 20, even though the time at which they are censored is potentially informative. For instance, a patient who was censored at time t = 19 . 9 likely would have survived past t = 20 had he or she not been censored.

We have seen that estimating S ( t ) is complicated by the presence of censoring. We now present an approach to overcome these challenges. We let d 1 < d 2 < · · · < dK denote the K unique death times among the noncensored patients, and we let qk denote the number of patients who died at time dk . For k = 1 , . . . , K , we let rk denote the number of patients alive

2This dataset is described in the following paper: Selingerová et al. (2016) Survival of patients with primary brain tumors: Comparison of two statistical approaches. PLoS One, 11(2):e0148733.

11.3 The Kaplan–Meier Survival Curve 473

and in the study just before dk ; these are the at risk patients. The set of patients that are at risk at a given time are referred to as the risk set . By the law of total probability,[3]

\[\Pr(T > d_k) = \Pr(T > d_k \mid T > d_{k-1}) \Pr(T > d_{k-1}) + \Pr(T > d_k \mid T \le d_{k-1}) \Pr(T \le d_{k-1}) \quad (11.2)\]

risk set

The fact that dk− 1 < dk implies that Pr( _T > dk T ≤ dk−_ 1) = 0 (it is impossible for a patient to survive past time dk if he or she did not survive until an earlier time dk− 1). Therefore,
\[\Pr(T > d_k \mid T \le d_{k-1}) = 0\]

Plugging in (11.2) again, we see that

\[\Pr(T > d_k) = \Pr(T > d_k \mid T > d_{k-1}) \Pr(T > d_{k-1})\]

This implies that

\[S(d_k) = \Pr(T > d_k \mid T > d_{k-1}) \dots \Pr(T > d_2 \mid T > d_1) \Pr(T > d_1) \quad (11.3)\]

We now must simply plug in estimates of each of the terms on the righthand side of the previous equation. It is natural to use the estimator

\[\hat{\Pr}(T > d_j \mid T > d_{j-1}) = \frac{n_j - q_j}{n_j} = 1 - \frac{q_j}{n_j}\]

which is the fraction of the risk set at time dj who survived past time dj . This leads to the Kaplan–Meier estimator of the survival curve:

\[\hat{S}(d_k) = \prod_{j=1}^k \left( \frac{n_j - q_j}{n_j} \right) \quad (11.4)\]

Kaplan– Meier estimator

For times t between dk and dk +1, we set S[�] ( t ) = S[�] ( dk ). Consequently, the Kaplan–Meier survival curve has a step-like shape.

The Kaplan–Meier survival curve for the BrainCancer data is displayed in Figure 11.2. Each point in the solid step-like curve shows the estimated probability of surviving past the time indicated on the horizontal axis. The estimated probability of survival past 20 months is 71%, which is quite a bit higher than the naive estimates of 55% and 68% presented earlier.

The sequential construction of the Kaplan–Meier estimator — starting at time zero and mapping out the observed events as they unfold in time — is fundamental to many of the key techniques in survival analysis. These include the log-rank test of Section 11.4, and Cox’s proportional hazard model of Section 11.5.2.

3The law of total probability states that for any two events A and B , Pr( A ) = Pr( _A B_ ) Pr( B ) + Pr( _A B[c]_ ) Pr( B[c] ), where B[c] is the complement of the event B , i.e. it is the event that B does not hold.

474 11. Survival Analysis and Censored Data

Figure 11.2

FIGURE 11.2. For the BrainCancer data, we display the Kaplan–Meier survival curve (solid curve), along with standard error bands (dashed curves).

Figure 11.3

FIGURE 11.3. For the BrainCancer data, Kaplan–Meier survival curves for males and females are displayed.

서브목차