
Statistical Foundations

Survival Analysis

Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.

Advanced · Tier 2 · Stable · Supporting · ~55 min

Why This Matters

Many important questions in ML and beyond are about when something happens, not just whether it happens. When will a customer churn? How long until a machine fails? When will a patient relapse?

The twist: you often do not observe the event for everyone. A clinical trial ends before all patients have relapsed. A customer is still active at the end of your observation window. This is censoring, and it makes standard regression inapplicable. Survival analysis is the framework for handling time-to-event data with censoring, and it appears throughout ML in churn prediction, reliability engineering, and clinical AI.

Mental Model

Imagine tracking 100 lightbulbs, recording when each burns out. After one year, 70 have burned out but 30 are still working. You cannot simply ignore those 30 (that biases your failure time estimate downward) or pretend they failed at day 365 (also biased). You need to use the partial information: those 30 survived at least 365 days.

Survival analysis makes principled use of this partial information. The Kaplan-Meier estimator and the Cox model both handle censoring correctly.

Formal Setup and Notation

Let $T$ be a non-negative random variable representing the true event time. Let $C$ be a non-negative random variable representing the censoring time. We observe $(Y, \Delta)$ where $Y = \min(T, C)$ is the observed time and $\Delta = \mathbf{1}(T \leq C)$ is the event indicator ($1$ if the event was observed, $0$ if censored).

Definition

Survival Function

The survival function is the probability of surviving beyond time $t$:

$$S(t) = P(T > t) = 1 - F(t)$$

where $F(t) = P(T \leq t)$ is the cumulative distribution function of $T$. The survival function is non-increasing with $S(0) = 1$.

Definition

Hazard Function

The hazard function (or hazard rate) is the instantaneous rate of the event at time $t$, given survival up to $t$:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} = \frac{f(t)}{S(t)}$$

where $f(t)$ is the density of $T$. The hazard can increase, decrease, or be constant over time.

Definition

Cumulative Hazard

The cumulative hazard is:

$$H(t) = \int_0^t h(u) \, du = -\log S(t)$$

This connects the hazard to the survival function via $S(t) = \exp(-H(t))$.
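The identity $S(t) = \exp(-H(t))$ is easy to verify numerically. A minimal sketch for a Weibull hazard (the shape and scale values here are arbitrary), integrating $h$ on a grid with the trapezoidal rule and comparing against the closed-form survival function:

```python
import numpy as np

# Weibull(shape k, scale lam): hazard h(t) = (k/lam) * (t/lam)^(k-1),
# survival S(t) = exp(-(t/lam)^k). Values chosen arbitrarily for illustration.
k, lam = 1.5, 2.0

def hazard(t):
    return (k / lam) * (t / lam) ** (k - 1)

def survival(t):
    return np.exp(-(t / lam) ** k)

# Integrate the hazard on a fine grid up to t = 3 (trapezoidal rule),
# then compare exp(-H(t)) with the closed-form survival function.
grid = np.linspace(1e-9, 3.0, 200_001)
steps = np.diff(grid) * 0.5 * (hazard(grid[:-1]) + hazard(grid[1:]))
H = np.concatenate([[0.0], np.cumsum(steps)])
S_from_H = np.exp(-H)

assert np.max(np.abs(S_from_H - survival(grid))) < 1e-4
```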

Core Definitions

Right censoring is the most common type: you know the subject survived at least until time $C$, but you do not know the true event time. The clean right-censoring cases are administrative censoring (the study ends) and dropout, both under an independent-censoring assumption.

A competing event is not ordinary right censoring. A competing event (e.g., death from another cause when studying time to relapse) prevents the event of interest from occurring; it is not merely a stop in observation. Treating competing events as independent right censoring and applying Kaplan-Meier estimates a hypothetical survival curve net of competing causes, not the real-world probability of the event before competing causes, and it biases cumulative-incidence estimates upward. Competing risks require dedicated tools chosen according to the target estimand: cause-specific hazards, the cumulative incidence function (Aalen-Johansen estimator), and Fine-Gray subdistribution-hazard models.

The key assumption for most survival methods is non-informative censoring: the censoring mechanism is independent of the event process, conditional on covariates. If sicker patients drop out more, censoring is informative and standard methods can be biased.

Definition

Kaplan-Meier Estimator

Let $t_1 < t_2 < \cdots < t_K$ be the distinct observed event times. At each $t_j$, let $d_j$ be the number of events and $n_j$ the number of subjects at risk (alive and uncensored just before $t_j$). The Kaplan-Meier estimator of the survival function is:

$$\hat{S}(t) = \prod_{j: t_j \leq t} \left(1 - \frac{d_j}{n_j}\right)$$

This is a step function that drops at each observed event time.
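The product-limit formula translates directly into code. A minimal sketch (the function name and data are illustrative, not from any library); the data reproduce the worked clinical-trial example later in this page:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate S_hat at each distinct event time.

    times:  observed times Y_i = min(T_i, C_i)
    events: 1 if the event was observed, 0 if censored
    Returns (event_times, survival) as parallel arrays.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])  # sorted ascending
    surv, s = [], 1.0
    for t in event_times:
        n_j = np.sum(times >= t)                     # at risk just before t
        d_j = np.sum((times == t) & (events == 1))   # events at t
        s *= 1.0 - d_j / n_j
        surv.append(s)
    return event_times, np.array(surv)

# Five subjects: deaths at months 3, 7, 7; censored at months 5 and 12.
t, s = kaplan_meier([3, 5, 7, 7, 12], [1, 0, 1, 1, 0])
# S_hat(3) = 1 - 1/5 = 0.8; S_hat(7) = 0.8 * (1 - 2/3) ~ 0.267
```

Note how the subject censored at month 5 shrinks the risk set at $t = 7$ (from 4 to 3) without contributing a step of its own.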

Definition

Cox Proportional Hazards Model

The Cox model specifies the hazard for a subject with covariates $x$ as:

$$h(t \mid x) = h_0(t) \exp(\beta^T x)$$

where $h_0(t)$ is an unspecified baseline hazard and $\beta$ are regression coefficients. The model is semiparametric: $h_0(t)$ is left completely unspecified, and inference on $\beta$ uses partial likelihood.

Main Theorems

Theorem

Consistency of the Kaplan-Meier Estimator

Statement

Under non-informative right censoring with $n$ i.i.d. observations, the Kaplan-Meier estimator is uniformly consistent: for any $\tau$ such that $P(Y \geq \tau) > 0$,

$$\sup_{t \leq \tau} |\hat{S}(t) - S(t)| \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty$$

Furthermore, $\sqrt{n}(\hat{S}(t) - S(t))$ converges to a Gaussian process, and the pointwise variance is estimated by Greenwood's formula:

$$\widehat{\mathrm{Var}}(\hat{S}(t)) = \hat{S}(t)^2 \sum_{j: t_j \leq t} \frac{d_j}{n_j(n_j - d_j)}$$
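Greenwood's formula is a running sum over the same $(d_j, n_j)$ pairs that drive the product-limit estimate. A small worked computation for a toy dataset with deaths at two event times (the counts are illustrative):

```python
import math

# Toy data: event time 1 has d=1 death among n=5 at risk;
# event time 2 has d=2 deaths among n=3 at risk.
steps = [(1, 5), (2, 3)]  # (d_j, n_j) at each event time

s_hat, gw_sum = 1.0, 0.0
for d, n in steps:
    s_hat *= 1 - d / n              # Kaplan-Meier product
    gw_sum += d / (n * (n - d))     # Greenwood summand

var = s_hat**2 * gw_sum             # estimated Var(S_hat(t))
se = math.sqrt(var)                 # pointwise standard error ~ 0.226
```

Note how the second factor, with only $n_j = 3$ at risk, dominates the variance: this is the right-tail unreliability discussed below.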

Intuition

Each factor $(1 - d_j/n_j)$ in the Kaplan-Meier product estimates the conditional probability of surviving past $t_j$ given survival to $t_j$. By the law of large numbers, each factor converges to the true conditional survival probability. The product of consistent factors is consistent.

Proof Sketch

Write $\hat{S}(t) = \prod (1 - d_j/n_j)$ and take logs: $\log \hat{S}(t) = \sum \log(1 - d_j/n_j)$. Each term is approximately $-d_j/n_j$, which estimates the discrete hazard increment. By a Glivenko-Cantelli argument applied to the Nelson-Aalen estimator $\hat{H}(t) = \sum_{j: t_j \leq t} d_j/n_j$, the cumulative hazard converges uniformly. Since $\hat{S}(t) \approx \exp(-\hat{H}(t))$, uniform convergence of $\hat{S}$ follows via the continuous mapping theorem.

Why It Matters

The Kaplan-Meier estimator is the nonparametric workhorse of survival analysis. This theorem guarantees it works: given enough data, it accurately recovers the true survival curve even when some observations are censored.

Failure Mode

The estimator becomes unreliable in the right tail, where few subjects remain at risk ($n_j$ is small). Confidence intervals widen dramatically. The estimator is undefined beyond the largest observed time if that observation is censored.

Theorem

Cox Partial Likelihood

Statement

Under the Cox model $h(t \mid x) = h_0(t) \exp(\beta^T x)$, the regression coefficients $\beta$ can be estimated by maximizing the partial likelihood:

$$L(\beta) = \prod_{j=1}^{K} \frac{\exp(\beta^T x_{(j)})}{\sum_{i \in \mathcal{R}_j} \exp(\beta^T x_i)}$$

where $x_{(j)}$ is the covariate vector of the subject who failed at $t_j$ and $\mathcal{R}_j = \{i : Y_i \geq t_j\}$ is the risk set at time $t_j$. The maximum partial likelihood estimator $\hat{\beta}$ is consistent and asymptotically normal.

Intuition

At each event time, the partial likelihood asks: given that someone in the risk set failed, what is the probability it was the subject who actually failed? This depends only on $\beta$ (through the relative hazards), not on $h_0(t)$. By conditioning on the event times, we eliminate the nuisance parameter $h_0(t)$.

Proof Sketch

Condition the full likelihood on the observed event times and the number of events at each time. The baseline hazard $h_0(t)$ cancels in the conditional probability because it appears in both numerator and denominator. The resulting expression is Cox's partial likelihood. Consistency and asymptotic normality follow from the general theory of estimating equations (the partial likelihood score is an unbiased estimating equation).

Why It Matters

The Cox model is the most widely used regression model in survival analysis because it avoids specifying the baseline hazard. You can estimate covariate effects $\beta$ without making any parametric assumption about how the baseline risk changes over time. This flexibility is why it dominates clinical trials and industrial reliability.
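For a single covariate, maximizing the partial likelihood takes only a few lines. A minimal sketch on synthetic data, assuming no tied event times (with continuous times, ties have probability zero, and the suffix-sum trick then computes every risk-set denominator in one pass); a real analysis would use lifelines or R's survival package:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Synthetic data: binary covariate, exponential event times with hazard
# exp(beta_true * x), independent exponential censoring.
rng = np.random.default_rng(0)
n, beta_true = 2000, 1.0
x = rng.integers(0, 2, size=n).astype(float)
T = rng.exponential(1.0 / np.exp(beta_true * x))   # true event times
C = rng.exponential(2.0, size=n)                   # censoring times
times = np.minimum(T, C)
events = (T <= C).astype(float)

def neg_partial_loglik(beta):
    # Sort by time; the risk set of the i-th sorted subject is the suffix
    # {j : t_j >= t_i}, so the denominators are reverse cumulative sums.
    order = np.argsort(times)
    e, xv = events[order], x[order]
    theta = np.exp(beta * xv)
    risk = np.cumsum(theta[::-1])[::-1]
    return -np.sum(e * (beta * xv - np.log(risk)))

beta_hat = minimize_scalar(neg_partial_loglik, bounds=(-5, 5), method="bounded").x
# beta_hat recovers beta_true = 1.0 up to sampling noise
```

No baseline hazard appears anywhere in the objective, which is exactly the point of the partial likelihood.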

Failure Mode

The proportional hazards assumption requires that hazard ratios between groups are constant over time. If treatment A is better early but worse late, the Cox model gives a misleading average effect. Always check this assumption with Schoenfeld residuals or log-log survival plots.

Log-Rank Test

The log-rank test is the standard nonparametric test for equality of two survival curves. Let $t_1 < t_2 < \cdots < t_K$ be the distinct event times pooled across both groups. At each $t_j$, let $d_{1j}, d_{2j}$ be the event counts and $n_{1j}, n_{2j}$ the at-risk counts in groups 1 and 2, with totals $d_j = d_{1j} + d_{2j}$ and $n_j = n_{1j} + n_{2j}$.

Under the null hypothesis $H_0: S_1(t) = S_2(t)$, the expected number of events in group 1 at $t_j$ conditional on $d_j$ is $E_{1j} = d_j n_{1j} / n_j$, with hypergeometric variance $V_{1j} = d_j (n_j - d_j) n_{1j} n_{2j} / (n_j^2 (n_j - 1))$. The log-rank statistic is:

$$Z = \frac{\sum_{j=1}^{K} (d_{1j} - E_{1j})}{\sqrt{\sum_{j=1}^{K} V_{1j}}}$$

Asymptotically, $Z \sim \mathcal{N}(0, 1)$ under $H_0$, so $Z^2 \sim \chi^2_1$. The log-rank test is the score test from the Cox model with a single binary covariate, which makes it especially well matched to proportional-hazards alternatives: it is locally most powerful against contiguous PH alternatives. Its power is not guaranteed in general; it depends on effect size, the censoring distribution, sample size and group allocation, and the baseline hazard. The test loses power when hazards cross, since early and late deviations partially cancel.
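The statistic is a direct loop over pooled event times. A sketch (the function and the synthetic data are illustrative):

```python
import numpy as np

def logrank_z(times1, events1, times2, events2):
    """Two-sample log-rank Z statistic (illustrative sketch)."""
    t1, e1 = np.asarray(times1, float), np.asarray(events1, int)
    t2, e2 = np.asarray(times2, float), np.asarray(events2, int)
    num = var = 0.0
    for t in np.unique(np.concatenate([t1[e1 == 1], t2[e2 == 1]])):
        n1, n2 = np.sum(t1 >= t), np.sum(t2 >= t)     # at-risk counts
        d1 = np.sum((t1 == t) & (e1 == 1))            # events in group 1
        d2 = np.sum((t2 == t) & (e2 == 1))
        n, d = n1 + n2, d1 + d2
        if n < 2:
            continue  # variance term undefined with one subject at risk
        num += d1 - d * n1 / n                        # observed minus expected
        var += d * (n - d) * n1 * n2 / (n**2 * (n - 1))
    return num / np.sqrt(var)

# Group 2 has twice the hazard of group 1 (no censoring), so |Z| should be large.
rng = np.random.default_rng(1)
z = logrank_z(rng.exponential(1.0, 200), np.ones(200),
              rng.exponential(0.5, 200), np.ones(200))
```

With a hazard ratio of 2 and 400 events, $|Z|$ lands far beyond any conventional critical value.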

Accelerated Failure Time (AFT) Models

AFT models are an alternative to Cox proportional hazards. Instead of multiplying the hazard, covariates rescale time:

$$\log T = \beta^T x + \epsilon$$

where $\epsilon$ has a specified parametric distribution. Equivalently, $S(t \mid x) = S_0(t \cdot \exp(-\beta^T x))$, so covariates accelerate (shrink time) or decelerate (stretch time) progression to the event.

Common choices for $\epsilon$ give named models:

  • Extreme value (Gumbel) gives the Weibull AFT, the only family that is both AFT and proportional hazards
  • Normal gives the log-normal AFT
  • Logistic gives the log-logistic AFT

AFT coefficients have a direct interpretation on the time scale: $\exp(\beta_k)$ is the time ratio for a unit change in covariate $x_k$. When proportional hazards fails but a parametric time-scale interpretation is defensible, an AFT model is often preferable to a misspecified Cox model. Estimation is by full maximum likelihood, since both the baseline distribution and the covariate structure are parametric.
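The time-ratio interpretation is easy to see in the uncensored case, where a log-normal AFT reduces to ordinary least squares on $\log T$. A sketch on synthetic data (all names and parameter values here are illustrative; with censoring, you would maximize the censored likelihood instead):

```python
import numpy as np

# Log-normal AFT with one binary covariate: log T = beta*x + eps, eps ~ N(0, 0.5).
rng = np.random.default_rng(2)
n, beta_true = 5000, 0.7
x = rng.integers(0, 2, size=n).astype(float)
log_T = beta_true * x + rng.normal(0.0, 0.5, size=n)

# No censoring here, so maximum likelihood is just least squares on log T.
X = np.column_stack([np.ones(n), x])
intercept, beta_hat = np.linalg.lstsq(X, log_T, rcond=None)[0]
time_ratio = np.exp(beta_hat)   # subjects with x = 1 take about exp(0.7) ~ 2x longer
```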

Time-Varying Covariates in Cox

Many covariates change during follow-up: biomarker levels, treatment dosage, employment status. Let $x_i(t)$ denote the covariate vector of subject $i$ at time $t$. The Cox model extends to

$$h(t \mid x_i(t)) = h_0(t) \exp(\beta^T x_i(t))$$

and the partial log-likelihood becomes:

$$\ell(\beta) = \sum_{i: \Delta_i = 1} \left[ \beta^T x_i(t_i) - \log \sum_{j \in \mathcal{R}_i} \exp(\beta^T x_j(t_i)) \right]$$

At each event time $t_i$, the risk-set contributions use each subject's covariate value at that exact time, not at baseline. Implementation uses the counting-process (start, stop, event) data format, with one row per interval during which $x_i$ is constant.

Distinguish external covariates (their path is defined regardless of the subject's survival, such as calendar time or ambient temperature) from internal covariates (measurements on the subject, such as CD4 count). Internal covariates that are affected by treatment can block causal effects and bias inference when used as regression adjustments.
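The counting-process expansion is just one row per interval of constant covariate value. A sketch of the (start, stop, event) layout; the column order, the subject id, and the lookup helper are illustrative conventions, not a library API:

```python
# Subject 7: followed from t=0 to t=10, biomarker changes from 1.2 to 3.4 at t=4,
# event observed at t=10. One row per interval of constant covariate value.
rows = [
    # (id, start, stop, event, biomarker)
    (7, 0.0, 4.0, 0, 1.2),    # no event during (0, 4]; covariate 1.2
    (7, 4.0, 10.0, 1, 3.4),   # event at t = 10; covariate 3.4 on (4, 10]
]

# At an event time t, a subject is in the risk set iff some row has
# start < t <= stop, and contributes that row's covariate value.
def covariate_at(rows, subject_id, t):
    for sid, start, stop, event, biom in rows:
        if sid == subject_id and start < t <= stop:
            return biom
    return None  # not at risk at time t
```

This is exactly the bookkeeping the time-varying partial likelihood needs: `covariate_at` returns 1.2 for event times before the change and 3.4 afterward.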

Deep Survival Models

Replacing $\beta^T x$ with a flexible function class gives neural and tree-based survival models; some, like DeepSurv, keep the partial likelihood structure intact.

DeepSurv (Katzman et al. 2018, arXiv:1606.00931) replaces the linear predictor with a neural network $f_\theta(x)$:

$$h(t \mid x) = h_0(t) \exp(f_\theta(x))$$

Training maximizes the Cox partial likelihood with $f_\theta$ in place of $\beta^T x$. The model captures nonlinear interactions and has matched or beaten the linear Cox model on several clinical datasets.

Random Survival Forests (Ishwaran et al. 2008) build an ensemble of survival trees, splitting on the log-rank statistic at each node. Each tree returns a cumulative hazard function, and the forest averages them. RSF makes no proportional-hazards assumption and handles high-dimensional covariates.

DeepHit (Lee et al. 2018) is a fully parametric alternative that directly outputs a discrete survival distribution via softmax, handling competing risks without a Cox-style factorization.

Canonical Examples

Example

Clinical trial with censoring

A drug trial enrolls 5 patients. Event times (death or end of study) are: patient 1: death at month 3, patient 2: censored at month 5 (dropped out), patient 3: death at month 7, patient 4: death at month 7, patient 5: censored at month 12 (study ended).

The Kaplan-Meier estimate: at $t = 3$, $\hat{S}(3) = 1 - 1/5 = 0.8$. At $t = 7$, 3 subjects remain at risk (patients 3, 4, 5), and 2 die, so $\hat{S}(7) = 0.8 \times (1 - 2/3) \approx 0.267$. Patient 2's censoring at month 5 reduces the risk set but does not create a step in the survival curve.

Common Confusions

Watch Out

Censored observations are not missing data

Censored observations carry information: the event did not happen before time $C$. Dropping censored observations wastes data and biases the analysis (you would systematically underestimate survival times). The Kaplan-Meier and Cox methods use all observations.

Watch Out

The Cox model does not model the survival function directly

The Cox model models the hazard, not the survival function. To get survival predictions, you must estimate $h_0(t)$ separately (e.g., via the Breslow estimator). The partial likelihood only gives you $\beta$, which tells you how covariates affect relative risk.
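To make this concrete, here is a sketch of the Breslow cumulative baseline hazard for a single covariate. With $\hat{\beta} = 0$ it reduces to the Nelson-Aalen estimator, which gives an easy check against the five-patient clinical-trial example above:

```python
import numpy as np

def breslow_cumhaz(times, events, x, beta_hat, t_eval):
    # H0_hat(t) = sum over event times t_j <= t of d_j / sum_{i in R_j} exp(beta*x_i)
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    x = np.asarray(x, float)
    H0 = 0.0
    for t in np.unique(times[events == 1]):   # distinct event times, ascending
        if t > t_eval:
            break
        d_j = np.sum((times == t) & (events == 1))
        denom = np.sum(np.exp(beta_hat * x[times >= t]))   # risk-set sum
        H0 += d_j / denom
    return H0

# With beta_hat = 0 and all covariates zero, this is Nelson-Aalen; on the
# five-patient data (deaths at 3, 7, 7; censored at 5, 12): H0(7) = 1/5 + 2/3.
H0 = breslow_cumhaz([3, 5, 7, 7, 12], [1, 0, 1, 1, 0], np.zeros(5), 0.0, 7.0)
```

Survival predictions then follow as $S(t \mid x) = \exp(-\hat{H}_0(t)\, e^{\hat{\beta} x})$.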

Summary

  • Censoring means you observe the event for some subjects but not all; ignoring it introduces bias
  • The Kaplan-Meier estimator handles censoring by reducing the risk set at censoring times without counting them as events
  • The hazard $h(t)$ is the instantaneous event rate given survival; it relates to survival via $S(t) = \exp(-\int_0^t h(u)\, du)$
  • The Cox model $h(t \mid x) = h_0(t) \exp(\beta^T x)$ avoids specifying the baseline hazard by using partial likelihood
  • Always check the proportional hazards assumption when using the Cox model

Exercises

ExerciseCore

Problem

Five subjects have observed times $Y = (2, 5^+, 6, 8^+, 9)$, where $^+$ denotes censoring. Compute the Kaplan-Meier survival estimate at each observed event time.

ExerciseAdvanced

Problem

In the Cox model, show that the partial likelihood does not depend on the baseline hazard $h_0(t)$. Start from the conditional probability that subject $(j)$ fails at time $t_j$ given that exactly one failure occurs at $t_j$ from the risk set $\mathcal{R}_j$.

ExerciseResearch

Problem

The proportional hazards assumption is restrictive. Describe two situations where it fails and what alternative models you might use instead.

References

Canonical:

  • Cox, Regression Models and Life-Tables (1972), JRSS Series B
  • Kalbfleisch & Prentice, The Statistical Analysis of Failure Time Data (2nd ed., 2002), Ch 4-6 for Cox and partial likelihood
  • Klein & Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data (2nd ed., 2003), Ch 4 (Kaplan-Meier), Ch 7 (log-rank), Ch 8-9 (Cox)
  • Andersen, Borgan, Gill & Keiding, Statistical Models Based on Counting Processes (1993), Aalen counting-process framework

Applied and diagnostic:

  • Kleinbaum & Klein, Survival Analysis: A Self-Learning Text (3rd ed., 2012, Springer), Ch 1-4 for Kaplan-Meier, log-rank, Cox
  • Therneau & Grambsch, Modeling Survival Data: Extending the Cox Model (2000), Cox diagnostics and time-varying covariates

Machine learning:

  • Katzman et al., DeepSurv: Personalized Treatment Recommender with Cox Proportional Hazards Deep Neural Network (2018), arXiv:1606.00931
  • Ishwaran et al., Random Survival Forests (2008), Annals of Applied Statistics
  • Kvamme et al., Time-to-Event Prediction with Neural Networks and Cox Regression (JMLR 2019)


Last reviewed: April 26, 2026
