We observe \(n\) independent triplets \((X_i, \delta_i, \mathbf{Z}_i), i = 1, \ldots, n\), where
The Cox model is used to estimate the effect of covariates \(\mathbf{Z}_i\) on continuous (but arbitrary) distribution of failure times \(T_i\) and to test its significance.
In this exercise we will deal with covariates constant in time.
We suppose that the hazard function \(\lambda\) of failure time distribution (assumed to be continuous) given covariates \(T|\mathbf{Z}\) is of the form: \[\lambda(t| \mathbf{Z}) = \lambda_0(t) \exp \left\{ \boldsymbol{\beta}^{\mathsf{T}} \mathbf{Z} \right\},\] where:
Notes:
\[\dfrac{\lambda(t| \mathbf{Z}_1)}{\lambda(t| \mathbf{Z}_2)} = \dfrac{\lambda_0(t)}{\lambda_0(t)} \cdot \dfrac{\exp \left\{ \boldsymbol{\beta}^{\mathsf{T}} \mathbf{Z}_1 \right\}}{\exp \left\{ \boldsymbol{\beta}^{\mathsf{T}} \mathbf{Z}_2 \right\}} = \exp \left\{ \boldsymbol{\beta}^{\mathsf{T}} (\mathbf{Z}_1 - \mathbf{Z}_2) \right\} \quad \text{is constant in time } t.\] Interpretation of \(\beta_j\) coefficient can be derived in the usual way: \[\exp \beta_j = \dfrac{\lambda(t| \mathbf{Z} + \mathbf{e}_j)}{\lambda(t| \mathbf{Z})}.\]
In the case of constant regressors we can express the Cox model in the terms of survival functions:
Model is estimated by partial likelihood method that shares properties with classical maximum likelihood method. We can perform tests for:
We can use the Breslow estimator of cumulative baseline hazard function \[ \widehat{\Lambda}_0(t) = \int \limits_{0}^{t} \dfrac{ \, \mathrm{d} \overline{N}(s)}{\sum \limits_{i=1}^n Y_i (s) \exp \left\{ \widehat{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{Z}_i \right\}} \] to obtain an estimate of general survival function \[ \widehat{S}(t| \mathbf{Z}) = \exp \left\{ - \widehat{\Lambda}(t| \mathbf{Z})\right\} = \exp \left\{ - \widehat{\Lambda}_0(t) \exp \left\{ \widehat{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{Z} \right\} \right\}. \]
We will work with the dataset pbc
contained in the survival
library, see Exercise 1 or help(pbc)
for description.
library(survival)
data(pbc)
variables <- c("id", "time", "status", "trt", "sex", "edema", "bili", "albumin", "platelet")
data <- pbc[, variables]
We will work with the following variables:
Var. name | Description |
---|---|
id | case number |
status | status at the end of follow-up: 0 = censored, 1 = transplanted, 2 = dead |
time | number of days of follow-up |
trt | treatment: 1 = D-penicillamin, 2 = placebo, NA = not randomised |
sex | sex m/f |
edema | 0 = no edema, 0.5 = untreated or successfully treated, 1 = edema despite diuretic therapy |
bili | serum bilirubin (mg/dl) |
albumin | serum albumin (g/dl) |
platelet | platelet count |
Here are initial rows of the dataset:
head(data)
## id time status trt sex edema bili albumin platelet
## 1 1 400 2 1 f 1.0 14.5 2.60 190
## 2 2 4500 0 1 f 0.0 1.1 4.14 221
## 3 3 1012 2 1 m 0.5 1.4 3.48 151
## 4 4 1925 2 1 f 0.5 1.8 2.54 183
## 5 5 1504 1 2 f 0.0 3.4 3.53 136
## 6 6 2503 2 2 f 0.0 0.8 3.98 NA
We are interested in modelling survival time in untransplanted patients. Patients who had liver transplant will be censored at the moment the transplant was done. Therefore, similarly as in Exercise 1 define \(\delta\) indicator as
data$delta = (data$status == 2)
First, we will start with explorative analysis using methods we already know - non-parametric estimation of survival function (Kaplan-Meier) and logrank test for comparing two (possibly more) different groups.
An illustration of possible output (try to optimize your code so that it can be easily used for several variables):
The function to fit the Cox proportional hazards model is called coxph
. It’s syntax is [example]
fit <- coxph(Surv(time,delta)~log(age)+chol+hepato,data=pbc)
See help(coxph)
for additional parameters such as weights
(in case of repeated observations), na.action
(missing-data filter function), ties
(methods for tie handling), …
You can use function summary
to view \(\boldsymbol{\beta}\) coefficient estimates (including exponentiated versions):
summary(fit)
## Call:
## coxph(formula = Surv(time, delta) ~ log(age) + chol + hepato,
## data = pbc)
##
## n= 284, number of events= 114
## (134 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## log(age) 2.406e+00 1.109e+01 5.231e-01 4.599 4.24e-06 ***
## chol 1.152e-03 1.001e+00 3.134e-04 3.675 0.000238 ***
## hepato 8.941e-01 2.445e+00 2.051e-01 4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## log(age) 11.087 0.09019 3.977 30.910
## chol 1.001 0.99885 1.001 1.002
## hepato 2.445 0.40899 1.636 3.655
##
## Concordance= 0.714 (se = 0.023 )
## Likelihood ratio test= 56.08 on 3 df, p=4e-12
## Wald test = 53.26 on 3 df, p=2e-11
## Score (logrank) test = 57.36 on 3 df, p=2e-12
In our example consider patient of age 50 with hepatomegaly and serum cholesterol 300 mg/dl.
There exists predict.coxph
function that seems to be useful for calculating survival probabilities, however, I strongly recommend not to use it! See Details of predict.coxph
: reference values for baseline are means of covariates.
Estimator of baseline cumulative hazard \(\widehat{\Lambda}_0(t)\) can be obtained using function basehaz
with option centered = F
. General cumulative hazard function estimator can be obtained by combination of basehaz
and corresponding exponentiated linear predictor. Survival function estimator can be obtained using \(S = \exp \left\{ - \Lambda\right\}\) transformation:
L0 <- basehaz(fit, centered = F)
# Manual calculation of exp(linear predictor)
patient_LinPred <- sum(c(log(50), 300, 1)*fit$coefficients)
patient_Lambda <- L0$hazard * exp(patient_LinPred)
par(mfrow = c(1,2), mar = c(4,4,1,1))
plot(patient_Lambda~L0$time, type = "s", col = "skyblue", xlab = "Time [days]", ylab = "Cumulative hazard")
plot(exp(-patient_Lambda)~L0$time, type = "s", col = "skyblue", ylim = c(0,1),
xlab = "Time [days]", ylab = "Survival probability")
More practical way is to supply newdata
to survfit
function applied to fit
object (output of coxph
function):
newdata <- data.frame(age = 50, chol = 300, hepato = 1)
fitnew <- survfit(fit, newdata = newdata, conf.type = "plain")
par(mfrow = c(1,2), mar = c(4,4,1,1))
plot(fitnew, fun = "cumhaz", xlab = "Time [Days]", ylab = "Cumulative hazard", col = "skyblue", conf.int = F)
plot(fitnew, xlab = "Time [Days]", ylab = "Survival probability", col = "skyblue")
You can use libraries ggplot2
and survminer
to obtain visually nice plots:
library(survminer)
library(ggplot2)
newdata <- with(pbc, data.frame(age = c( 50, 50, 50),
chol = c(300, 300, 300),
hepato = c( 1, 0, 0.5176056)))
fitnew <- survfit(fit, newdata = newdata, conf.type = "plain")
ggsurvplot(fitnew, data = newdata,
conf.int = TRUE, censor= FALSE,
legend.labs=c("Patient 1", "Patient 2", "Patient 3"),
ggtheme = theme_minimal())
The logic of specifying the model formula is the same as with functions lm
or glm
. Printing the results and performing tests of hypotheses for model building is also very similar to other regression functions in R:
summary(fit)
anova(fit)
anova(fit1,fit2)
drop1(fit,test="Chisq")
add1(fit, scope = "bili:trt", test = "Chisq")
The parameter estimation is based on the maximum partial likelihood method but the asymptotic properties, tests etc. are all the same as with ordinary likelihood methods. In particular, a submodel is tested against a wider model by likelihood ratio tests performed by functions anova
or drop1
. The difference of log-partial-likelihoods multiplied by 2 has asymptotically \(\chi^2_m\) distribution if the submodel is true, where \(m\) is the difference in the number of parameters. The tests shown in the parameter table of summary
function are Wald tests for zero values of the individual parameters.