Survival analysis typically focuses on *time-to-event data*. In the most general sense, it consists of techniques for positive-valued random variables, such as

- time to death
- time to onset (or relapse) of a disease
- length of a contract
- duration of a policy
- money paid by health insurance
- viral load measurements
- time to finishing a master thesis

Typically, survival data are **not fully observed**, but rather are *censored*.

library(ggplot2)
library(survival)
library(KMsurv)

**Failure time random variables** are always non-negative. That is, if we denote the failure time by \(T\), then \(T\geq 0\).

A random variable X is called a **censored failure time random variable** if \(X = \min(T,U)\), where \(U\) is a non-negative *censoring variable*.

In order to *define a failure time random variable*, we need:

- an unambiguous **time origin** (e.g. randomization to clinical trial, purchase of car)
- a **time scale** (e.g. real time (days, years), mileage of a car)
- definition of the **event** (e.g. death, need a new car transmission)

The illustration of survival data shows several features which are typically encountered in analysis of survival data:

- individuals do not all enter the study at the same time
- when the study ends, some individuals still haven't had the event yet
- other individuals drop out or get lost in the middle of the study, and all we know about them is the last time they were still 'free' of the event

The first feature is referred to as 'staggered entry'. The last two features relate to 'censoring' of the failure time events.

Only the r.v. \(X_i=\min(T_i, U_i)\) is observed due to:

- loss to follow-up
- drop-out
- study termination

We call this right-censoring because the true unobserved event is to the right of our censoring time; i.e., all we know is that the event has not happened at the end of follow-up.

In addition to observing \(X_i\), we also get to see the **failure indicator**:
\[
\delta_i=\left\{\begin{array}{c}
1 \mbox{ if } T_i\leq U_i\\
0 \mbox{ if } T_i> U_i
\end{array}\right.
\]
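
To make these definitions concrete, here is a small simulation sketch (the exponential failure and uniform censoring distributions are illustrative choices, not taken from any example in these notes):

set.seed(1)
n = 10
T.true = rexp(n, rate = 0.1)      # latent failure times T_i
U = runif(n, 0, 15)               # latent censoring times U_i
X = pmin(T.true, U)               # observed X_i = min(T_i, U_i)
delta = as.numeric(T.true <= U)   # failure indicator delta_i
Surv(X, delta)                    # '+' marks censored observations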

Some software packages instead assume we have a **censoring indicator**:
\[
c_i=\left\{\begin{array}{c}
0 \mbox{ if } T_i\leq U_i\\
1 \mbox{ if } T_i> U_i
\end{array}\right.
\]

Right-censoring is the most common type of censoring assumption we will deal with in survival analysis.

**Left-censoring**: we can only observe \(Y_i=\max(T_i, U_i)\) and the **failure indicators**:
\[
\delta_i=\left\{\begin{array}{c}
1 \mbox{ if } U_i\leq T_i\\
0 \mbox{ if } U_i> T_i
\end{array}\right.
\]
e.g. (Miller) study of age at which African children learn a task. Some already knew (left-censored), some learned during study (exact), some had not yet learned by end of study (right-censored).

**Interval censoring**: we observe \((L_i, R_i)\), where \(T_i\in (L_i, R_i)\).

There are several equivalent ways to characterize the probability distribution of a survival random variable. Some of these are familiar; others are special to survival analysis. We will focus on the following terms:

- The density function \(f(t)\)
- The survivor function \(S(t)\)
- The hazard function \(\lambda(t)\)
- The cumulative hazard function \(\Lambda(t)\)

**discrete**: suppose that \(T\) takes values in \(a_1, a_2,\ldots\):
\[
f(t)=\mathsf{P}[T=t]=\left\{\begin{array}{l} f_j \mbox{ if } t=a_j,\,j=1,2,\ldots\\ 0 \mbox{ otherwise} \end{array}\right.
\]
**continuous**:
\[
f(t)=\lim_{\Delta t\to 0_+}\frac{\mathsf{P}[t\leq T\leq t+\Delta t]}{\Delta t}
\]

Survivorship (or survivor or survival) function \(S(t)=\mathsf{P}[T\geq t]\).

In other settings, the cumulative distribution function, \(F(t)=\mathsf{P}[T\leq t]\), is of interest. In survival analysis, our interest tends to focus on the *survival function*, \(S(t)\).

**discrete**:
\[
S(t)=\sum_{j:\,a_j\geq t}f_j
\]
**continuous**:
\[
S(t)=\int_t^{\infty}f(u)\,\mbox{d}u
\]

*Remarks:*

- From the definition of \(S(t)\), \(S(t)=1-F(t)\) as long as \(T\) is a continuous random variable
- For a discrete variable, we have to decide what to do if an event occurs exactly at time \(t\); i.e., does that become part of \(F(t)\) or \(S(t)\)?
- To get around this problem, several books define \(S(t)=\mathsf{P}[T>t]\), or else define \(F(t)=\mathsf{P}[T < t]\) (e.g. Collett)

The **hazard function** is sometimes called an *instantaneous failure rate*, the *force of mortality*, or the *age-specific failure rate*.

**discrete**:
\[
\lambda_j=\mathsf{P}[T=a_j|T\geq a_j]=\frac{f_j}{\sum_{k:\,a_k\geq a_j}f_k}
\]
**continuous**:
\[
\lambda(t)=\lim_{\Delta t\to 0_+}\frac{\mathsf{P}[t\leq T\leq t+\Delta t|T\geq t]}{\Delta t}=\frac{f(t)}{S(t)}
\]

**discrete**:
\[
\Lambda(t)=\sum_{j:\,a_j\le t}\lambda_j
\]
**continuous**:
\[
\Lambda(t)=\int_0^t\lambda(u)\,\mbox{d}u
\]

\(S(t)\) and \(\lambda(t)\) for a *continuous* r. v.
\[
\lambda(t)=-\frac{\mbox{d}}{\mbox{d}t}[\log S(t)]
\]

\(S(t)\) and \(\Lambda(t)\) for a *continuous* r. v.
\[
S(t)=\exp\{-\Lambda(t)\}
\]
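
As a quick numerical check of these identities, take an exponential failure time (the rate 0.5 below is an arbitrary illustrative choice; its hazard is constant):

tt = seq(0.5, 5, by = 0.5)
f = dexp(tt, rate = 0.5)
S = pexp(tt, rate = 0.5, lower.tail = FALSE)   # S(t) = P[T > t]
lambda = f / S                                 # hazard; equals 0.5 everywhere
Lambda = -log(S)                               # cumulative hazard = 0.5 * t
all.equal(lambda, rep(0.5, length(tt)))        # TRUE
all.equal(S, exp(-Lambda))                     # TRUE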

\(S(t)\) and \(\Lambda(t)\) for a *discrete* r. v. (suppose that \(a_j < t \leq a_{j+1}\))
\[
S(t)=\prod_{j:\,a_j < t}(1-\lambda_j)
\]
Cox defines \(\Lambda(t)=-\sum_{j:\,a_j < t}\log (1-\lambda_j)\) so that \(S(t)=\exp\{-\Lambda(t)\}\) in the discrete case as well.

*Mean survival* - call this \(\mu\)

**discrete**:
\[
\mu=\sum_j a_j f_j
\]
**continuous**:
\[
\mu=\int_0^{\infty} uf(u)\,\mbox{d}u
\]

*Median survival* - call this \(\tau\), is defined by
\[
S(\tau)=0.5
\]
Similarly, any other percentile could be defined.

In practice, we don't usually hit the median survival at exactly one of the failure times. In this case, the estimated median survival is the *smallest* time \(\tau\) such that \(\widehat{S}(\tau)\leq 0.5\).
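
As a small sketch of this rule, using the KM estimates for the treated leukemia arm that appear later in these notes:

tau = c(6, 7, 10, 13, 16, 22, 23)                         # death times
S.km = c(0.857, 0.807, 0.753, 0.690, 0.627, 0.538, 0.448) # KM estimates
min(tau[S.km <= 0.5])   # estimated median survival: 23 weeks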

We can estimate the survival (or hazard) function in two ways:

- by specifying a parametric model for \(\lambda(t)\) based on a particular density function \(f(t)\)
- by developing an empirical estimate of the survival function (i.e., non-parametric estimation)

The empirical estimate of the survival function, \(\widehat{S}(t)\), is the proportion of individuals with event times greater than \(t\).

If there are censored observations, then \(\widehat{S}(t)\) is not a good estimate of the true \(S(t)\), so other non-parametric methods must be used to account for censoring (life-table methods, Kaplan-Meier estimator)

Next we will discuss the most famous non-parametric approach for estimating the survival distribution, called the *Kaplan-Meier estimator*.

To motivate the derivation of this estimator, we will first consider a set of survival times where there is *no censoring*.

The following are times to relapse (weeks) for 21 leukemia patients receiving control treatment (Table 1.1 of Cox & Oakes):

1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23

How would we estimate \(S(10)\), the probability that an individual survives to time 10 or later?

What about \(\widehat{S}(8)\)? Is it \(12/21\) or \(8/21\)?

When there is *no censoring*, the general formula is:
\[
\widehat{S}(t)=\frac{\# \mbox{ of individuals with } T\geq t}{\mbox{total sample size}}
\]
In most software packages, the survival function is evaluated just after time \(t\), i.e., at \(t^+\). In this case, we only count the individuals with \(T > t\).
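
A quick sketch of this computation for the relapse times above:

relapse = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23)
S.emp = function(t) mean(relapse >= t)   # empirical P[T >= t]
S.emp(10)   # 8/21
S.emp(8)    # 12/21; counting T > 8 instead gives 8/21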

What if there *is censoring*? [Note: times with \(+\) are right censored.]

6+, 6, 6, 6, 7, 9+, 10+, 10, 11+, 13, 16, 17+, 19+, 20+, 22, 23, 25+, 32+, 32+, 34+, 35+

We know \(\widehat{S}(6)=21/21\), because everyone survived to time 6 or beyond. But we can't say \(\widehat{S}(7)=17/21\), because we don't know the status of the person who was censored at time 6.

In a 1958 paper in the *Journal of the American Statistical Association*, Kaplan and Meier proposed a way to nonparametrically estimate \(S(t)\), even in the presence of censoring. The method is based on the ideas of *conditional probability*.

Suppose \(a_k < t \leq a_{k+1}\). Then
\[
S(t)=\mathsf{P}[T\geq a_{k+1}]=\mathsf{P}[T\geq a_1,\ldots,T\geq a_{k+1}]=\prod_{j=1}^{k}\{1-\mathsf{P}[T=a_j|T\geq a_j]\}=\prod_{j=1}^{k}\{1-\lambda_j\}
\]
So
\[
\widehat{S}(t)=\prod_{j:\,a_j < t}\left(1-\frac{d_j}{r_j}\right)
\]
where \(d_j\) is the *number of deaths* at \(a_j\) and \(r_j\) is the *number at risk* at \(a_j\).

Think of dividing the observed timespan of the study into a series of fine intervals so that there is a separate interval for each time of death or censoring:

_ , _ , D , _ , C , _ , C , D , D , D

Using the law of conditional probability, \[ \mathsf{P}[T\geq t]=\prod_j\mathsf{P}[\mbox{survive $j$ th interval } I_j | \mbox{ survived to start of } I_j] \] where the product is taken over all the intervals including or preceding time \(t\).

4 possibilities for each interval:

- **No events (death or censoring)** - conditional probability of surviving the interval is 1
- **Censoring** - assume they survive to the end of the interval, so that the conditional probability of surviving the interval is 1
- **Death, but no censoring** - conditional probability of not surviving the interval is # deaths (\(d\)) divided by # 'at risk' (\(r\)) at the beginning of the interval. So the conditional probability of surviving the interval is \(1 - (d/r)\)
- **Tied deaths and censoring** - assume censorings last to the end of the interval, so that the conditional probability of surviving the interval is still \(1 - (d/r)\)

It turns out we can write a general formula for the conditional probability of surviving the j-th interval that holds for all 4 cases.

We could use the same approach by grouping the event times into intervals (say, one interval for each month), and then counting up the number of deaths (events) in each to estimate the probability of surviving the interval (this is called the *lifetable estimate*).

However, the assumption that those censored last until the end of the interval wouldn't be quite accurate, so we would end up with a cruder approximation.

As the intervals get finer and finer, the approximations made in estimating the probabilities of getting through each interval become smaller and smaller, so that the estimator converges to the true \(S(t)\).

This intuition clarifies why an alternative name for the KM is the *product limit estimator*.

\[
\widehat{S}(t)=\prod_{j:\,\tau_j < t}\left(1-\frac{d_j}{r_j}\right)
\]
where

- \(\tau_1,\ldots,\tau_K\) is the set of \(K\) distinct death times observed in the sample
- \(d_j\) is the number of deaths at \(\tau_j\)
- \(r_j\) is the number of individuals 'at risk' right before the j-th death time (everyone whose death or censoring time is at or after \(\tau_j\))
- \(c_j\) is the number of censored observations between the j-th and (j+1)-st death times. Censorings tied at \(\tau_j\) are included in \(c_j\)

*Remarks:*

- \(r_j=r_{j-1}-d_{j-1}-c_{j-1}\)
- \(r_j=\sum_{l\geq j}(c_l+d_l)\)
- \(\widehat{S}(t^+)\) changes only at the death (failure) times
- \(\widehat{S}(t^+)\) is 1 up to the first death time
- \(\widehat{S}(t^+)\) only goes to 0 if the last event is a death
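
To make the formula concrete, here is a sketch that computes the product by hand from the censored times listed earlier; it reproduces the `survfit` output shown below:

time  = c(6, 6, 6, 6, 7, 9, 10, 10, 11, 13, 16, 17, 19, 20, 22, 23, 25, 32, 32, 34, 35)
event = c(0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)
tau = sort(unique(time[event == 1]))                       # distinct death times tau_j
d = sapply(tau, function(s) sum(time == s & event == 1))   # deaths d_j
r = sapply(tau, function(s) sum(time >= s))                # at risk r_j
data.frame(tau, r, d, S.km = round(cumprod(1 - d/r), 3))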

The file `leukAB.dat` contains the failure/censoring times and the event indicators for the Leukemia example (treatment arm).

# read dataset, skipping the first 4 lines of the file:
leuk = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/leukAB.dat",
                  sep=" ", head=T, skip=4)
# print a few observations; t = time, f = event indicator
leuk[1:3,]

##   t f
## 1 6 0
## 2 6 1
## 3 6 1

fit.leuk = survfit(Surv(leuk$t, leuk$f) ~ 1)
summary(fit.leuk)

## Call: survfit(formula = Surv(leuk$t, leuk$f) ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     6     21       3    0.857  0.0764        0.720        1.000
##     7     17       1    0.807  0.0869        0.653        0.996
##    10     15       1    0.753  0.0963        0.586        0.968
##    13     12       1    0.690  0.1068        0.510        0.935
##    16     11       1    0.627  0.1141        0.439        0.896
##    22      7       1    0.538  0.1282        0.337        0.858
##    23      6       1    0.448  0.1346        0.249        0.807

plot(fit.leuk, xlab="Weeks", ylab="Proportion Survivors", col=3)

*Variance* of KM:
\[
\mathsf{var}\,\widehat{S}(t)=\left[\widehat{S}(t)\right]^2\sum_{j:\,\tau_j < t}\frac{d_j}{r_j(r_j-d_j)}
\]
For a 95% confidence interval, we could use
\[
\widehat{S}(t)\pm z_{1-\alpha/2}\mathsf{se}\left(\widehat{S}(t)\right)
\]
where \(\mathsf{se}\left(\widehat{S}(t)\right)\) is calculated using Greenwood's formula.

**Problem**: This approach can yield values \(> 1\) or \(< 0\).

**Better approach**: Get a 95% confidence interval for \(L(t) = \log(- \log(S(t)))\), i.e., *log-log approach*.
Since this quantity is unrestricted, the confidence interval will be in the proper range when we transform back.

Applying the *delta method*, we get:
\[
\mathsf{var}\,\widehat{L}(t)=\frac{1}{\left[\log \widehat{S}(t)\right]^2}\sum_{j:\,\tau_j < t}\frac{d_j}{r_j(r_j-d_j)}
\]
We take the square root of the above to get \(\mathsf{se}\left(\widehat{L}(t)\right)\), and then form the confidence intervals as:
\[
\widehat{S}(t)^{\exp\left\{\pm 1.96\mathsf{se}\left(\widehat{L}(t)\right)\right\}}
\]
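
As a check, the Greenwood variance and the log-log interval can be computed by hand at, say, the third death time (\(t=10\)) of the leukemia data, using \(d_j\) and \(r_j\) from the `survfit` output above:

d = c(3, 1, 1)     # deaths at the death times up to t = 10
r = c(21, 17, 15)  # numbers at risk there
S = prod(1 - d/r)                   # 0.753
gw = sum(d / (r * (r - d)))         # Greenwood sum
S * sqrt(gw)                        # se 0.0963, as in summary() above
se.L = sqrt(gw) / abs(log(S))       # se of log(-log S), by the delta method
S ^ exp(c(1, -1) * 1.96 * se.L)     # (0.503, 0.889): the log-log CI below
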
`R` gives an option to calculate these bounds (use `conf.type='log-log'` in `survfit`).

`R` (by default) uses a *log transformation* to stabilize the variance and allow for non-symmetric confidence intervals. This is what is normally done for the confidence interval of an estimated odds ratio.
\[
\mathsf{var}\left(\log \widehat{S}(t)\right)=\sum_{j:\,\tau_j < t}\frac{d_j}{r_j(r_j-d_j)}
\]

summary(fit.leuk)   # default: conf.type = "log"

## Call: survfit(formula = Surv(leuk$t, leuk$f) ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     6     21       3    0.857  0.0764        0.720        1.000
##     7     17       1    0.807  0.0869        0.653        0.996
##    10     15       1    0.753  0.0963        0.586        0.968
##    13     12       1    0.690  0.1068        0.510        0.935
##    16     11       1    0.627  0.1141        0.439        0.896
##    22      7       1    0.538  0.1282        0.337        0.858
##    23      6       1    0.448  0.1346        0.249        0.807

fit.leuk2 = survfit(Surv(leuk$t, leuk$f) ~ 1, conf.type="log-log")
summary(fit.leuk2)

## Call: survfit(formula = Surv(leuk$t, leuk$f) ~ 1, conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     6     21       3    0.857  0.0764        0.620        0.952
##     7     17       1    0.807  0.0869        0.563        0.923
##    10     15       1    0.753  0.0963        0.503        0.889
##    13     12       1    0.690  0.1068        0.432        0.849
##    16     11       1    0.627  0.1141        0.368        0.805
##    22      7       1    0.538  0.1282        0.268        0.747
##    23      6       1    0.448  0.1346        0.188        0.680

fit.leuk3 = survfit(Surv(leuk$t, leuk$f) ~ 1, conf.type="plain")
summary(fit.leuk3)   # notice CI symmetrical about S_hat

## Call: survfit(formula = Surv(leuk$t, leuk$f) ~ 1, conf.type = "plain")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     6     21       3    0.857  0.0764        0.707        1.000
##     7     17       1    0.807  0.0869        0.636        0.977
##    10     15       1    0.753  0.0963        0.564        0.942
##    13     12       1    0.690  0.1068        0.481        0.900
##    16     11       1    0.627  0.1141        0.404        0.851
##    22      7       1    0.538  0.1282        0.286        0.789
##    23      6       1    0.448  0.1346        0.184        0.712

Using the `ggsurv` function (built on `ggplot2`) to plot the KM estimator:

ggsurv <- function(s, CI = 'def', plot.cens = T, surv.col = 'gg.def',
                   cens.col = 'red', lty.est = 1, lty.ci = 2,
                   cens.shape = 3, back.white = F, xlab = 'Time',
                   ylab = 'Survival', main = ''){

  library(ggplot2)
  strata <- ifelse(is.null(s$strata), 1, length(s$strata))
  stopifnot(length(surv.col) == 1 | length(surv.col) == strata)
  stopifnot(length(lty.est) == 1 | length(lty.est) == strata)

  # plot for a single stratum
  ggsurv.s <- function(s, CI = 'def', plot.cens = T, surv.col = 'gg.def',
                       cens.col = 'red', lty.est = 1, lty.ci = 2,
                       cens.shape = 3, back.white = F, xlab = 'Time',
                       ylab = 'Survival', main = ''){

    dat <- data.frame(time = c(0, s$time), surv = c(1, s$surv),
                      up = c(1, s$upper), low = c(1, s$lower),
                      cens = c(0, s$n.censor))
    dat.cens <- subset(dat, cens != 0)

    col <- ifelse(surv.col == 'gg.def', 'black', surv.col)

    pl <- ggplot(dat, aes(x = time, y = surv)) +
      xlab(xlab) + ylab(ylab) + ggtitle(main) +
      geom_step(col = col, lty = lty.est)

    pl <- if(CI == T | CI == 'def') {
      pl + geom_step(aes(y = up), color = col, lty = lty.ci) +
        geom_step(aes(y = low), color = col, lty = lty.ci)
    } else (pl)

    pl <- if(plot.cens == T & nrow(dat.cens) > 0){  # nrow, not length: count censored rows
      pl + geom_point(data = dat.cens, aes(y = surv),
                      shape = cens.shape, col = cens.col)
    } else if (plot.cens == T & nrow(dat.cens) == 0){
      stop('There are no censored observations')
    } else (pl)

    pl <- if(back.white == T) {pl + theme_bw()} else (pl)
    pl
  }

  # plot for multiple strata
  ggsurv.m <- function(s, CI = 'def', plot.cens = T, surv.col = 'gg.def',
                       cens.col = 'red', lty.est = 1, lty.ci = 2,
                       cens.shape = 3, back.white = F, xlab = 'Time',
                       ylab = 'Survival', main = ''){

    n <- s$strata
    groups <- factor(unlist(strsplit(names(s$strata), '='))[seq(2, 2*strata, by = 2)])
    gr.name <- unlist(strsplit(names(s$strata), '='))[1]
    gr.df <- vector('list', strata)
    ind <- vector('list', strata)
    n.ind <- cumsum(c(0, n))
    for(i in 1:strata) ind[[i]] <- (n.ind[i]+1):n.ind[i+1]

    for(i in 1:strata){
      gr.df[[i]] <- data.frame(
        time = c(0, s$time[ ind[[i]] ]),
        surv = c(1, s$surv[ ind[[i]] ]),
        up = c(1, s$upper[ ind[[i]] ]),
        low = c(1, s$lower[ ind[[i]] ]),
        cens = c(0, s$n.censor[ ind[[i]] ]),
        group = rep(groups[i], n[i] + 1))
    }

    dat <- do.call(rbind, gr.df)
    dat.cens <- subset(dat, cens != 0)

    pl <- ggplot(dat, aes(x = time, y = surv, group = group)) +
      xlab(xlab) + ylab(ylab) + ggtitle(main) +
      geom_step(aes(col = group, lty = group))

    col <- if(length(surv.col) == 1){  # fixed: was length(surv.col == 1)
      scale_colour_manual(name = gr.name, values = rep(surv.col, strata))
    } else{
      scale_colour_manual(name = gr.name, values = surv.col)
    }

    pl <- if(surv.col[1] != 'gg.def'){
      pl + col
    } else {pl + scale_colour_discrete(name = gr.name)}

    line <- if(length(lty.est) == 1){
      scale_linetype_manual(name = gr.name, values = rep(lty.est, strata))
    } else {scale_linetype_manual(name = gr.name, values = lty.est)}

    pl <- pl + line

    pl <- if(CI == T) {
      if(length(surv.col) > 1 && length(lty.est) > 1){
        stop('Either surv.col or lty.est should be of length 1 in order to plot 95% CI with multiple strata')
      } else if((length(surv.col) > 1 | surv.col == 'gg.def')[1]){
        pl + geom_step(aes(y = up, color = group), lty = lty.ci) +
          geom_step(aes(y = low, color = group), lty = lty.ci)
      } else{
        pl + geom_step(aes(y = up, lty = group), col = surv.col) +
          geom_step(aes(y = low, lty = group), col = surv.col)
      }
    } else {pl}

    pl <- if(plot.cens == T & nrow(dat.cens) > 0){
      pl + geom_point(data = dat.cens, aes(y = surv),
                      shape = cens.shape, col = cens.col)
    } else if (plot.cens == T & nrow(dat.cens) == 0){
      stop('There are no censored observations')
    } else (pl)

    pl <- if(back.white == T) {pl + theme_bw()} else (pl)
    pl
  }

  pl <- if(strata == 1) {
    ggsurv.s(s, CI, plot.cens, surv.col, cens.col, lty.est, lty.ci,
             cens.shape, back.white, xlab, ylab, main)
  } else {
    ggsurv.m(s, CI, plot.cens, surv.col, cens.col, lty.est, lty.ci,
             cens.shape, back.white, xlab, ylab, main)
  }
  pl
}

ggsurv(fit.leuk)

**Life-table methods** apply when the data are grouped.

Our goal is still to estimate the survival function, hazard, and density function, but this is complicated by the fact that we don't know exactly when during each time interval an event occurs.

**Notation:**

- the j-th time interval is \([t_{j-1},t_j)\)
- \(c_j\) - the number of censorings in the j-th interval
- \(d_j\) - the number of failures in the j-th interval
- \(r_j\) - the number *entering* the interval

We could apply the K-M formula directly. However, this approach is unsatisfactory for grouped data: it treats the problem as though it were in discrete time, with events happening only at 1 yr, 2 yr, etc. In fact, what we are trying to calculate here is the conditional probability of dying *within the interval*, given survival to the beginning of it.

We can assume that censorings occur on average *halfway* through the interval:
\[
r_j'=r_j-c_j/2
\]
The assumption yields the *Actuarial Estimator*. It is appropriate if censorings occur uniformly throughout the interval.

**Remarks:**

- Because the intervals are defined as \([t_{j-1}, t_j)\), the first interval typically starts with \(t_0 = 0\).
- `R` and `SAS` estimate the survival function at the left-hand endpoint, \(S(t_{j-1})\); `Stata` uses the right-hand one.
- The implication in `R` and `SAS` is that \(\widehat{S}(t_0)=1\).

- Midpoint \(t_{mj}=(t_j+t_{j-1})/2\)
- Width \(b_j=t_j-t_{j-1}\)
- Conditional probability of dying \(\widehat{q}_j=d_j/r_j'\)
- Conditional probability of surviving \(\widehat{p}_j=1-\widehat{q}_j\)
- Cumulative probability of surviving at \(t_j\): \(\widehat{S}(t_j)=\prod_{l\leq j}\widehat{p}_l\)
- Hazard in the j-th interval \[ \widehat{\lambda}(t_{mj})=\frac{\widehat{q}_j}{b_j(1-\widehat{q}_j/2)} \] the number of deaths in the interval divided by the average number of survivors at the midpoint
- Density at the midpoint of the j-th interval \[ \widehat{f}(t_{mj})=\frac{\widehat{S}(t_{j-1})\widehat{q}_j}{b_j} \]
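
The quantities above are easy to compute by hand; a minimal sketch with hypothetical grouped counts (three yearly intervals, 100 subjects entering):

endpts = 0:3                  # interval endpoints t_0, ..., t_3
d = c(10, 8, 5)               # deaths in each interval
cc = c(4, 6, 2)               # censorings in each interval
r = 100 - cumsum(c(0, head(d + cc, -1)))   # number entering each interval
r.adj = r - cc/2                           # actuarial adjustment r'_j
q = d / r.adj                              # cond. prob. of dying
S = cumprod(1 - q)                         # survival at right endpoints
b = diff(endpts)                           # interval widths
haz = q / (b * (1 - q/2))                  # hazard at interval midpoints
round(data.frame(r, r.adj, q, S, haz), 3)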

`R` with the `KMsurv` package requires three elements:

- a vector of \(k + 1\) interval endpoints
- a \(k\) vector of death counts in each interval
- a \(k\) vector of censored counts in each interval

Incidentally, KM in `KMsurv` stands for Klein and Moeschberger, not Kaplan-Meier!

intEndpts = 0:16   # interval endpoints
deaths = c(456, 226, 152, 171, 135, 125, 83, 74, 51, 42, 43, 34, 18, 9, 6, 0)
cens = c(0, 39, 22, 23, 24, 107, 133, 102, 68, 64, 45, 53, 33, 27, 23, 30)   # n censored
fitlt = lifetab(tis = intEndpts, ninit=2418, nlost=cens, nevent=deaths)
round(fitlt, 4)    # restrict output to 4 decimals

##       nsubs nlost  nrisk nevent   surv    pdf hazard se.surv se.pdf
## 0-1    2418     0 2418.0    456 1.0000 0.1886 0.2082  0.0000 0.0080
## 1-2    1962    39 1942.5    226 0.8114 0.0944 0.1235  0.0080 0.0060
## 2-3    1697    22 1686.0    152 0.7170 0.0646 0.0944  0.0092 0.0051
## 3-4    1523    23 1511.5    171 0.6524 0.0738 0.1199  0.0097 0.0054
## 4-5    1329    24 1317.0    135 0.5786 0.0593 0.1080  0.0101 0.0049
## 5-6    1170   107 1116.5    125 0.5193 0.0581 0.1186  0.0103 0.0050
## 6-7     938   133  871.5     83 0.4611 0.0439 0.1000  0.0104 0.0047
## 7-8     722   102  671.0     74 0.4172 0.0460 0.1167  0.0105 0.0052
## 8-9     546    68  512.0     51 0.3712 0.0370 0.1048  0.0106 0.0050
## 9-10    427    64  395.0     42 0.3342 0.0355 0.1123  0.0107 0.0053
## 10-11   321    45  298.5     43 0.2987 0.0430 0.1552  0.0109 0.0063
## 11-12   233    53  206.5     34 0.2557 0.0421 0.1794  0.0111 0.0068
## 12-13   146    33  129.5     18 0.2136 0.0297 0.1494  0.0114 0.0067
## 13-14    95    27   81.5      9 0.1839 0.0203 0.1169  0.0118 0.0065
## 14-15    59    23   47.5      6 0.1636 0.0207 0.1348  0.0123 0.0080
## 15-16    30    30   15.0      0 0.1429     NA     NA  0.0133     NA
##       se.hazard
## 0-1      0.0097
## 1-2      0.0082
## 2-3      0.0076
## 3-4      0.0092
## 4-5      0.0093
## 5-6      0.0106
## 6-7      0.0110
## 7-8      0.0135
## 8-9      0.0147
## 9-10     0.0173
## 10-11    0.0236
## 11-12    0.0306
## 12-13    0.0351
## 13-14    0.0389
## 14-15    0.0549
## 15-16        NA

The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. 'Treated' nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient's health and sending them home.

Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.

*Variables include*:

- `los` - Length of stay of a resident (in days)
- `age` - Age of a resident
- `rx` - Nursing home assignment (1: bonuses, 0: no bonuses)
- `gender` - Gender (1: male, 0: female)
- `married` - Marital status (1: married, 0: not married)
- `health` - Health status (2: second best, 5: worst)
- `censor` - Censoring indicator (1: censored, 0: discharged)

Suppose we wish to use the actuarial method, but the data *do not come grouped*.

Consider the treated nursing home patients, with length of stay (`los`) grouped into 100-day intervals:

nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
                      head=F, skip=14,
                      col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
nurshome$int = floor(nurshome$los/100)*100   # group in 100 day intervals
nurshome = nurshome[nurshome$rx==1,]         # keep only treated homes
nurshome[1:3,]                               # check out a few observations

##    los age rx gender married health censor  int
## 1   37  86  1      0       0      2      0    0
## 2   61  77  1      0       0      4      0    0
## 3 1084  75  1      0       0      3      1 1000

tab = table(nurshome$int, nurshome$censor)
intEndpts = (0:11)*100
ntotal = dim(nurshome)[1]   # nr patients
cens = tab[,2]              # nr censored in each interval
released = tab[,1]          # nr released in each interval
fitlt = lifetab(tis = intEndpts, ninit=ntotal, nlost=cens, nevent=released)
round(fitlt, 4)             # restrict output to 4 decimals

##           nsubs nlost nrisk nevent   surv    pdf hazard se.surv se.pdf
## 0-100       712     0 712.0    330 1.0000 0.0046 0.0060  0.0000  2e-04
## 100-200     382     0 382.0     86 0.5365 0.0012 0.0025  0.0187  1e-04
## 200-300     296     0 296.0     65 0.4157 0.0009 0.0025  0.0185  1e-04
## 300-400     231     0 231.0     38 0.3244 0.0005 0.0018  0.0175  1e-04
## 400-500     193     1 192.5     32 0.2711 0.0005 0.0018  0.0167  1e-04
## 500-600     160     0 160.0     13 0.2260 0.0002 0.0008  0.0157  1e-04
## 600-700     147     0 147.0     13 0.2076 0.0002 0.0009  0.0152  1e-04
## 700-800     134    30 119.0     10 0.1893 0.0002 0.0009  0.0147  0e+00
## 800-900      94    29  79.5      4 0.1734 0.0001 0.0005  0.0143  0e+00
## 900-1000     61    30  46.0      4 0.1647 0.0001 0.0009  0.0142  1e-04
## 1000-1100    27    27  13.5      0 0.1503     NA     NA  0.0147     NA
##           se.hazard
## 0-100         3e-04
## 100-200       3e-04
## 200-300       3e-04
## 300-400       3e-04
## 400-500       3e-04
## 500-600       2e-04
## 600-700       3e-04
## 700-800       3e-04
## 800-900       3e-04
## 900-1000      5e-04
## 1000-1100        NA

**Estimated Survival**

names(fitlt) # check out components of fitlt object

##  [1] "nsubs"     "nlost"     "nrisk"     "nevent"    "surv"     
##  [6] "pdf"       "hazard"    "se.surv"   "se.pdf"    "se.hazard"

x = rep(intEndpts, rep(2,12))[2:23]
y = rep(fitlt$surv, rep(2,11))

plot(x, y, type="l", col=4, xlab="Time Interval (Days)", ylab="Survival (Life Table)")
title(main = "Duration of stay in nursing homes", cex=.6)

**Estimated Hazard**

y = rep(fitlt$hazard, rep(2,11))

plot(x, y, type="l", col=6, xlab="Time Interval (Days)", ylab="Hazard (Life Table)")
title(main = "Duration of stay in nursing homes", cex=.6)

**Estimating the cumulative hazard**

\(\Lambda(t)\) can be approximated by a sum:
\[
\widehat{\Lambda}(t)=\sum_j\lambda_j\Delta
\]
where the sum is over intervals, \(\lambda_j\) is the value of the hazard in the j-th interval, and \(\Delta\) is the width of each interval. Since \(\lambda_j\Delta\) is approximately the probability of dying in the interval, we can further approximate by
\[
\widehat{\Lambda}(t)=\sum_j\frac{d_j}{r_j}
\]
It follows that \(\widehat{\Lambda}(t)\) changes only at death times, and hence we write the Nelson-Aalen estimator as:
\[
\widehat{\Lambda}_{NA}(t)=\sum_{j:\,\tau_j < t}\frac{d_j}{r_j}
\]

Once we have \(\widehat{\Lambda}_{NA}(t)\), we can also find another estimator of \(S(t)\), the Fleming-Harrington estimator:
\[
\widehat{S}_{FH}(t)=\exp\{-\widehat{\Lambda}_{NA}(t)\}
\]
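
As a sketch, both quantities can be assembled by hand from the `fit.leuk` object above, at the event times reported by `summary`:

sf = summary(fit.leuk)
Lambda.NA = cumsum(sf$n.event / sf$n.risk)   # Nelson-Aalen: sum of d_j / r_j
S.FH = exp(-Lambda.NA)                       # Fleming-Harrington estimate
round(cbind(time = sf$time, Lambda.NA, S.FH), 3)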

Say we want to obtain the Fleming-Harrington estimate of the survival function for married females, in the healthiest initial subgroup, who are randomized to the untreated group of the nursing home study.

nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
                      head=F, skip=14,
                      col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
(nurs2 = subset(nurshome, gender==0 & rx==0 & health==2 & married==1))

##      los age rx gender married health censor
## 144   89  78  0      0       1      2      0
## 362  123  75  0      0       1      2      0
## 427   38  78  0      0       1      2      0
## 601   24  82  0      0       1      2      0
## 719  185  86  0      0       1      2      0
## 736  113  73  0      0       1      2      0
## 1120  25  71  0      0       1      2      0
## 1343  14  73  0      0       1      2      0
## 1362 149  81  0      0       1      2      0
## 1461 168  72  0      0       1      2      0
## 1472  64  81  0      0       1      2      0
## 1489 234  72  0      0       1      2      0

fit.fh = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,
                 type="fleming-harrington", conf.type="log-log")
summary(fit.fh)

## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "fleming-harrington",
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    14     12       1   0.9200  0.0801       0.5244        0.989
##    24     11       1   0.8401  0.1085       0.4750        0.960
##    25     10       1   0.7601  0.1267       0.4056        0.920
##    38      9       1   0.6802  0.1388       0.3368        0.872
##    64      8       1   0.6003  0.1465       0.2718        0.819
##    89      7       1   0.5204  0.1502       0.2116        0.760
##   113      6       1   0.4405  0.1505       0.1564        0.696
##   123      5       1   0.3606  0.1472       0.1070        0.628
##   149      4       1   0.2809  0.1404       0.0641        0.556
##   168      3       1   0.2012  0.1299       0.0293        0.483
##   185      2       1   0.1221  0.1169       0.0059        0.422
##   234      1       1   0.0449     Inf       0.0000        1.000

# compare with KM:
fit.km = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,
                 type="kaplan-meier", conf.type="log-log")
summary(fit.km)

## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "kaplan-meier",
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    14     12       1   0.9167  0.0798      0.53898        0.988
##    24     11       1   0.8333  0.1076      0.48171        0.956
##    25     10       1   0.7500  0.1250      0.40842        0.912
##    38      9       1   0.6667  0.1361      0.33702        0.860
##    64      8       1   0.5833  0.1423      0.27014        0.801
##    89      7       1   0.5000  0.1443      0.20848        0.736
##   113      6       1   0.4167  0.1423      0.15247        0.665
##   123      5       1   0.3333  0.1361      0.10270        0.588
##   149      4       1   0.2500  0.1250      0.06014        0.505
##   168      3       1   0.1667  0.1076      0.02651        0.413
##   185      2       1   0.0833  0.0798      0.00505        0.311
##   234      1       1   0.0000     NaN           NA           NA

We now turn to *comparing the survival of two groups*. The hypotheses are:
\[
\begin{align*}
H_0&:\,S_1(t)=S_2(t),\,\forall t\\
H_1&:\,\exists t:\,S_1(t)\neq S_2(t)
\end{align*}
\]

The logrank test is the most well known and widely used.

It also has an intuitive appeal, building on standard methods for binary data. (Later we will see that it can also be obtained as the score test from a partial likelihood from the Cox Proportional Hazards model.)

The logrank test is obtained by constructing a 2x2 table at each distinct death time, and comparing the death rates between the two groups, conditional on the number at risk in the groups. The tables are then combined using the Cochran-Mantel-Haenszel test. Note that the logrank is sometimes called the *Cox-Mantel* test.

leuk = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/leuk.dat",
                  sep=" ", head=T, skip=7)
summary(leuk)

##      weeks           remiss           trtmt    
##  Min.   : 1.00   Min.   :0.0000   Min.   :0.0  
##  1st Qu.: 6.00   1st Qu.:0.0000   1st Qu.:0.0  
##  Median :10.50   Median :1.0000   Median :0.5  
##  Mean   :12.88   Mean   :0.7143   Mean   :0.5  
##  3rd Qu.:18.50   3rd Qu.:1.0000   3rd Qu.:1.0  
##  Max.   :35.00   Max.   :1.0000   Max.   :1.0

## Log Rank test
(leuk.lrt = survdiff(Surv(weeks,remiss) ~ trtmt, data = leuk))

## Call:
## survdiff(formula = Surv(weeks, remiss) ~ trtmt, data = leuk)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## trtmt=0 21       21     10.7      9.77      16.8
## trtmt=1 21        9     19.3      5.46      16.8
## 
##  Chisq= 16.8  on 1 degrees of freedom, p= 4.17e-05
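
To connect this output to the construction above, here is a sketch assembling the statistic from the per-death-time 2x2 tables; it should reproduce the chi-square of 16.8:

tj = sort(unique(leuk$weeks[leuk$remiss == 1]))   # distinct death times
U = V = 0
for (tk in tj) {
  r  = sum(leuk$weeks >= tk)                           # at risk, both groups
  r0 = sum(leuk$weeks >= tk & leuk$trtmt == 0)         # at risk, group 0
  d  = sum(leuk$weeks == tk & leuk$remiss == 1)        # deaths at tk
  d0 = sum(leuk$weeks == tk & leuk$remiss == 1 & leuk$trtmt == 0)
  U  = U + d0 - d * r0 / r                             # observed - expected
  V  = V + d * (r0/r) * (1 - r0/r) * (r - d) / (r - 1) # hypergeometric variance
}
U^2 / V   # logrank chi-square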

- The logrank statistic depends on *ranks* of event times only.
- It does not matter which group you choose.
- Analogous to the CMH test for a series of tables at different levels of a confounder, the logrank test is most powerful when 'odds ratios' are constant over time intervals. That is, it is most powerful for *proportional hazards*.

**Logrank test as a linear rank test**

The logrank test can be derived by assigning scores to the ranks of the death times. This is called the Peto-Peto logrank test, or the linear rank logrank test.

- *CMH-type logrank*: We motivated the logrank test through the CMH statistic for testing \(H_0:\,OR=1\) over \(K\) tables, where \(K\) is the number of distinct death times.
- *Peto-Peto (linear rank) logrank*: The linear rank version of the logrank test is based on adding up 'scores' for one of the two treatment groups. The particular scores that gave us the same logrank statistic were based on the Nelson-Aalen estimator.
- If there are no tied event times, then the two versions of the test will yield identical results. The more ties we have, the more it matters which version we use.
- The Peto-Peto (linear rank) logrank test is *not available* in `R`. Use the CMH logrank test instead.

This is the Prentice (1978) modification of the Peto & Peto (1972) modification of the Gehan (1965) modification of the Wilcoxon (1945) rank test.

## Peto-Peto-Prentice-Gehan-Wilcoxon Rank test
## notice: rho=1
(leuk.pppgw = survdiff(Surv(weeks,remiss) ~ trtmt, data = leuk, rho=1))

## Call:
## survdiff(formula = Surv(weeks, remiss) ~ trtmt, data = leuk,
##     rho = 1)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## trtmt=0 21    14.55     7.68      6.16      14.5
## trtmt=1 21     5.12    12.00      3.94      14.5
## 
##  Chisq= 14.5  on 1 degrees of freedom, p= 0.000143

leuk.km = survfit(Surv(weeks, remiss) ~ trtmt, data=leuk)

plot(leuk.km, xlab = "Weeks", ylab = "Survival", col=c(2,3))
legend(18, .95, legend=c("No treatment", "Treatment"), col=c(2:3), lty=1)
title(main="Treatment Comparison, Leukemia Data", cex=.7)

(ggkm <- ggsurv(leuk.km))

*Logrank test: CMH or linear rank?* If there are not too many ties, then it doesn't really matter.

*Logrank test or (PPP) Gehan-Wilcoxon?* This is a more important choice:

- Both tests have the correct level (Type I error) for testing the *null hypothesis* of equal survival, \(H_0:\,S_1(t)=S_2(t)\).
- The choice of which test to use depends on the *alternative hypothesis*. This drives the *power* of the test.
- The Gehan-Wilcoxon test is sensitive to *early differences* in survival between the groups.
- The logrank test is sensitive to *later differences*.
- The logrank is most powerful under the assumption of *proportional hazards*, \(\lambda_1(t) = \alpha\lambda_2(t)\), which implies an alternative in terms of the survival functions, \(H_1:\,S_1(t) = [S_2(t)]^{\alpha}\).
- The Wilcoxon has high power when the *failure times are lognormally distributed*, with equal variance in both groups but a different mean. It will turn out that this is the assumption of an accelerated failure time model.
- Both tests will lack power if the survival curves (or hazards) 'cross' (non-proportional hazards). However, that does not necessarily make them invalid!

There are more than two groups, and the question of interest is whether the groups differ from each other.

**Example**: Time taken to finish a test with 3 different noise distractions. All tests were stopped after 12 minutes.

time = c(9, 9.5, 9, 8.5, 10, 10.5, 10, 12, 12, 11, 12, 10.5, 12, 12, 12, 12, 12, 12)
event = c(1,1,1,1,1,1,1,1,0,1,1,1,1,0,0,0,0,0)
group = rep(1:3, times=c(6,6,6))
(cbind(group,time,event))

##       group time event
##  [1,]     1  9.0     1
##  [2,]     1  9.5     1
##  [3,]     1  9.0     1
##  [4,]     1  8.5     1
##  [5,]     1 10.0     1
##  [6,]     1 10.5     1
##  [7,]     2 10.0     1
##  [8,]     2 12.0     1
##  [9,]     2 12.0     0
## [10,]     2 11.0     1
## [11,]     2 12.0     1
## [12,]     2 10.5     1
## [13,]     3 12.0     1
## [14,]     3 12.0     0
## [15,]     3 12.0     0
## [16,]     3 12.0     0
## [17,]     3 12.0     0
## [18,]     3 12.0     0

(noise.lrt = survdiff(Surv(time, event) ~ group, rho=0))

## Call:
## survdiff(formula = Surv(time, event) ~ group, rho = 0)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## group=1 6        6     1.57   12.4463   17.2379
## group=2 6        5     4.53    0.0488    0.0876
## group=3 6        1     5.90    4.0660    9.4495
## 
##  Chisq= 20.4  on 2 degrees of freedom, p= 3.75e-05

noise.km = survfit(Surv(time, event) ~ group)
plot(noise.km, col = 2:4, xlab = "Minutes", ylab = "Time to finish test",
     main = "Time to finish test for 3 noise levels")

Sometimes, even though we are interested in comparing two groups (or maybe \(P\)) groups, we know there are other factors that also affect the outcome. It would be useful to adjust for these other factors in some way.

**Example**: For the nursing home data, a logrank test comparing length of stay for those under and over 85 years of age suggests a significant difference (\(p=0.03\)).
However, we know that gender has a strong association with length of stay, and also with age. Hence, it would be a good idea to *stratify* the analysis by gender when trying to assess the age effect.

A stratified logrank allows one to *compare groups*, but allows the *shapes of the hazards of the different groups to differ across strata*. It makes the assumption that the group 1 vs group 2 hazard ratio is constant across strata. In other words: \(\lambda_{1s}(t)/\lambda_{2s}(t)=\theta\) where \(\theta\) is constant over the strata (\(s =1,\ldots,S\)).

nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
                      head=F, skip=14,
                      col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
nurshome$discharged = 1 - nurshome$censor   # event indicator
nurshome$under85 = (nurshome$age < 85) + 0
(nurs.slrt = survdiff(Surv(los, discharged) ~ under85 + strata(gender), data = nurshome))

## Call:
## survdiff(formula = Surv(los, discharged) ~ under85 + strata(gender),
##     data = nurshome)
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## under85=0 689      537      560     0.926      1.73
## under85=1 912      742      719     0.721      1.73
## 
##  Chisq= 1.7  on 1 degrees of freedom, p= 0.189

- Collett: *Modelling Survival Data in Medical Research*
- Cox & Oakes: *Analysis of Survival Data*
- Kalbfleisch & Prentice: *The Statistical Analysis of Failure Time Data*
- Lee: *Statistical Methods for Survival Data Analysis*
- Fleming & Harrington: *Counting Processes and Survival Analysis*
- Hosmer & Lemeshow: *Applied Survival Analysis*
- Kleinbaum: *Survival Analysis: A Self-Learning Text*
- Klein & Moeschberger: *Survival Analysis: Techniques for Censored and Truncated Data*
- Cantor: *Extending SAS Survival Analysis Techniques for Medical Research*
- Allison: *Survival Analysis Using the SAS System*
- Jennison & Turnbull: *Group Sequential Methods with Applications to Clinical Trials*