Zero-truncated Poisson regression is used to model count data for which the value zero cannot occur.
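Under this model, the ordinary Poisson probability mass function is rescaled to exclude zero. As a minimal base-R sketch (the function name ztpois_pmf is ours, not from any package):

```r
## pmf of a zero-truncated Poisson: P(Y = k | Y > 0), obtained by dividing
## the ordinary Poisson pmf by P(Y > 0) = 1 - exp(-lambda)
ztpois_pmf <- function(k, lambda) {
  ifelse(k == 0, 0, dpois(k, lambda) / (1 - dpois(0, lambda)))
}

ztpois_pmf(0, lambda = 2)           # -> 0: zero has no mass
sum(ztpois_pmf(1:100, lambda = 2))  # sums to 1 over k >= 1
```

Note that the mean of the truncated distribution, \(\lambda / (1 - e^{-\lambda})\), is larger than \(\lambda\), since the zeros have been removed; this matters later when interpreting predicted values.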
library(VGAM)
library(boot)
library(ggplot2)
Example 1. A study of the length of hospital stay, in days, as a function of age, kind of health insurance, and whether or not the patient died while in the hospital. Length of hospital stay is recorded as a minimum of one day.
Example 2. A study of the number of journal articles published by tenured faculty as a function of discipline (fine arts, science, social science, humanities, medical, etc.). To get tenure, faculty must publish; therefore, there are no tenured faculty with zero publications.
Example 3. A study by the county traffic court on the number of tickets received by teenagers as predicted by school performance, amount of driver training and gender. Only individuals who have received at least one citation are in the traffic court files.
Let's pursue Example 1 from above.
We have a hypothetical data file, ztp.csv, with 1,493 observations. The length of hospital stay variable is stay. The variable age gives the age group from 1 to 9, which will be treated as interval in this example. The variables hmo and died are binary indicator variables for HMO-insured patients and patients who died while in the hospital, respectively.
Let's look at the data.
dat <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/ztp.csv")
dat <- within(dat, {
  hmo <- factor(hmo)
  died <- factor(died)
})
summary(dat)
##       stay             age        hmo      died   
##  Min.   : 1.000   Min.   :1.000   0:1254   0:981  
##  1st Qu.: 4.000   1st Qu.:4.000   1: 239   1:512  
##  Median : 8.000   Median :5.000                   
##  Mean   : 9.729   Mean   :5.234                   
##  3rd Qu.:13.000   3rd Qu.:6.000                   
##  Max.   :74.000   Max.   :9.000                   
Now let's look at some graphs of the data, conditional on various combinations of the variables, to get a sense of how the variables work together. We will use the ggplot2 package. First we can look at histograms of stay broken down by hmo on the rows and died on the columns. We also include the marginal distributions, so the lower right corner represents the overall histogram. We use a log base 10 scale to approximate the canonical link function of the Poisson distribution (the natural logarithm).
ggplot(dat, aes(stay)) +
  geom_histogram() +
  scale_x_log10() +
  facet_grid(hmo ~ died, margins = TRUE, scales = "free_y")
The distribution of stay appears to vary across levels of hmo and died, with shorter stays for those in HMOs (1) and shorter stays for those who died, including what seems to be an inflated number of one-day stays. To examine how stay varies across age groups, we can use conditional violin plots, which show a kernel density estimate of the distribution of stay, mirrored (hence the violin), conditional on each age group. To further understand the raw data going into each density estimate, we add the raw data on top of the violin plots with a small amount of random noise (jitter) to alleviate overplotting. Finally, to get a sense of the overall trend, we add a locally weighted regression line.
ggplot(dat, aes(factor(age), stay)) +
  geom_violin() +
  geom_jitter(size = 1.5) +
  scale_y_log10() +
  stat_smooth(aes(x = age, y = stay, group = 1), method = "loess")
ggplot(dat, aes(age, fill = died)) +
  geom_histogram(binwidth = .5, position = "fill") +
  facet_grid(hmo ~ ., margins = TRUE)
Several analysis methods could be considered for data like these; some are quite reasonable, while others have either fallen out of favor or have limitations.
To fit the zero-truncated Poisson model, we use the vglm function in the VGAM package. This function fits a very flexible class of models, called vector generalized linear models, to a wide range of assumed distributions. In our case, we believe the data are Poisson, but without zeros. Thus the values are strictly positive Poisson, for which we use the positive-Poisson family via the pospoisson function passed to vglm.
m1 <- vglm(stay ~ age + hmo + died, family = pospoisson(), data = dat)
summary(m1)
## 
## Call:
## vglm(formula = stay ~ age + hmo + died, family = pospoisson(), 
##     data = dat)
## 
## Pearson residuals:
##                 Min     1Q  Median     3Q   Max
## loge(lambda) -3.032 -1.727 -0.5868 0.9794 20.84
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.435808   0.027332  89.118  < 2e-16 ***
## age         -0.014442   0.005035  -2.868  0.00412 ** 
## hmo1        -0.135903   0.023742  -5.724 1.04e-08 ***
## died1       -0.203771   0.018373 -11.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  1
## 
## Name of linear predictor: loge(lambda)
## 
## Dispersion Parameter for pospoisson family:   1
## 
## Log-likelihood: -6908.799 on 1489 degrees of freedom
## 
## Number of iterations: 4
The output reports the number of linear predictors because vglm can fit far more complex models; for this page, there is always just one, for the expected mean, \(\lambda\), on the log scale. Now let's look at a plot of the residuals versus fitted values. We add random horizontal noise as well as 50 percent transparency to alleviate overplotting and better see where most residuals fall.
output <- data.frame(resid = resid(m1), fitted = fitted(m1))

ggplot(output, aes(fitted, resid)) +
  geom_jitter(position = position_jitter(width = .25), alpha = .5) +
  stat_smooth(method = "loess")
ggplot(output, aes(fitted, resid)) +
  geom_jitter(position = position_jitter(width = .25), alpha = .5) +
  stat_quantile(method = "rq")
output <- within(output, {
  broken <- cut(fitted, hist(fitted, plot = FALSE)$breaks)
})

ggplot(output, aes(broken, resid)) +
  geom_boxplot() +
  geom_jitter(alpha = .25)
The value of the coefficient for age, -0.0144, suggests that the log count of stay decreases by 0.0144 for each unit increase in the age group. The coefficient for hmo, -0.1359, indicates that the log count of stay for HMO patients is 0.1359 less than for non-HMO patients. Similarly, the coefficient for died, -0.2038, indicates that the log count of stay for patients who died in the hospital is 0.2038 less than for those who did not. We can get confidence intervals for the parameters and the exponentiated parameters using bootstrapping. For the Poisson model, the exponentiated parameters are incident risk ratios. We use the boot package. First, we get the coefficients from our original model to use as start values, to speed up estimation. Then we write a short function that takes data and indices as input and returns the parameters we are interested in. Finally, we pass that to the boot function and do 1,200 replicates, using snow to distribute the work across four cores. Note that you should adjust the number of cores to whatever your machine has. Also, for final results, one may wish to increase the number of replications to help ensure stable results.
dput(round(coef(m1),3))
## structure(c(2.436, -0.014, -0.136, -0.204), .Names = c("(Intercept)", 
## "age", "hmo1", "died1"))
f <- function(data, i) {
  require(VGAM)
  m <- vglm(formula = stay ~ age + hmo + died, family = pospoisson(),
    data = data[i, ], coefstart = c(2.436, -0.014, -0.136, -0.204))
  as.vector(t(coef(summary(m))[, 1:2]))
}

set.seed(10)
res <- boot(dat, f, R = 1200, parallel = "snow", ncpus = 4)

## print results
res
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = dat, statistic = f, R = 1200, parallel = "snow", 
##     ncpus = 4)
## 
## 
## Bootstrap Statistics :
##        original        bias    std. error
## t1*  2.43580824  2.123876e-04 6.962189e-02
## t2*  0.02733232  8.530996e-06 5.621009e-04
## t3* -0.01444203 -5.728615e-05 1.211602e-02
## t4*  0.00503471  2.530820e-06 9.896101e-05
## t5* -0.13590330  1.161636e-03 5.104657e-02
## t6*  0.02374183  1.343953e-05 7.496660e-04
## t7* -0.20377089 -1.486727e-03 4.984121e-02
## t8*  0.01837281  3.871559e-05 3.556392e-04
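Because the statistic function flattens the first two columns of the coefficient table with as.vector(t(...)), the odd-numbered bootstrap statistics (t1*, t3*, t5*, t7*) are the estimates and the even-numbered ones are their standard errors. A small base-R illustration of why this interleaving happens (the toy numbers are made up):

```r
## a toy two-column matrix standing in for coef(summary(m))[, 1:2]:
## one row per coefficient, columns are (Estimate, Std. Error)
cf <- cbind(Estimate = c(2.436, -0.014), `Std. Error` = c(0.027, 0.005))

## transposing and flattening walks the matrix row by row,
## giving Est1, SE1, Est2, SE2, ...
as.vector(t(cf))
# -> 2.436 0.027 -0.014 0.005
```

This is why the CI code below indexes the replicates at c(1, 3, 5, 7) for estimates and i + 1 for the matching standard error.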
Now we can get the confidence intervals for all the parameters. We start on the original scale with percentile and basic bootstrap CIs.
## basic parameter estimates with percentile and basic bootstrap CIs
parms <- t(sapply(c(1, 3, 5, 7), function(i) {
  out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "basic"))
  with(out, c(Est = t0, pLL = percent[4], pUL = percent[5],
    basicLL = basic[4], basicUL = basic[5]))
}))

## add row names
row.names(parms) <- names(coef(m1))
## print results
parms
##                     Est        pLL         pUL     basicLL     basicUL
## (Intercept)  2.43580824  2.2993325  2.57780532  2.29381116  2.57228397
## age         -0.01444203 -0.0400288  0.00979822 -0.03868227  0.01114475
## hmo1        -0.13590330 -0.2387489 -0.03807353 -0.23373306 -0.03305769
## died1       -0.20377089 -0.3053032 -0.10627717 -0.30126462 -0.10223856
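The two interval types are closely related: the percentile interval takes quantiles of the bootstrap replicates directly, while the basic (reverse-percentile) interval reflects those quantiles around the original estimate, \((2\hat\theta - q_{1-\alpha/2},\; 2\hat\theta - q_{\alpha/2})\). A self-contained sketch of both, on a toy bootstrap of a sample mean (the data and seed are arbitrary):

```r
set.seed(1)
x <- rexp(200, rate = 0.5)   # toy sample; true mean is 2
theta_hat <- mean(x)

## resample the mean 2000 times
boots <- replicate(2000, mean(sample(x, replace = TRUE)))

## percentile CI: quantiles of the replicates themselves
perc <- unname(quantile(boots, c(0.025, 0.975)))

## basic CI: reflect the percentile endpoints around the estimate
basic <- c(2 * theta_hat - perc[2], 2 * theta_hat - perc[1])

perc
basic
```

When the bootstrap distribution is roughly symmetric about the estimate, as here, the two intervals nearly coincide; larger differences between them hint at skewness in the sampling distribution.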
Now we can estimate the incident risk ratio (IRR) for the Poisson model. This is done using almost identical code as before, but passing a transformation function to the h argument of boot.ci, in this case exp, to exponentiate.
## exponentiated parameter estimates with percentile and basic bootstrap CIs
expparms <- t(sapply(c(1, 3, 5, 7), function(i) {
  out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "basic"), h = exp)
  with(out, c(Est = t0, pLL = percent[4], pUL = percent[5],
    basicLL = basic[4], basicUL = basic[5]))
}))

## add row names
row.names(expparms) <- names(coef(m1))
## print results
expparms
##                    Est       pLL        pUL    basicLL    basicUL
## (Intercept) 11.4250491 9.9675271 13.1682091  9.6818891 12.8825712
## age          0.9856618 0.9607618  1.0098464  0.9614771  1.0105617
## hmo1         0.8729270 0.7876126  0.9626422  0.7832119  0.9582415
## died1        0.8156492 0.7368999  0.8991754  0.7321230  0.8943985
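As a quick sanity check, the IRR point estimates are simply the exponentiated model coefficients: exponentiating turns additive effects on the log count into multiplicative effects on the expected count. Verifying by hand (the coefficient values below are copied from the summary(m1) output above):

```r
## coefficients from the fitted model summary above
coefs <- c(`(Intercept)` = 2.435808, age = -0.014442,
           hmo1 = -0.135903, died1 = -0.203771)

## exponentiated coefficients = incident risk ratios
round(exp(coefs), 3)
# (Intercept) 11.425, age 0.986, hmo1 0.873, died1 0.816
```

For example, exp(-0.1359) is about 0.873, so HMO patients have expected counts roughly 12.7 percent lower than non-HMO patients, holding the other predictors constant.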
The results suggest that age does not have a significant effect, but hmo and died both do. In order to better understand our results and model, let's plot some predicted values. Because all of our predictors were either categorical (hmo and died) or had a small number of unique values (age), we will get predicted values for all possible combinations. First we create a new data set using the expand.grid function, then estimate the predicted values using the predict function, and finally plot them.
newdata <- expand.grid(age = 1:9, hmo = factor(0:1), died = factor(0:1))
newdata$yhat <- predict(m1, newdata, type = "response")

ggplot(newdata, aes(x = age, y = yhat, colour = hmo)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ died)
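Note that for the positive-Poisson family, type = "response" returns the mean of the truncated distribution, \(\lambda / (1 - e^{-\lambda})\), not \(\lambda\) itself. A base-R sketch reproducing one predicted value by hand (the coefficient values are copied from the model summary above, so small rounding differences from predict() are expected):

```r
## coefficients from the fitted model summary above
b <- c(intercept = 2.435808, age = -0.014442,
       hmo1 = -0.135903, died1 = -0.203771)

## linear predictor for age group 1, non-HMO, survived
eta <- b[["intercept"]] + b[["age"]] * 1 + b[["hmo1"]] * 0 + b[["died1"]] * 0

## untruncated Poisson mean, then the zero-truncated mean
lambda <- exp(eta)
mu <- lambda / (1 - exp(-lambda))

c(lambda = lambda, mu = mu)
```

Here lambda is around 11, so the truncation adjustment is negligible and mu is barely larger than lambda; for small \(\lambda\) the truncated mean is noticeably inflated, since the zeros have been cut away.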
## function to return predicted values
fpred <- function(data, i, newdata) {
  require(VGAM)
  m <- vglm(formula = stay ~ age + hmo + died, family = pospoisson(),
    data = data[i, ], coefstart = c(2.436, -0.014, -0.136, -0.204))
  predict(m, newdata, type = "response")
}

## set seed and run bootstrap with 1,200 draws
set.seed(10)
respred <- boot(dat, fpred, R = 1200, newdata = newdata,
  parallel = "snow", ncpus = 4)

## get the bootstrapped percentile CIs
yhat <- t(sapply(1:nrow(newdata), function(i) {
  out <- boot.ci(respred, index = i, type = c("perc"))
  with(out, c(Est = t0, pLL = percent[4], pUL = percent[5]))
}))

## merge CIs with predicted values
newdata <- cbind(newdata, yhat)

## graph with CIs
ggplot(newdata, aes(x = age, y = yhat, colour = hmo, fill = hmo)) +
  geom_ribbon(aes(ymin = pLL, ymax = pUL), alpha = .25) +
  geom_point() +
  geom_line() +
  facet_wrap(~ died)