Zero-inflated negative binomial regression is for modeling *count variables* with *excessive zeros* and it is usually for *overdispersed count outcome* variables. Furthermore, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently.

require(pscl)

library(MASS) library(boot) library(ggplot2)

**Example 1.**School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.**Example 2.**The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish.

A zero-inflated model assumes that zero outcome is due to two different processes. For instance, in the example of fishing presented here, the two processes are that a subject has gone fishing vs. not gone fishing. If not gone fishing, the only outcome possible is zero. If gone fishing, it is then a count process. The two parts of the a zero-inflated model are a binary model, usually a logit model to model which of the two processes the zero outcome is associated with and a count model, in this case, a negative binomial model, to model the count process.

The zero-inflated negative binomial (ZINB) model is based on the *negative binomial model* with *quadratic variance function* (p=2), i.e., *NB2*. The ZINB model is obtained by specifying a negative binomial distribution for the data generation process referred to earlier as Process 2:
\[
g(y_i)=\frac{\Gamma(y_i+\alpha^{-1})}{y_i!\Gamma(\alpha^{-1})}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu_i}\right)^{\alpha^{-1}}\left(\frac{\mu_i}{\alpha^{-1}+\mu_i}\right)^{y_i},\quad y_i=0,1,2,\ldots
\]
Thus the ZINB model is defined to be
\[
\begin{align*}
\mathsf{P}[Y_i=0|{\bf x}_i,{\bf z}_i]&=F({\bf z}_i^{\top}{\boldsymbol\gamma})+(1-F({\bf z}_i^{\top}{\boldsymbol\gamma}))(1+\alpha\mu_i)^{-\alpha^{-1}},\\
\mathsf{P}[Y_i|{\bf x}_i,{\bf z}_i]&=(1-F({\bf z}_i^{\top}{\boldsymbol\gamma}))\frac{\Gamma(y_i+\alpha^{-1})}{y_i!\Gamma(\alpha^{-1})}\left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu_i}\right)^{\alpha^{-1}}\left(\frac{\mu_i}{\alpha^{-1}+\mu_i}\right)^{y_i},\quad y_i>0.
\end{align*}
\]
In this case, the conditional expectation and conditional variance of \(Y_i\) are
\[
\begin{align*}
\mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i]&=\mu_i(1-F({\bf z}_i^{\top}{\boldsymbol\gamma})),\\
\mathsf{Var}[Y_i|{\bf x}_i,{\bf z}_i]&=\mathsf{E}[Y_i|{\bf x}_i,{\bf z}_i](1+\mu_i(F({\bf z}_i^{\top}{\boldsymbol\gamma}+\alpha))).
\end{align*}
\]
As with the ZIP model, the ZINB model exhibits overdispersion because the conditional variance exceeds the conditional mean.

Finally, note that R does not estimate \(\alpha\) but \(\theta\), the inverse of \(\alpha\).

Let's pursue Example 2 from above.

We have data on 250 groups that went to a park. Each group was questioned about how many fish they caught (`count`

), how many children were in the group (`child`

), how many people were in the group (`persons`

), and whether or not they brought a camper to the park (`camper`

).

In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the probability that a group caught zero fish. We will use the variables `child`

, `persons`

, and `camper`

in our model. Let's look at the data.

zinb <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/fish.csv") zinb <- within(zinb, { nofish <- factor(nofish) livebait <- factor(livebait) camper <- factor(camper) }) summary(zinb)

## nofish livebait camper persons child ## 0:176 0: 34 0:103 Min. :1.000 Min. :0.000 ## 1: 74 1:216 1:147 1st Qu.:2.000 1st Qu.:0.000 ## Median :2.000 Median :0.000 ## Mean :2.528 Mean :0.684 ## 3rd Qu.:4.000 3rd Qu.:1.000 ## Max. :4.000 Max. :3.000 ## xb zg count ## Min. :-3.275050 Min. :-5.6259 Min. : 0.000 ## 1st Qu.: 0.008267 1st Qu.:-1.2527 1st Qu.: 0.000 ## Median : 0.954550 Median : 0.6051 Median : 0.000 ## Mean : 0.973796 Mean : 0.2523 Mean : 3.296 ## 3rd Qu.: 1.963855 3rd Qu.: 1.9932 3rd Qu.: 2.000 ## Max. : 5.352674 Max. : 4.2632 Max. :149.000

## histogram with x axis in log10 scale ggplot(zinb, aes(count, fill = camper)) + geom_histogram() + scale_x_log10() + facet_grid(camper ~ ., margins=TRUE, scales="free_y")

Before we show how you can analyze this with a zero-inflated negative binomial analysis, let's consider some other methods that you might use.

*OLS Regression*- You could try to analyze these data using OLS regression. However, count data are highly non-normal and are not well estimated by OLS regression.*Zero-inflated Poisson Regression*- Zero-inflated Poisson regression does better when the data is not overdispersed, i.e. when variance is not much larger than the mean.*Ordinary Count Models*- Poisson or negative binomial models might be more appropriate if there are not excess zeros.

Now let's build up our model. We are going to use the variables `child`

and `camper`

to model the count in the part of negative binomial model and the variable `persons`

in the logit part of the model. We use the `pscl`

to run a zero-inflated negative binomial regression. We begin by estimating the model with the variables of interest.

m1 <- zeroinfl(count ~ child + camper | persons, data = zinb, dist = "negbin", EM = TRUE) summary(m1)

## ## Call: ## zeroinfl(formula = count ~ child + camper | persons, data = zinb, ## dist = "negbin", EM = TRUE) ## ## Pearson residuals: ## Min 1Q Median 3Q Max ## -0.5861 -0.4617 -0.3886 -0.1974 18.0129 ## ## Count model coefficients (negbin with log link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.3711 0.2561 5.353 8.63e-08 *** ## child -1.5152 0.1956 -7.747 9.42e-15 *** ## camper1 0.8790 0.2693 3.264 0.0011 ** ## Log(theta) -0.9854 0.1759 -5.600 2.14e-08 *** ## ## Zero-inflation model coefficients (binomial with logit link): ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.6028 0.8363 1.916 0.0553 . ## persons -1.6663 0.6790 -2.454 0.0141 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Theta = 0.3733 ## Number of iterations in BFGS optimization: 2 ## Log-likelihood: -432.9 on 6 Df

Below the model call, you will find a block of output containing negative binomial regression coefficients for each of the variables along with standard errors, z-scores, and p-values for the coefficients. A second block follows that corresponds to the inflation model. This includes logit coefficients for predicting excess zeros along with their standard errors, z-scores, and p-values.

All of the predictors in both the count and inflation portions of the model are statistically significant. This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can compare with the current model to a null model without predictors using chi-squared test on the difference of log likelihoods.

m0 <- update(m1, . ~ 1) pchisq(2 * (logLik(m1) - logLik(m0)), df = 3, lower.tail = FALSE)

## 'log Lik.' 1.280471e-13 (df=6)

Note that the model output above does not indicate in any way if our zero-inflated model is an improvement over a standard negative binomial regression. We can determine this by running the corresponding standard negative binomial model and then performing a Vuong test of the two models. We use the `MASS`

package to run the standard negative binomial regression.

summary(m2 <- glm.nb(count ~ child + camper, data = zinb))

## ## Call: ## glm.nb(formula = count ~ child + camper, data = zinb, init.theta = 0.2552931119, ## link = log) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.3141 -1.0361 -0.7266 -0.1720 4.0163 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.0727 0.2425 4.424 9.69e-06 *** ## child -1.3753 0.1958 -7.025 2.14e-12 *** ## camper1 0.9094 0.2836 3.206 0.00135 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for Negative Binomial(0.2553) family taken to be 1) ## ## Null deviance: 258.93 on 249 degrees of freedom ## Residual deviance: 201.89 on 247 degrees of freedom ## AIC: 887.42 ## ## Number of Fisher Scoring iterations: 1 ## ## ## Theta: 0.2553 ## Std. Err.: 0.0329 ## ## 2 x log-likelihood: -879.4210

vuong(m1, m2)

## Vuong Non-Nested Hypothesis Test-Statistic: -3.819662 ## (test-statistic is asymptotically distributed N(0,1) under the ## null that the models are indistinguishible) ## in this case: ## model2 > model1, with p-value 6.6817e-05

- The predictors
`child`

and`camper`

in the part of the negative binomial regression model predicting number of fish caught (`count`

) are both significant predictors. - The predictor
`person`

in the part of the logit model predicting excessive zeros is statistically significant. - For these data, the expected change in log(
`count`

) for a one-unit increase in`child`

is -1.515255 holding other variables constant. - A camper (
`camper`

= 1) has an expected log(`count`

) of 0.879051 higher than that of a non-camper (`camper`

= 0) holding other variables constant. - The log odds of being an excessive zero would decrease by 1.67 for every additional person in the group. In other words, the more people in the group the less likely that the zero would be due to not gone fishing. Put plainly, the larger the group the person was in, the more likely that the person went fishing.
- The Vuong test suggests that the zero-inflated negative binomial model is a significant improvement over a standard negative binomial model.

We can get confidence intervals for the parameters and the exponentiated parameters using bootstrapping. For the negative binomial model, these would be incident risk ratios, for the zero inflation model, odds ratios. We use the `boot`

package. First, we get the coefficients from our original model to use as start values for the model to speed up the time it takes to estimate. Then we write a short function that takes data and indices as input and returns the parameters we are interested in. Finally, we pass that to the `boot`

function and do 1200 replicates, using snow to distribute across four cores. Note that you should adjust the number of cores to whatever your machine has. Also, for final results, one may wish to increase the number of replications to help ensure stable results.

dput(round(coef(m1, "count"), 4))

## structure(c(1.3711, -1.5152, 0.879), .Names = c("(Intercept)", ## "child", "camper1"))

dput(round(coef(m1, "zero"), 4))

## structure(c(1.6028, -1.6663), .Names = c("(Intercept)", "persons" ## ))

f <- function(data, i) { require(pscl) m <- zeroinfl(count ~ child + camper | persons, data = data[i, ], dist = "negbin", start = list(count = c(1.3711, -1.5152, 0.879), zero = c(1.6028, -1.6663))) as.vector(t(do.call(rbind, coef(summary(m)))[, 1:2])) } set.seed(10) (res <- boot(zinb, f, R = 1200, parallel = "snow", ncpus = 4))

## ## ORDINARY NONPARAMETRIC BOOTSTRAP ## ## ## Call: ## boot(data = zinb, statistic = f, R = 1200, parallel = "snow", ## ncpus = 4) ## ## ## Bootstrap Statistics : ## original bias std. error ## t1* 1.3710504 -0.083293503 0.39439368 ## t2* 0.2561136 -0.002616572 0.03190918 ## t3* -1.5152609 -0.061393126 0.26877799 ## t4* 0.1955916 0.006027950 0.02026508 ## t5* 0.8790522 0.091642830 0.47150032 ## t6* 0.2692734 0.001873325 0.01998125 ## t7* -0.9853566 0.080190346 0.21907572 ## t8* 0.1759504 0.002580684 0.01689353 ## t9* 1.6031354 0.473483335 1.59330502 ## t10* 0.8365225 3.767327930 15.65780310 ## t11* -1.6665917 -0.462324216 1.56790036 ## t12* 0.6793077 3.771998568 15.69674577

`zeroinfl`

.
Now we can get the confidence intervals for all the parameters. We start on the original scale with percentile and bias adjusted CIs. We also compare these results with the regular confidence intervals based on the standard errors.

## basic parameter estimates with percentile and bias adjusted CIs parms <- t(sapply(c(1, 3, 5, 7, 9), function(i) { out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca")) with(out, c(Est = t0, pLL = percent[4], pUL = percent[5], bcaLL = bca[4], bcaLL = bca[5])) })) ## add row names row.names(parms) <- names(coef(m1)) ## print results parms

## Est pLL pUL bcaLL bcaLL ## count_(Intercept) 1.3710504 0.5132187 2.0539932 0.7563477 2.3268123 ## count_child -1.5152609 -2.1495185 -1.0807496 -1.9866681 -0.9827549 ## count_camper1 0.8790522 0.1119459 1.8628376 -0.2267101 1.6676446 ## zero_(Intercept) -0.9853566 -1.3142565 -0.4465283 -1.4471843 -0.6407219 ## zero_persons 1.6031354 0.3519777 8.0795923 -0.1857838 3.6867011

## compare with normal based approximation confint(m1)

## 2.5 % 97.5 % ## count_(Intercept) 0.86910572 1.8730573 ## count_child -1.89860155 -1.1318879 ## count_camper1 0.35126797 1.4068178 ## zero_(Intercept) -0.03635521 3.2418992 ## zero_persons -2.99701357 -0.3354987

Now we can estimate the incident risk ratio (IRR) for the negative binomial model and odds ratio (OR) for the logistic (zero inflation) model. This is done using almost identical code as before, but passing a transformation function to the `h`

argument of `boot.ci`

, in this case, `exp`

to exponentiate.

## exponentiated parameter estimates with percentile and bias adjusted CIs expparms <- t(sapply(c(1, 3, 5, 7, 9), function(i) { out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca"), h = exp) with(out, c(Est = t0, pLL = percent[4], pUL = percent[5], bcaLL = bca[4], bcaLL = bca[5])) })) ## add row names row.names(expparms) <- names(coef(m1)) ## print results expparms

## Est pLL pUL bcaLL bcaLL ## count_(Intercept) 3.9394864 1.6706640 7.7989834 2.1304808 10.2452309 ## count_child 0.2197509 0.1165403 0.3393411 0.1371516 0.3742786 ## count_camper1 2.4086158 1.1184525 6.4420052 0.7971518 5.2996704 ## zero_(Intercept) 0.3733061 0.2686740 0.6398457 0.2352317 0.5269119 ## zero_persons 4.9685866 1.4218768 3227.9195721 0.8304531 39.9129615

`expand.grid`

function to create all combinations and then the `predict`

function to do it. Finally we create a graph.
newdata1 <- expand.grid(0:3, factor(0:1), 1:4) colnames(newdata1) <- c("child", "camper", "persons") newdata1$phat <- predict(m1, newdata1) ggplot(newdata1, aes(x = child, y = phat, colour = factor(persons))) + geom_point() + geom_line() + facet_wrap(~camper) + labs(x = "Number of Children", y = "Predicted Fish Caught")

Here are some issues that you may want to consider in the course of your research analysis.

- Question about the over-dispersion parameter is in general a tricky one. A large over-dispersion parameter could be due to a miss-specified model or could be due to a real process with over-dispersion. Adding an over-dispersion problem does not necessarily improve a miss-specified model.
- The zero inflated negative binomial model has two parts, a negative binomial count model and the logit model for predicting excess zeros, so you might want to review the Negative Binomial Regression and Logit Regression.
- Since zero inflated negative binomial has both a count model and a logit model, each of the two models should have good predictors. The two models do not necessarily need to use the same predictors.
- Problems of perfect prediction, separation or partial separation can occur in the logistic part of the zero-inflated model.
- Count data often use exposure variable to indicate the number of times the event could have happened. You can incorporate exposure (also called an offset) into your model by using the
`offset()`

function. - It is not recommended that zero-inflated negative binomial models be applied to small samples. What constitutes a small sample does not seem to be clearly defined in the literature.
- Pseudo-R-squared values differ from OLS R-squareds.

- UCLA: IDRE (Institute for Digital Research and Education). Data Analysis Examples. from http://www.ats.ucla.edu/stat/dae/ (accessed January 31, 2014)
- SAS 9.3 (2013). PROC COUNTREG help page. SAS Institute, Cary NC.
- Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.