Exact logistic regression is used to model binary outcome variables in which the log odds of the outcome is modeled as a linear combination of the predictor variables. It is used when the sample size is too small for a regular logistic regression (which uses the standard maximum-likelihood-based estimator) and/or when some of the cells formed by the outcome and categorical predictor variable have no observations. The estimates given by exact logistic regression do not depend on asymptotic results.
require(elrm)
Suppose that we are interested in the factors that influence whether or not a high school senior is admitted into a very competitive engineering school. The outcome variable is binary (0/1): admit or not admit. The predictor variables of interest include student gender and whether or not the student took Advanced Placement calculus in high school. Because the response variable is binary, we need to use a model that handles 0/1 outcome variables correctly. Also, because of the number of students involved is small, we will need a procedure that can perform the estimation with a small sample size.
The data for this exact logistic data analysis include the number of students admitted, the total number of applicants broken down by gender (the variable female
), and whether or not they had taken AP calculus (the variable apcalc
). Since the dataset is so small, we will read it in directly.
dat <- read.table(text = " female apcalc admit num 0 0 0 7 0 0 1 1 0 1 0 3 0 1 1 7 1 0 0 5 1 0 1 1 1 1 0 0 1 1 1 6", header = TRUE)
num
variable indicates frequency weight. We use this to expand the dataset and then look at some frequency tables.
## expand dataset by repeating each row num times and drop the num ## variable dat <- dat[rep(1:nrow(dat), dat$num), -4]
## look at various tables xtabs(~female + apcalc, data = dat)
## apcalc ## female 0 1 ## 0 8 10 ## 1 6 6
xtabs(~female + admit, data = dat)
## admit ## female 0 1 ## 0 10 8 ## 1 5 7
xtabs(~apcalc + admit, data = dat)
## admit ## apcalc 0 1 ## 0 12 2 ## 1 3 13
xtabs(~female + apcalc + admit, data = dat)
## , , admit = 0 ## ## apcalc ## female 0 1 ## 0 7 3 ## 1 5 0 ## ## , , admit = 1 ## ## apcalc ## female 0 1 ## 0 1 7 ## 1 1 6
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.
Let's run an (approximate) exact logistic analysis using the elrm
command in the elrm
package. This is based on MCMC sampling. It requires a collapsed data set with number of trials and number of successes, so we make that first.
(x <- xtabs(~admit + interaction(female, apcalc), data = dat))
## interaction(female, apcalc) ## admit 0.0 1.0 0.1 1.1 ## 0 7 5 3 0 ## 1 1 1 7 6
# view cross tabs
(cdat <- cdat <- data.frame(female = rep(0:1, 2), apcalc = rep(0:1, each = 2), admit = x[2, ], ntrials = colSums(x)))
## female apcalc admit ntrials ## 0.0 0 0 1 8 ## 1.0 1 0 1 6 ## 0.1 0 1 7 10 ## 1.1 1 1 6 6
# view collapsed data set
elrm
and MCMC sampling. We will do 22,000 iterations with a 2,000 burnin for a final chain of 20,000. Note that for the combined model of female and apcalc, we use a chain of 5 million. This is because for inference, each effect needs at least 1,000, but because the conditional joint distribution is degenerate, for the female effect the ratio of useable trials is low, meaning that to achieve over 1,000, the total iterations must be extremely high.
## model with female predictor only m.female <- elrm(formula = admit/ntrials ~ female, interest = ~female, iter = 22000, dataset = cdat, burnIn = 2000)
## Progress: 0% Progress: 5% Progress: 10% Progress: 15% Progress: 20% Progress: 25% Progress: 30% Progress: 35% Progress: 40% Progress: 45% Progress: 50% Progress: 55% Progress: 60% Progress: 65% Progress: 70% Progress: 75% Progress: 80% Progress: 85% Progress: 90% Progress: 95% Progress: 100%
## summary of model including estimates and CIs summary(m.female)
## ## Call: ## [[1]] ## elrm(formula = admit/ntrials ~ female, interest = ~female, iter = 22000, ## dataset = cdat, burnIn = 2000) ## ## ## Results: ## estimate p-value p-value_se mc_size ## female 0.56478 0.484 0.00284 20000 ## ## 95% Confidence Intervals for Parameters ## ## lower upper ## female -1.129697 2.59968
## trace plot and histogram of sampled values from the sufficient ## statistic plot(m.female$mc, col = "grey")
## model with apcalc predictor only m.apcalc <- elrm(formula = admit/ntrials ~ apcalc, interest = ~apcalc, iter = 22000, dataset = cdat, burnIn = 2000)
## Progress: 0% Progress: 5% Progress: 10% Progress: 15% Progress: 20% Progress: 25% Progress: 30% Progress: 35% Progress: 40% Progress: 45% Progress: 50% Progress: 55% Progress: 60% Progress: 65% Progress: 70% Progress: 75% Progress: 80% Progress: 85% Progress: 90% Progress: 95% Progress: 100%
## Warning: 'apcalc' observed value of the sufficient statistic was not ## sampled
## summary of model including estimates and CIs summary(m.apcalc)
## ## Call: ## [[1]] ## elrm(formula = admit/ntrials ~ apcalc, interest = ~apcalc, iter = 22000, ## dataset = cdat, burnIn = 2000) ## ## ## Results: ## estimate p-value p-value_se mc_size ## apcalc NA 0 0 20000 ## ## 95% Confidence Intervals for Parameters ## ## lower upper ## apcalc NA NA
## trace plot and histogram of sampled values from the sufficient ## statistic plot(m.apcalc$mc, col = "grey")
## run not automated for time purposes results <- elrm(formula = admit/ntrials ~ female + apcalc, interest = ~ female + apcalc, iter = 5005000, dataset = cdat, burnIn = 5000, r = 2)
## Progress: 0% Progress: 5% Progress: 10% Progress: 15% Progress: 20% Progress: 25% Progress: 30% Progress: 35% Progress: 40% Progress: 45% Progress: 50% Progress: 55% Progress: 60% Progress: 65% Progress: 70% Progress: 75% Progress: 80% Progress: 85% Progress: 90% Progress: 95% Progress: 100%
summary(results)
## ## Call: ## [[1]] ## elrm(formula = admit/ntrials ~ female + apcalc, interest = ~female + ## apcalc, r = 2, iter = 5005000, dataset = cdat, burnIn = 5000) ## ## ## Results: ## estimate p-value p-value_se mc_size ## joint NA 0.00047 0.00001 5000000 ## female 1.27347 0.34629 0.01267 1646 ## apcalc 3.95707 0.00031 0.00002 1116420 ## ## 95% Confidence Intervals for Parameters ## ## lower upper ## female -1.157186 5.201046 ## apcalc 1.107982 8.762402
female
and apcalc
, it is the p-value for testing that the individual parameter estimate is zero. Next is the Monte Carlo standard errors for the p-value. Finally, mc_size is the length of the Markov chain of sampled values of sufficient statistics used for each parameter estimate. Note that all the length of the chain for the joint test was 5 million, a mere 1,739 for the female parameter. Because each Markov chain needs to be sufficiently long for stable inference, we needed to increase the total chain size to stably estimate female. For apcalc, the chain length is larger than necessary (over 1 million).