Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.
library(foreign)
library(nnet)
library(ggplot2)
library(reshape2)
Example 1. People's occupational choices might be influenced by their parents' occupations and their own education level. We can study the relationship of one's occupation choice with education level and father's occupation. The occupational choices will be the outcome variable which consists of categories of occupations.
Example 2. A biologist may be interested in food choices that alligators make. Adult alligators might have different preferences from young ones. The outcome variable here will be the types of food, and the predictor variables might be size of the alligators and other environmental variables.
Example 3. Entering high school students make program choices among general program, vocational program and academic program. Their choice might be modeled using their writing score and their social economic status.
For our data analysis example, we will expand the third example using the hsbdemo data set. Let's first read in the data.
ml <- read.dta("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/hsbdemo.dta")
The outcome variable is prog, program type. The predictor variables are social economic status, ses, a three-level categorical variable, and writing score, write, a continuous variable. Let's start with getting some descriptive statistics of the variables of interest.
with(ml, table(ses, prog))
##         prog
## ses      general academic vocation
##   low         16       19       12
##   middle      20       44       31
##   high         9       42        7
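The raw counts above are hard to compare across rows because the ses groups have different sizes. As a small complementary step (not part of the original output), we can rebuild the table from the counts shown and convert it to within-ses proportions; running with(ml, prop.table(table(ses, prog), margin = 1)) on the data itself should give the same result.

```r
## rebuild the ses-by-prog counts shown above and convert each row
## to proportions (margin = 1 normalizes rows to sum to 1)
tab <- matrix(c(16, 19, 12,
                20, 44, 31,
                 9, 42,  7),
              nrow = 3, byrow = TRUE,
              dimnames = list(ses  = c("low", "middle", "high"),
                              prog = c("general", "academic", "vocation")))
round(prop.table(tab, margin = 1), 3)
```

For example, 42 of the 58 high-ses students (about 72%) chose the academic program, versus about 40% of the low-ses students.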
with(ml, do.call(rbind, tapply(write, prog, function(x) c(M = mean(x), SD = sd(x)))))
##                 M       SD
## general  51.33333 9.397775
## academic 56.25714 7.943343
## vocation 46.76000 9.318754
Below we use the multinom function from the nnet package to estimate a multinomial logistic regression model. There are other functions in other R packages capable of multinomial regression. We chose the multinom function because it does not require the data to be reshaped (as the mlogit package does) and because it mirrors the example code found in Hilbe's Logistic Regression Models.
Before running our model, we choose the level of the outcome that we wish to use as our baseline and specify it in the relevel function. Then we run our model using multinom. The multinom package does not include p-value calculation for the regression coefficients, so we compute p-values using Wald tests (here, z-tests).
ml$prog2 <- relevel(ml$prog, ref = "academic")
test <- multinom(prog2 ~ ses + write, data = ml)
## # weights:  15 (8 variable)
## initial  value 219.722458
## iter  10 value 179.982880
## final  value 179.981726
## converged
summary(test)
## Call:
## multinom(formula = prog2 ~ ses + write, data = ml)
##
## Coefficients:
##          (Intercept)  sesmiddle    seshigh      write
## general     2.852198 -0.5332810 -1.1628226 -0.0579287
## vocation    5.218260  0.2913859 -0.9826649 -0.1136037
##
## Std. Errors:
##          (Intercept) sesmiddle   seshigh      write
## general     1.166441 0.4437323 0.5142196 0.02141097
## vocation    1.163552 0.4763739 0.5955665 0.02221996
##
## Residual Deviance: 359.9635
## AIC: 375.9635
(z <- summary(test)$coefficients/summary(test)$standard.errors)
##          (Intercept)  sesmiddle   seshigh     write
## general     2.445214 -1.2018081 -2.261334 -2.705562
## vocation    4.484769  0.6116747 -1.649967 -5.112689
(p <- (1 - pnorm(abs(z), 0, 1)) * 2)
##           (Intercept) sesmiddle    seshigh        write
## general  0.0144766100 0.2294379 0.02373856 6.818902e-03
## vocation 0.0000072993 0.5407530 0.09894976 3.176045e-07
The model summary output has a block of coefficients and a block of standard errors. Each of these blocks has one row of values corresponding to a model equation. Focusing on the block of coefficients, we can look at the first row comparing prog = "general" to our baseline prog = "academic", and the second row comparing prog = "vocation" to our baseline prog = "academic". If we consider our coefficients from the first row to be \(\beta_{1}\) and those from the second row to be \(\beta_{2}\), we can write our model equations:
\[
\log\frac{\mathsf{P}[prog=general]}{\mathsf{P}[prog=academic]}=\beta_{10}+\beta_{11}(ses=2)+\beta_{12}(ses=3)+\beta_{13}write\\
\log\frac{\mathsf{P}[prog=vocation]}{\mathsf{P}[prog=academic]}=\beta_{20}+\beta_{21}(ses=2)+\beta_{22}(ses=3)+\beta_{23}write
\]
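To see how these two equations turn into probabilities, note that the inverse link is a softmax over the three categories, with the baseline's linear predictor fixed at 0. As a sketch, we can plug the fitted coefficients reported above in by hand for a student with ses = "low" and write = 35; up to coefficient rounding, this reproduces the first row of the fitted probabilities shown later, so observation 1 evidently has those predictor values.

```r
## coefficients from summary(test): one row per non-baseline equation,
## baseline category academic has linear predictor 0
b_general  <- c(2.852198, -0.5332810, -1.1628226, -0.0579287)
b_vocation <- c(5.218260,  0.2913859, -0.9826649, -0.1136037)

## design vector: intercept, sesmiddle = 0, seshigh = 0, write = 35
x <- c(1, 0, 0, 35)

eta <- c(academic = 0,
         general  = sum(b_general * x),
         vocation = sum(b_vocation * x))
p <- exp(eta) / sum(exp(eta))   # softmax over the three categories
round(p, 4)
##  academic   general  vocation
##    0.1483    0.3382    0.5135
```

The three probabilities necessarily sum to one, which is why only two equations are needed for a three-category outcome.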
- The relative log odds of being in general program versus academic program will decrease by 1.163 if moving from ses="low" to ses="high" (\(\beta_{12}\)).
- The relative log odds of being in general program versus academic program will decrease by 0.533 if moving from ses="low" to ses="middle" (\(\beta_{11}\)), although this coefficient is not significant.
- The relative log odds of being in vocation program versus academic program will decrease by 0.983 if moving from ses="low" to ses="high" (\(\beta_{22}\)).
- The relative log odds of being in vocation program versus academic program will increase by 0.291 if moving from ses="low" to ses="middle" (\(\beta_{21}\)), although this coefficient is not significant.

The ratio of the probability of choosing one outcome category over the probability of choosing the baseline category is often referred to as the relative risk (and also sometimes as the odds, as we did when describing the regression parameters above). The relative risk is the exponentiated right-hand side of the linear equation, so the exponentiated regression coefficients are relative risk ratios for a unit change in the predictor variable. We can exponentiate the coefficients from our model to see these risk ratios.
## extract the coefficients from the model and exponentiate
exp(coef(test))
##          (Intercept) sesmiddle   seshigh     write
## general     17.32582 0.5866769 0.3126026 0.9437172
## vocation   184.61262 1.3382809 0.3743123 0.8926116
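These point estimates say nothing about their precision. A minimal sketch of a Wald 95% confidence interval for one of the relative risk ratios, using the seshigh estimate and standard error from the general equation reported in summary(test) above (values hard-coded here for illustration):

```r
## Wald 95% CI for the relative risk ratio of ses = "high" (vs. "low")
## in the general-vs-academic equation: exponentiate the interval
## computed on the log-odds scale
b  <- -1.1628226    # coefficient (general row, seshigh column)
se <-  0.5142196    # standard error (same position)
ci <- exp(b + c(-1, 1) * qnorm(0.975) * se)
round(ci, 3)
## 0.114 0.856
```

Since the interval excludes 1, it agrees with the small Wald p-value (.0237) for this coefficient computed earlier.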
The relative risk ratio for switching from ses = 1 to 3 is .3126 for being in the general program versus the academic program. You can also use predicted probabilities to help you understand the model. We can calculate the predicted probabilities for each of our outcome levels using the fitted function, starting by generating the predicted probabilities for the observations in our dataset and viewing the first few rows.
head(pp <- fitted(test))
##    academic   general  vocation
## 1 0.1482764 0.3382454 0.5134781
## 2 0.1202017 0.1806283 0.6991700
## 3 0.4186747 0.2368082 0.3445171
## 4 0.1726885 0.3508384 0.4764731
## 5 0.1001231 0.1689374 0.7309395
## 6 0.3533566 0.2377976 0.4088458
To examine the changes in predicted probability associated with our variables, we can create a small dataset, holding write at its mean and examining the predicted probabilities for each level of ses.
dses <- data.frame(ses = c("low", "middle", "high"), write = mean(ml$write))
predict(test, newdata = dses, "probs")
##    academic   general  vocation
## 1 0.4396845 0.3581917 0.2021238
## 2 0.4777488 0.2283353 0.2939159
## 3 0.7009007 0.1784939 0.1206054
Another way to understand the model using the predicted probabilities is to look at the averaged predicted probabilities for different values of the continuous predictor variable write within each level of ses.
dwrite <- data.frame(ses = rep(c("low", "middle", "high"), each = 41), write = rep(c(30:70), 3))

## store the predicted probabilities for each value of ses and write
pp.write <- cbind(dwrite, predict(test, newdata = dwrite, type = "probs", se = TRUE))

## calculate the mean probabilities within each level of ses
by(pp.write[, 3:5], pp.write$ses, colMeans)
## pp.write$ses: high
##  academic   general  vocation
## 0.6164315 0.1808037 0.2027648
## --------------------------------------------------------
## pp.write$ses: low
##  academic   general  vocation
## 0.3972977 0.3278174 0.2748849
## --------------------------------------------------------
## pp.write$ses: middle
##  academic   general  vocation
## 0.4256198 0.2010864 0.3732938
Using the predictions we generated for the pp.write object above, we can plot the predicted probabilities against the writing score by the level of ses for different levels of the outcome variable.
## melt data set to long for ggplot2
lpp <- melt(pp.write, id.vars = c("ses", "write"), value.name = "probability")
head(lpp)  # view first few rows
##   ses write variable probability
## 1 low    30 academic  0.09843588
## 2 low    31 academic  0.10716868
## 3 low    32 academic  0.11650390
## 4 low    33 academic  0.12645834
## 5 low    34 academic  0.13704576
## 6 low    35 academic  0.14827643
## plot predicted probabilities across write values for each level of ses,
## facetted by program type
ggplot(lpp, aes(x = write, y = probability, colour = ses)) +
  geom_line() +
  facet_grid(variable ~ ., scales = "free")
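As a final sanity check (not part of the original walkthrough), the predicted classes can be cross-tabulated against the observed program types; this sketch assumes the ml data and the fitted test model from above. predict on a multinom fit returns the most probable category by default.

```r
## in-sample confusion table: most probable program vs. observed program
## (assumes the ml data frame and fitted multinom object `test` from above)
ctab <- table(predicted = predict(test), observed = ml$prog2)
ctab

## overall proportion of students classified correctly
sum(diag(ctab)) / sum(ctab)
```

Keep in mind this is in-sample accuracy, so it will tend to be optimistic relative to performance on new data.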