Suppose the source data contain the listing of individual observations
dim(nels)
## [1] 13580 9
str(nels)
## 'data.frame': 13580 obs. of 9 variables:
## $ ses : Factor w/ 4 levels "1","2","3","4": 2 1 2 2 2 2 3 3 3 3 ...
## $ sesmed : Factor w/ 2 levels "Below med.","Over med.": 1 1 1 1 1 1 2 2 2 2 ...
## $ parents: Factor w/ 2 levels "Two_parents",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ foreign: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ fa.educ: Factor w/ 3 levels "Elementary","High",..: 2 1 2 2 1 1 2 3 2 2 ...
## $ mo.educ: Factor w/ 3 levels "Elementary","High",..: 2 3 2 2 3 2 2 2 2 2 ...
## $ region : Factor w/ 4 levels "Northeast","Midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ fa.wrk : Factor w/ 2 levels "Working","Not_working": 1 1 1 1 1 1 1 1 1 1 ...
## $ mo.wrk : Factor w/ 2 levels "Working","Not_working": 1 1 1 1 1 2 1 1 1 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:13814] 2 5 6 9 10 11 12 19 21 23 ...
## .. ..- attr(*, "names")= chr [1:13814] "2" "5" "6" "9" ...
head(nels)
## ses sesmed parents foreign fa.educ mo.educ region fa.wrk mo.wrk
## 1 2 Below med. Two_parents No High High Northeast Working Working
## 3 1 Below med. Two_parents No Elementary College Northeast Working Working
## 4 2 Below med. Two_parents No High High Northeast Working Working
## 7 2 Below med. Two_parents No High High Northeast Working Working
## 8 2 Below med. Two_parents No Elementary College Northeast Working Working
## 13 2 Below med. Two_parents No Elementary High Northeast Working Not_working
The contingency table classifying the observation according the desired factors can be created by calling the function table
as follows:
xtab <- with(nels, table(ses, parents, foreign, fa.educ, mo.educ, region, fa.wrk, mo.wrk))
The result is an 8-way contingency table, which is too large and messy to print out. The table has 2304 cells, 1269 of them are empty:
dim(xtab)
## [1] 4 2 2 3 3 4 2 2
prod(dim(xtab))
## [1] 2304
str(xtab)
## 'table' int [1:4, 1:2, 1:2, 1:3, 1:3, 1:4, 1:2, 1:2] 23 7 0 0 3 0 0 0 26 15 ...
## - attr(*, "dimnames")=List of 8
## ..$ ses : chr [1:4] "1" "2" "3" "4"
## ..$ parents: chr [1:2] "Two_parents" "One_parent"
## ..$ foreign: chr [1:2] "Yes" "No"
## ..$ fa.educ: chr [1:3] "Elementary" "High" "College"
## ..$ mo.educ: chr [1:3] "Elementary" "High" "College"
## ..$ region : chr [1:4] "Northeast" "Midwest" "South" "West"
## ..$ fa.wrk : chr [1:2] "Working" "Not_working"
## ..$ mo.wrk : chr [1:2] "Working" "Not_working"
sum(xtab == 0)
## [1] 1269
Flat contingency tables are formed and printed in a different way, which is easier to view. We can create a flat contingency table from the original dataset.
tt <- ftable(ses ~ parents + foreign + region, data = nels)
str(tt)
## 'ftable' int [1:16, 1:4] 61 88 230 290 227 471 641 109 17 17 ...
## - attr(*, "row.vars")=List of 3
## ..$ parents: chr [1:2] "Two_parents" "One_parent"
## ..$ foreign: chr [1:2] "Yes" "No"
## ..$ region : chr [1:4] "Northeast" "Midwest" "South" "West"
## - attr(*, "col.vars")=List of 1
## ..$ ses: chr [1:4] "1" "2" "3" "4"
dim(tt)
## [1] 16 4
print(tt)
## ses 1 2 3 4
## parents foreign region
## Two_parents Yes Northeast 61 64 40 135
## Midwest 88 73 40 89
## South 230 142 111 164
## West 290 162 143 159
## No Northeast 227 375 453 758
## Midwest 471 844 728 916
## South 641 731 828 1158
## West 109 283 427 554
## One_parent Yes Northeast 17 8 7 14
## Midwest 17 7 7 7
## South 59 35 14 19
## West 38 25 24 16
## No Northeast 78 59 72 89
## Midwest 172 141 111 79
## South 276 168 146 129
## West 32 60 80 80
Here we used only 4 factors and selected SES as the factor to be printed in the columns.
The same flat table could be created from the already tabulated data xtab
by changing the argument data=
of the function ftable
:
(tt <- ftable(ses ~ parents + foreign + region, data = xtab))
## ses 1 2 3 4
## parents foreign region
## Two_parents Yes Northeast 61 64 40 135
## Midwest 88 73 40 89
## South 230 142 111 164
## West 290 162 143 159
## No Northeast 227 375 453 758
## Midwest 471 844 728 916
## South 641 731 828 1158
## West 109 283 427 554
## One_parent Yes Northeast 17 8 7 14
## Midwest 17 7 7 7
## South 59 35 14 19
## West 38 25 24 16
## No Northeast 78 59 72 89
## Midwest 172 141 111 79
## South 276 168 146 129
## West 32 60 80 80
To see the conditional distribution of SES given the other factors, we can calculate row proportions by the function prop.table
.
round(prop.table(tt, 1), 2)
## ses 1 2 3 4
## parents foreign region
## Two_parents Yes Northeast 0.20 0.21 0.13 0.45
## Midwest 0.30 0.25 0.14 0.31
## South 0.36 0.22 0.17 0.25
## West 0.38 0.21 0.19 0.21
## No Northeast 0.13 0.21 0.25 0.42
## Midwest 0.16 0.29 0.25 0.31
## South 0.19 0.22 0.25 0.34
## West 0.08 0.21 0.31 0.40
## One_parent Yes Northeast 0.37 0.17 0.15 0.30
## Midwest 0.45 0.18 0.18 0.18
## South 0.46 0.28 0.11 0.15
## West 0.37 0.24 0.23 0.16
## No Northeast 0.26 0.20 0.24 0.30
## Midwest 0.34 0.28 0.22 0.16
## South 0.38 0.23 0.20 0.18
## West 0.13 0.24 0.32 0.32
Now we can see how the estimated distribution of SES depends on the other factors.
To fit loglinear models, we need to create a data frame where one variable contains the observed counts in each cell and other variables include the factors classifying the cells. There should be one observation per each combination of the classification factor levels. Such a dataset can be created by calling the function as.data.frame
on the contingency table prepared by the function table
.
xtab <- with(nels,table(ses, parents, foreign, fa.educ, mo.educ, region, fa.wrk, mo.wrk))
qq <- as.data.frame(xtab, responseName = "N")
dim(qq)
## [1] 2304 9
str(qq)
## 'data.frame': 2304 obs. of 9 variables:
## $ ses : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
## $ parents: Factor w/ 2 levels "Two_parents",..: 1 1 1 1 2 2 2 2 1 1 ...
## $ foreign: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 2 ...
## $ fa.educ: Factor w/ 3 levels "Elementary","High",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ mo.educ: Factor w/ 3 levels "Elementary","High",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "Northeast","Midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ fa.wrk : Factor w/ 2 levels "Working","Not_working": 1 1 1 1 1 1 1 1 1 1 ...
## $ mo.wrk : Factor w/ 2 levels "Working","Not_working": 1 1 1 1 1 1 1 1 1 1 ...
## $ N : int 23 7 0 0 3 0 0 0 26 15 ...
head(qq)
## ses parents foreign fa.educ mo.educ region fa.wrk mo.wrk N
## 1 1 Two_parents Yes Elementary Elementary Northeast Working Working 23
## 2 2 Two_parents Yes Elementary Elementary Northeast Working Working 7
## 3 3 Two_parents Yes Elementary Elementary Northeast Working Working 0
## 4 4 Two_parents Yes Elementary Elementary Northeast Working Working 0
## 5 1 One_parent Yes Elementary Elementary Northeast Working Working 3
## 6 2 One_parent Yes Elementary Elementary Northeast Working Working 0
The counts are included in the variable N
, as requested by the argument responseName
to the function as.data.frame
.
Now we can fit a loglinear model using selected factors as predictors:
fit <- glm(N ~ (ses + fa.wrk + foreign)^2, family = poisson, data = qq)
summary(fit)
##
## Call:
## glm(formula = N ~ (ses + fa.wrk + foreign)^2, family = poisson,
## data = qq)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.997 -2.789 -1.531 -0.684 53.471
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.47802 0.03820 38.694 < 2e-16 ***
## ses2 -0.33556 0.05786 -5.800 6.63e-09 ***
## ses3 -0.58325 0.06324 -9.223 < 2e-16 ***
## ses4 -0.11995 0.05537 -2.166 0.0303 *
## fa.wrkNot_working -1.31990 0.06857 -19.248 < 2e-16 ***
## foreignNo 0.94523 0.04426 21.358 < 2e-16 ***
## ses2:fa.wrkNot_working -0.62361 0.07389 -8.440 < 2e-16 ***
## ses3:fa.wrkNot_working -1.02818 0.08173 -12.580 < 2e-16 ***
## ses4:fa.wrkNot_working -1.24611 0.07860 -15.854 < 2e-16 ***
## ses2:foreignNo 0.71045 0.06401 11.099 < 2e-16 ***
## ses3:foreignNo 1.06290 0.06901 15.402 < 2e-16 ***
## ses4:foreignNo 0.89453 0.06135 14.580 < 2e-16 ***
## fa.wrkNot_working:foreignNo -0.12949 0.07068 -1.832 0.0669 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 65251 on 2303 degrees of freedom
## Residual deviance: 48150 on 2291 degrees of freedom
## AIC: 51458
##
## Number of Fisher Scoring iterations: 7
drop1(fit, test = "Chisq")
## Single term deletions
##
## Model:
## N ~ (ses + fa.wrk + foreign)^2
## Df Deviance AIC LRT Pr(>Chi)
## <none> 48150 51458
## ses:fa.wrk 3 48450 51753 300.528 < 2e-16 ***
## ses:foreign 3 48458 51761 308.570 < 2e-16 ***
## fa.wrk:foreign 1 48153 51459 3.298 0.06936 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sometimes we would like to reformat the dataset back into a contingency table, perhaps a less detailed (marginal) one. This can be achieved by the function xtabs
.
(marg.tbl <- xtabs(N ~ ses + fa.wrk + foreign, data = qq))
## , , foreign = Yes
##
## fa.wrk
## ses Working Not_working
## 1 647 153
## 2 441 75
## 3 350 36
## 4 557 46
##
## , , foreign = No
##
## fa.wrk
## ses Working Not_working
## 1 1609 397
## 2 2374 287
## 3 2627 218
## 4 3528 235
str(marg.tbl)
## int [1:4, 1:2, 1:2] 647 441 350 557 153 75 36 46 1609 2374 ...
## - attr(*, "dimnames")=List of 3
## ..$ ses : chr [1:4] "1" "2" "3" "4"
## ..$ fa.wrk : chr [1:2] "Working" "Not_working"
## ..$ foreign: chr [1:2] "Yes" "No"
## - attr(*, "class")= chr [1:2] "xtabs" "table"
## - attr(*, "call")= language xtabs(formula = N ~ ses + fa.wrk + foreign, data = qq)
The object marg.tbl
behaves as if it were created by the function table
from the original data nels
. E.g., we can print it out in the flat table format.
ftable(marg.tbl, col.vars = "ses")
## ses 1 2 3 4
## fa.wrk foreign
## Working Yes 647 441 350 557
## No 1609 2374 2627 3528
## Not_working Yes 153 75 36 46
## No 397 287 218 235
Now reformat marg.tbl
back into a dataframe and fit the same loglinear model as before
fit2 <- glm(N ~ (ses + fa.wrk + foreign)^2, family = poisson, data = as.data.frame(marg.tbl, responseName = "N"))
summary(fit2)
##
## Call:
## glm(formula = N ~ (ses + fa.wrk + foreign)^2, family = poisson,
## data = as.data.frame(marg.tbl, responseName = "N"))
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8 9 10
## 0.62103 -0.48972 -0.12445 -0.12565 -1.22589 1.25680 0.39765 0.44781 -0.38935 0.21302
## 11 12 13 14 15 16
## 0.04554 0.05003 0.79696 -0.60457 -0.15748 -0.19298
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.44783 0.03820 168.804 < 2e-16 ***
## ses2 -0.33556 0.05786 -5.800 6.63e-09 ***
## ses3 -0.58325 0.06324 -9.223 < 2e-16 ***
## ses4 -0.11995 0.05537 -2.166 0.0303 *
## fa.wrkNot_working -1.31990 0.06857 -19.248 < 2e-16 ***
## foreignNo 0.94523 0.04426 21.358 < 2e-16 ***
## ses2:fa.wrkNot_working -0.62361 0.07389 -8.440 < 2e-16 ***
## ses3:fa.wrkNot_working -1.02818 0.08173 -12.580 < 2e-16 ***
## ses4:fa.wrkNot_working -1.24611 0.07860 -15.854 < 2e-16 ***
## ses2:foreignNo 0.71045 0.06401 11.099 < 2e-16 ***
## ses3:foreignNo 1.06290 0.06901 15.402 < 2e-16 ***
## ses4:foreignNo 0.89453 0.06135 14.580 < 2e-16 ***
## fa.wrkNot_working:foreignNo -0.12949 0.07068 -1.832 0.0669 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 17107.047 on 15 degrees of freedom
## Residual deviance: 5.362 on 3 degrees of freedom
## AIC: 155.97
##
## Number of Fisher Scoring iterations: 4
drop1(fit2, test = "Chisq")
## Single term deletions
##
## Model:
## N ~ (ses + fa.wrk + foreign)^2
## Df Deviance AIC LRT Pr(>Chi)
## <none> 5.362 155.97
## ses:fa.wrk 3 305.890 450.50 300.528 < 2e-16 ***
## ses:foreign 3 313.932 458.54 308.570 < 2e-16 ***
## fa.wrk:foreign 1 8.660 157.27 3.298 0.06936 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that the intercept, the deviance and the residuals changed compared to fit
but the parameter estimates (except the intercept) and test results are the same. Can you explain this?