NMST432 Advanced Regression Models

Manipulating multi-way contingency tables in R

Creating a contingency table from individual listings

Suppose the source data contain the listing of individual observations

dim(nels)

## [1] 13580     9

str(nels)

## 'data.frame':    13580 obs. of  9 variables:
##  $ ses    : Factor w/ 4 levels "1","2","3","4": 2 1 2 2 2 2 3 3 3 3 ...
##  $ sesmed : Factor w/ 2 levels "Below med.","Over med.": 1 1 1 1 1 1 2 2 2 2 ...
##  $ parents: Factor w/ 2 levels "Two_parents",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ foreign: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ fa.educ: Factor w/ 3 levels "Elementary","High",..: 2 1 2 2 1 1 2 3 2 2 ...
##  $ mo.educ: Factor w/ 3 levels "Elementary","High",..: 2 3 2 2 3 2 2 2 2 2 ...
##  $ region : Factor w/ 4 levels "Northeast","Midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ fa.wrk : Factor w/ 2 levels "Working","Not_working": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mo.wrk : Factor w/ 2 levels "Working","Not_working": 1 1 1 1 1 2 1 1 1 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:13814] 2 5 6 9 10 11 12 19 21 23 ...
##   .. ..- attr(*, "names")= chr [1:13814] "2" "5" "6" "9" ...

head(nels)

##    ses     sesmed     parents foreign    fa.educ mo.educ    region  fa.wrk      mo.wrk
## 1    2 Below med. Two_parents      No       High    High Northeast Working     Working
## 3    1 Below med. Two_parents      No Elementary College Northeast Working     Working
## 4    2 Below med. Two_parents      No       High    High Northeast Working     Working
## 7    2 Below med. Two_parents      No       High    High Northeast Working     Working
## 8    2 Below med. Two_parents      No Elementary College Northeast Working     Working
## 13   2 Below med. Two_parents      No Elementary    High Northeast Working Not_working

The contingency table classifying the observation according the desired factors can be created by calling the function table as follows:

xtab <- with(nels, table(ses, parents, foreign, fa.educ, mo.educ, region, fa.wrk, mo.wrk))

The result is an 8-way contingency table, which is too large and messy to print out. The table has 2304 cells, 1269 of them are empty:

dim(xtab)

## [1] 4 2 2 3 3 4 2 2

prod(dim(xtab))

## [1] 2304

str(xtab)

##  'table' int [1:4, 1:2, 1:2, 1:3, 1:3, 1:4, 1:2, 1:2] 23 7 0 0 3 0 0 0 26 15 ...
##  - attr(*, "dimnames")=List of 8
##   ..$ ses    : chr [1:4] "1" "2" "3" "4"
##   ..$ parents: chr [1:2] "Two_parents" "One_parent"
##   ..$ foreign: chr [1:2] "Yes" "No"
##   ..$ fa.educ: chr [1:3] "Elementary" "High" "College"
##   ..$ mo.educ: chr [1:3] "Elementary" "High" "College"
##   ..$ region : chr [1:4] "Northeast" "Midwest" "South" "West"
##   ..$ fa.wrk : chr [1:2] "Working" "Not_working"
##   ..$ mo.wrk : chr [1:2] "Working" "Not_working"

sum(xtab == 0)

## [1] 1269

Creating flat contingency tables

Flat contingency tables are formed and printed in a different way, which is easier to view. We can create a flat contingency table from the original dataset.

tt <- ftable(ses ~ parents + foreign + region, data = nels)
str(tt)

##  'ftable' int [1:16, 1:4] 61 88 230 290 227 471 641 109 17 17 ...
##  - attr(*, "row.vars")=List of 3
##   ..$ parents: chr [1:2] "Two_parents" "One_parent"
##   ..$ foreign: chr [1:2] "Yes" "No"
##   ..$ region : chr [1:4] "Northeast" "Midwest" "South" "West"
##  - attr(*, "col.vars")=List of 1
##   ..$ ses: chr [1:4] "1" "2" "3" "4"

dim(tt)

## [1] 16  4

print(tt)

##                               ses    1    2    3    4
## parents     foreign region                           
## Two_parents Yes     Northeast       61   64   40  135
##                     Midwest         88   73   40   89
##                     South          230  142  111  164
##                     West           290  162  143  159
##             No      Northeast      227  375  453  758
##                     Midwest        471  844  728  916
##                     South          641  731  828 1158
##                     West           109  283  427  554
## One_parent  Yes     Northeast       17    8    7   14
##                     Midwest         17    7    7    7
##                     South           59   35   14   19
##                     West            38   25   24   16
##             No      Northeast       78   59   72   89
##                     Midwest        172  141  111   79
##                     South          276  168  146  129
##                     West            32   60   80   80

Here we used only 4 factors and selected SES as the factor to be printed in the columns.

The same flat table could be created from the already tabulated data xtab by changing the argument data= of the function ftable:

(tt <- ftable(ses ~ parents + foreign + region, data = xtab))

##                               ses    1    2    3    4
## parents     foreign region                           
## Two_parents Yes     Northeast       61   64   40  135
##                     Midwest         88   73   40   89
##                     South          230  142  111  164
##                     West           290  162  143  159
##             No      Northeast      227  375  453  758
##                     Midwest        471  844  728  916
##                     South          641  731  828 1158
##                     West           109  283  427  554
## One_parent  Yes     Northeast       17    8    7   14
##                     Midwest         17    7    7    7
##                     South           59   35   14   19
##                     West            38   25   24   16
##             No      Northeast       78   59   72   89
##                     Midwest        172  141  111   79
##                     South          276  168  146  129
##                     West            32   60   80   80

To see the conditional distribution of SES given the other factors, we can calculate row proportions by the function prop.table.

round(prop.table(tt, 1), 2)

##                               ses    1    2    3    4
## parents     foreign region                           
## Two_parents Yes     Northeast     0.20 0.21 0.13 0.45
##                     Midwest       0.30 0.25 0.14 0.31
##                     South         0.36 0.22 0.17 0.25
##                     West          0.38 0.21 0.19 0.21
##             No      Northeast     0.13 0.21 0.25 0.42
##                     Midwest       0.16 0.29 0.25 0.31
##                     South         0.19 0.22 0.25 0.34
##                     West          0.08 0.21 0.31 0.40
## One_parent  Yes     Northeast     0.37 0.17 0.15 0.30
##                     Midwest       0.45 0.18 0.18 0.18
##                     South         0.46 0.28 0.11 0.15
##                     West          0.37 0.24 0.23 0.16
##             No      Northeast     0.26 0.20 0.24 0.30
##                     Midwest       0.34 0.28 0.22 0.16
##                     South         0.38 0.23 0.20 0.18
##                     West          0.13 0.24 0.32 0.32

Now we can see how the estimated distribution of SES depends on the other factors.

Creating datasets for loglinear model fitting

To fit loglinear models, we need to create a data frame where one variable contains the observed counts in each cell and other variables include the factors classifying the cells. There should be one observation per each combination of the classification factor levels. Such a dataset can be created by calling the function as.data.frame on the contingency table prepared by the function table.

xtab <- with(nels,table(ses, parents, foreign, fa.educ, mo.educ, region, fa.wrk, mo.wrk))
qq <- as.data.frame(xtab, responseName = "N")
dim(qq)

## [1] 2304    9

str(qq)

## 'data.frame':    2304 obs. of  9 variables:
##  $ ses    : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
##  $ parents: Factor w/ 2 levels "Two_parents",..: 1 1 1 1 2 2 2 2 1 1 ...
##  $ foreign: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 2 ...
##  $ fa.educ: Factor w/ 3 levels "Elementary","High",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ mo.educ: Factor w/ 3 levels "Elementary","High",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ region : Factor w/ 4 levels "Northeast","Midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ fa.wrk : Factor w/ 2 levels "Working","Not_working": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mo.wrk : Factor w/ 2 levels "Working","Not_working": 1 1 1 1 1 1 1 1 1 1 ...
##  $ N      : int  23 7 0 0 3 0 0 0 26 15 ...

head(qq)

##   ses     parents foreign    fa.educ    mo.educ    region  fa.wrk  mo.wrk  N
## 1   1 Two_parents     Yes Elementary Elementary Northeast Working Working 23
## 2   2 Two_parents     Yes Elementary Elementary Northeast Working Working  7
## 3   3 Two_parents     Yes Elementary Elementary Northeast Working Working  0
## 4   4 Two_parents     Yes Elementary Elementary Northeast Working Working  0
## 5   1  One_parent     Yes Elementary Elementary Northeast Working Working  3
## 6   2  One_parent     Yes Elementary Elementary Northeast Working Working  0

The counts are included in the variable N, as requested by the argument responseName to the function as.data.frame.

Now we can fit a loglinear model using selected factors as predictors:

fit <- glm(N ~ (ses + fa.wrk + foreign)^2, family = poisson, data = qq)
summary(fit)

## 
## Call:
## glm(formula = N ~ (ses + fa.wrk + foreign)^2, family = poisson, 
##     data = qq)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -6.997  -2.789  -1.531  -0.684  53.471  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  1.47802    0.03820  38.694  < 2e-16 ***
## ses2                        -0.33556    0.05786  -5.800 6.63e-09 ***
## ses3                        -0.58325    0.06324  -9.223  < 2e-16 ***
## ses4                        -0.11995    0.05537  -2.166   0.0303 *  
## fa.wrkNot_working           -1.31990    0.06857 -19.248  < 2e-16 ***
## foreignNo                    0.94523    0.04426  21.358  < 2e-16 ***
## ses2:fa.wrkNot_working      -0.62361    0.07389  -8.440  < 2e-16 ***
## ses3:fa.wrkNot_working      -1.02818    0.08173 -12.580  < 2e-16 ***
## ses4:fa.wrkNot_working      -1.24611    0.07860 -15.854  < 2e-16 ***
## ses2:foreignNo               0.71045    0.06401  11.099  < 2e-16 ***
## ses3:foreignNo               1.06290    0.06901  15.402  < 2e-16 ***
## ses4:foreignNo               0.89453    0.06135  14.580  < 2e-16 ***
## fa.wrkNot_working:foreignNo -0.12949    0.07068  -1.832   0.0669 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 65251  on 2303  degrees of freedom
## Residual deviance: 48150  on 2291  degrees of freedom
## AIC: 51458
## 
## Number of Fisher Scoring iterations: 7

drop1(fit, test = "Chisq")

## Single term deletions
## 
## Model:
## N ~ (ses + fa.wrk + foreign)^2
##                Df Deviance   AIC     LRT Pr(>Chi)    
## <none>               48150 51458                     
## ses:fa.wrk      3    48450 51753 300.528  < 2e-16 ***
## ses:foreign     3    48458 51761 308.570  < 2e-16 ***
## fa.wrk:foreign  1    48153 51459   3.298  0.06936 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Creating marginal datasets for loglinear model fitting

Sometimes we would like to reformat the dataset back into a contingency table, perhaps a less detailed (marginal) one. This can be achieved by the function xtabs.

(marg.tbl <- xtabs(N ~ ses + fa.wrk + foreign, data = qq))

## , , foreign = Yes
## 
##    fa.wrk
## ses Working Not_working
##   1     647         153
##   2     441          75
##   3     350          36
##   4     557          46
## 
## , , foreign = No
## 
##    fa.wrk
## ses Working Not_working
##   1    1609         397
##   2    2374         287
##   3    2627         218
##   4    3528         235

str(marg.tbl)

##  int [1:4, 1:2, 1:2] 647 441 350 557 153 75 36 46 1609 2374 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ ses    : chr [1:4] "1" "2" "3" "4"
##   ..$ fa.wrk : chr [1:2] "Working" "Not_working"
##   ..$ foreign: chr [1:2] "Yes" "No"
##  - attr(*, "class")= chr [1:2] "xtabs" "table"
##  - attr(*, "call")= language xtabs(formula = N ~ ses + fa.wrk + foreign, data = qq)

The object marg.tbl behaves as if it were created by the function table from the original data nels. E.g., we can print it out in the flat table format.

ftable(marg.tbl, col.vars = "ses")

##                     ses    1    2    3    4
## fa.wrk      foreign                        
## Working     Yes          647  441  350  557
##             No          1609 2374 2627 3528
## Not_working Yes          153   75   36   46
##             No           397  287  218  235

Now reformat marg.tbl back into a dataframe and fit the same loglinear model as before

fit2 <- glm(N ~ (ses + fa.wrk + foreign)^2, family = poisson, data = as.data.frame(marg.tbl, responseName = "N"))
summary(fit2)

## 
## Call:
## glm(formula = N ~ (ses + fa.wrk + foreign)^2, family = poisson, 
##     data = as.data.frame(marg.tbl, responseName = "N"))
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7         8         9        10  
##  0.62103  -0.48972  -0.12445  -0.12565  -1.22589   1.25680   0.39765   0.44781  -0.38935   0.21302  
##       11        12        13        14        15        16  
##  0.04554   0.05003   0.79696  -0.60457  -0.15748  -0.19298  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  6.44783    0.03820 168.804  < 2e-16 ***
## ses2                        -0.33556    0.05786  -5.800 6.63e-09 ***
## ses3                        -0.58325    0.06324  -9.223  < 2e-16 ***
## ses4                        -0.11995    0.05537  -2.166   0.0303 *  
## fa.wrkNot_working           -1.31990    0.06857 -19.248  < 2e-16 ***
## foreignNo                    0.94523    0.04426  21.358  < 2e-16 ***
## ses2:fa.wrkNot_working      -0.62361    0.07389  -8.440  < 2e-16 ***
## ses3:fa.wrkNot_working      -1.02818    0.08173 -12.580  < 2e-16 ***
## ses4:fa.wrkNot_working      -1.24611    0.07860 -15.854  < 2e-16 ***
## ses2:foreignNo               0.71045    0.06401  11.099  < 2e-16 ***
## ses3:foreignNo               1.06290    0.06901  15.402  < 2e-16 ***
## ses4:foreignNo               0.89453    0.06135  14.580  < 2e-16 ***
## fa.wrkNot_working:foreignNo -0.12949    0.07068  -1.832   0.0669 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17107.047  on 15  degrees of freedom
## Residual deviance:     5.362  on  3  degrees of freedom
## AIC: 155.97
## 
## Number of Fisher Scoring iterations: 4

drop1(fit2, test = "Chisq")

## Single term deletions
## 
## Model:
## N ~ (ses + fa.wrk + foreign)^2
##                Df Deviance    AIC     LRT Pr(>Chi)    
## <none>               5.362 155.97                     
## ses:fa.wrk      3  305.890 450.50 300.528  < 2e-16 ***
## ses:foreign     3  313.932 458.54 308.570  < 2e-16 ***
## fa.wrk:foreign  1    8.660 157.27   3.298  0.06936 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the intercept, the deviance and the residuals changed compared to fit but the parameter estimates (except the intercept) and test results are the same. Can you explain this?

Back to course page