Outline of this lab session:
The idea is to consider a general version of the linear regression model where \[ \boldsymbol{Y} \sim \mathsf{N}(\mathbb{X}\boldsymbol{\beta}, \sigma^2 \mathbb{W}^{-1}), \] where \(\boldsymbol{Y} = (Y_1, \dots, Y_n)^\top\) is the response vector for \(n \in \mathbb{N}\) independent subjects, \(\mathbb{X}\) is the corresponding \(n \times p\) model matrix (of full rank \(p \in \mathbb{N}\)) and \(\boldsymbol{\beta} \in \mathbb{R}^p\) is the vector of the unknown parameters.
Note that the normality assumption above is only a simplification; it is not needed to fit the model (without normality, the results and the inference hold asymptotically).
Considering the variance structure reflected in the matrix \(\mathbb{W} \in \mathbb{R}^{n \times n}\) (a diagonal matrix, because the subjects are mutually independent), we can distinguish two cases: the matrix \(\mathbb{W}\) is either known (e.g., from the design of the experiment) or unknown.
These cases lead to the so-called weighted least squares estimation and, respectively, to heteroscedasticity-consistent (sandwich) estimation of the variance matrix.
Both cases will be briefly addressed below.
library("mffSM") # Hlavy, plotLM()
library("colorspace") # color palettes
library("MASS") # stdres()
library("sandwich") # vcovHC()
library("lmtest") # coeftest()
data(Hlavy, package = "mffSM")
We start with the dataset Hlavy (heads) which represents
head sizes of fetuses (from an ultrasound monitoring) in different
periods of pregnancy. Each data value (row in the dataframe) provides an
average over \(n_i \in \mathbb{N}\)
measurements of \(n_i\) independent
subjects (mothers).
The information provided in the data stands for:
- i: the serial number,
- t: the week of the pregnancy,
- n: the number of measurements used to calculate the average head size,
- y: the average head size obtained from that number of measurements.
head(Hlavy)
## i t n y
## 1 1 16 2 5.100
## 2 2 18 3 5.233
## 3 3 19 9 4.744
## 4 4 20 10 5.110
## 5 5 21 11 5.236
## 6 6 22 20 5.740
dim(Hlavy)
## [1] 26 4
summary(Hlavy)
## i t n y
## Min. : 1.00 Min. :16.00 Min. : 2.00 Min. :4.744
## 1st Qu.: 7.25 1st Qu.:23.25 1st Qu.:20.25 1st Qu.:5.871
## Median :13.50 Median :29.50 Median :47.00 Median :7.776
## Mean :13.50 Mean :29.46 Mean :39.85 Mean :7.461
## 3rd Qu.:19.75 3rd Qu.:35.75 3rd Qu.:60.00 3rd Qu.:8.980
## Max. :26.00 Max. :42.00 Max. :85.00 Max. :9.633
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(y ~ t, data = Hlavy, pch = 23, col = "red3", bg = "orange")

The simple scatterplot above can, however, be substantially improved by also accounting for the reliability of each observation (a higher number of measurements used to obtain the average means higher reliability). This can be obtained, for instance, by the following code:
with(Hlavy, quantile(n, prob = seq(0, 1, by = 0.2)))
## 0% 20% 40% 60% 80% 100%
## 2 11 31 53 60 85
Hlavy <- transform(Hlavy,
fn = cut(n, breaks = c(0, 10, 30, 50, 60, 90),
labels = c("<=10", "11-30", "31-50", "51-60", ">60")),
cexn = n/25+1)
with(Hlavy, summary(fn))
## <=10 11-30 31-50 51-60 >60
## 5 5 5 6 5
plotHlavy <- function(){
PAL <- rev(heat_hcl(nlevels(Hlavy[, "fn"]), c.=c(80, 30), l=c(30, 90), power=c(1/5, 1.3)))
names(PAL) <- levels(Hlavy[, "fn"])
plot(y ~ t, data = Hlavy,
xlab = "Week", ylab = "Average head size",
pch = 23, col = "black", bg = PAL[fn], cex = cexn)
legend(17, 9, legend = levels(Hlavy[, "fn"]), title = "n", fill = PAL)
abline(v = 27, col = "lightblue", lty = 2)
}
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plotHlavy()

It is reasonable to assume that for each individual measurement \(\widetilde{Y}_{ij}\) we have a model of the form \[ \widetilde{Y}_{ij} | \boldsymbol{X}_{i} \sim \mathsf{N}(\boldsymbol{X}_{i}^\top\boldsymbol{\beta}, \sigma^2), \] where for each \(j \in \{1, \ldots, n_i\}\) these observations are measured under the same set of covariates \(\boldsymbol{X}_i\). This is a homoscedastic model.
Unfortunately, we do not observe \(\widetilde{Y}_{ij}\) directly, but, instead, we only observe the overall average \(Y_i = \frac{1}{n_i}\sum_{j = 1}^{n_i} \widetilde{Y}_{ij}\).
It then holds that \[ Y_i \sim \mathsf{N}(\boldsymbol{X}_i^\top\boldsymbol{\beta}, \sigma^2/{n_i}), \] or, considering all the data together, \[ \boldsymbol{Y} \sim \mathsf{N}_n(\mathbb{X}\boldsymbol{\beta}, \sigma^2 \mathbb{W}^{-1}), \] where \(\mathbb{W}^{-1} = \mathsf{diag}(1/n_1, \dots, 1/n_n)\) is a diagonal matrix (thus, the variance structure is known from the design of the experiment). The matrix \(\mathbb{W} = \mathsf{diag}(n_1, \dots, n_n)\) is the weight matrix, which gives each \(Y_i\) the weight corresponding to the number of observations it was obtained from.
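As a quick sanity check of this structure, the following self-contained sketch (simulated data, not the Hlavy dataset) verifies that lm() with weights \(n_i\) coincides with ordinary least squares applied to the model rescaled by \(\sqrt{n_i}\):

```r
## Simulated averages: Y_i has variance sigma^2 / n_i
set.seed(1)
n <- c(2, 5, 10, 20, 50)                       # numbers of averaged measurements
x <- 1:5
y <- 1 + 0.5 * x + rnorm(5, sd = 1 / sqrt(n))  # variance decreases with n_i
fit_w  <- lm(y ~ x, weights = n)               # weighted least squares
## equivalent: rescale response and regressors by sqrt(n_i), then plain OLS
fit_sc <- lm(I(sqrt(n) * y) ~ 0 + I(sqrt(n)) + I(sqrt(n) * x))
all.equal(unname(coef(fit_w)), unname(coef(fit_sc)))  # TRUE
```

Both fits minimize the same criterion \(\sum_i n_i (y_i - \boldsymbol{x}_i^\top \boldsymbol{b})^2\), hence the identical estimates.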
In addition, practical motivation behind the model is the following:
Practitioners say that the relationship y ~ t could be
described by a piecewise quadratic function with t = 27
being a point where the quadratic relationship changes its shape.
From the practical point of view, the fitted curve should be
continuous at t = 27 and perhaps also
smooth (i.e., with at least the first derivative
continuous), as it is biologically not defensible to claim that the
speed (rate) of growth jumps at t = 27.
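Note that the parameterization used below builds the continuity of the fitted curve in automatically. With \(t_{27} = t - 27\), the regression function is \[ \mathsf{E}\left[Y \,|\, t\right] = \beta_0 + \beta_1 t_{27} + \beta_2 t_{27}^2 + \beta_3 \mathbb{1}_{[t_{27} > 0]}\, t_{27} + \beta_4 \mathbb{1}_{[t_{27} > 0]}\, t_{27}^2. \] Both indicator terms vanish at \(t_{27} = 0\), so the two quadratic branches meet at the common value \(\beta_0\) at t = 27 and the curve is continuous by construction. Smoothness additionally requires \(\beta_3 = 0\), because the left derivative at t = 27 is \(\beta_1\) while the right derivative is \(\beta_1 + \beta_3\); this is exactly the hypothesis addressed by the t-test of the \(\beta_3\) coefficient in the summary below.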
Since different observations have different credibility, we will use the general linear model with the known weight matrix \(\mathbb{W}\). The model can be fitted as follows:
Hlavy <- transform(Hlavy, t27 = t - 27)
summary(mCont <- lm(y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2),
weights = n, data = Hlavy))
##
## Call:
## lm(formula = y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2),
## data = Hlavy, weights = n)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.88212 -0.35216 0.01232 0.37917 0.85320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.049749 0.042425 166.170 < 2e-16 ***
## t27 0.381177 0.030288 12.585 3.01e-11 ***
## I(t27^2) 0.016214 0.003943 4.112 0.000497 ***
## t27:t27 > 0TRUE -0.063531 0.039426 -1.611 0.122017
## I(t27^2):t27 > 0TRUE -0.026786 0.003776 -7.094 5.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.477 on 21 degrees of freedom
## Multiple R-squared: 0.9968, Adjusted R-squared: 0.9962
## F-statistic: 1652 on 4 and 21 DF, p-value: < 2.2e-16
Compare to the naive (and incorrect) model where the weights are ignored. The point estimates remain unbiased, but they are inefficient and the reported standard errors are no longer valid:
mCont_noweights <- lm(y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), data = Hlavy)
summary(mCont_noweights)
##
## Call:
## lm(formula = y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2),
## data = Hlavy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34389 -0.05749 -0.01730 0.05353 0.22852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.084353 0.069011 102.655 < 2e-16 ***
## t27 0.422937 0.032626 12.963 1.73e-11 ***
## I(t27^2) 0.021672 0.003110 6.968 7.00e-07 ***
## t27:t27 > 0TRUE -0.121159 0.049259 -2.460 0.0227 *
## I(t27^2):t27 > 0TRUE -0.030978 0.002924 -10.593 7.00e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1212 on 21 degrees of freedom
## Multiple R-squared: 0.9955, Adjusted R-squared: 0.9947
## F-statistic: 1173 on 4 and 21 DF, p-value: < 2.2e-16
Fitted curves:
tgrid <- seq(16, 42, length = 100)
pdata <- data.frame(t27 = tgrid - 27)
fitCont <- predict(mCont, newdata = pdata)
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plotHlavy()
lines(tgrid, fitCont, col = "red2", lwd = 3)
lines(tgrid, predict(mCont_noweights, newdata = pdata), col = "navyblue", lwd = 2, lty = 2)
legend("bottomright", bty = "n",
legend = c(bquote(Weights: hat(sigma) == .(round(summary(mCont)$sigma, 3))),
bquote(Noweights: hat(sigma) == .(round(summary(mCont_noweights)$sigma, 3)))),
col = c("red2", "navyblue"), lty = c(1,2), lwd = c(3,2)) # lwd matches the lines drawn above

### Key matrices
W <- diag(Hlavy$n)
Y <- Hlavy$y
X <- model.matrix(mCont)
### Estimated regression coefficients
c(betahat <- solve(t(X)%*%W%*%X)%*%(t(X)%*%W%*%Y))
## [1] 7.04974918 0.38117667 0.01621412 -0.06353122 -0.02678567
coef(mCont)
## (Intercept) t27 I(t27^2) t27:t27 > 0TRUE I(t27^2):t27 > 0TRUE
## 7.04974918 0.38117667 0.01621412 -0.06353122 -0.02678567
### Generalized fitted values
c(Yg <- X%*%betahat)
## [1] 4.818714 4.932503 5.038040 5.176004 5.346398 5.549219 5.784468 6.052146 6.352252 6.684787 7.049749
## [12] 7.356823 7.642754 7.907542 8.151186 8.373688 8.575046 8.755262 8.914334 9.052263 9.169049 9.264692
## [23] 9.339192 9.392549 9.424762 9.435833
fitted(mCont)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 4.818714 4.932503 5.038040 5.176004 5.346398 5.549219 5.784468 6.052146 6.352252 6.684787 7.049749 7.356823
## 13 14 15 16 17 18 19 20 21 22 23 24
## 7.642754 7.907542 8.151186 8.373688 8.575046 8.755262 8.914334 9.052263 9.169049 9.264692 9.339192 9.392549
## 25 26
## 9.424762 9.435833
### Generalized residual sum of squares
(RSS <- t(Y - Yg)%*%W%*%(Y - Yg))
## [,1]
## [1,] 4.777559
### Generalized mean square
RSS/(length(Y)-ncol(X))
## [,1]
## [1,] 0.2275028
(summary(mCont)$sigma)^2
## [1] 0.2275028
### Be careful: the Multiple R-squared of 0.9968 reported in the output of summary(mCont) is NOT equal to
sum((Yg-mean(Yg))^2)/sum((Y-mean(Y))^2)
## [1] 1.005601
### Instead, it equals the following quantity, which is more difficult to interpret:
barYw <- sum(Y*Hlavy$n)/sum(Hlavy$n)
### weighted mean of the original observations
### something as "generalized" regression sum of squares
SSR <- sum((Yg - barYw)^2 * Hlavy$n)
### something as "generalized" total variance
SST <- sum((Y - barYw)^2 * Hlavy$n)
### "generalized" R squared
SSR/SST
## [1] 0.9968318
summary(mCont)$r.squared
## [1] 0.9968318
To see where the weights come from, we can mimic a situation in which all \(\sum_i n_i\) individual measurements were observed (each equal to its group average) by replicating each row \(n_i\) times:
iii <- rep(1:nrow(Hlavy), Hlavy$n)
Hlavy2 <- Hlavy[iii,] # new dataset: each row replicated n_i times
summary(mCont2 <- lm(y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), data = Hlavy2))
##
## Call:
## lm(formula = y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2),
## data = Hlavy2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29404 -0.04505 -0.01469 0.04125 0.30050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0497492 0.0060548 1164.32 <2e-16 ***
## t27 0.3811767 0.0043227 88.18 <2e-16 ***
## I(t27^2) 0.0162141 0.0005628 28.81 <2e-16 ***
## t27:t27 > 0TRUE -0.0635312 0.0056268 -11.29 <2e-16 ***
## I(t27^2):t27 > 0TRUE -0.0267857 0.0005389 -49.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06807 on 1031 degrees of freedom
## Multiple R-squared: 0.9968, Adjusted R-squared: 0.9968
## F-statistic: 8.11e+04 on 4 and 1031 DF, p-value: < 2.2e-16
summary(mCont)
##
## Call:
## lm(formula = y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2),
## data = Hlavy, weights = n)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.88212 -0.35216 0.01232 0.37917 0.85320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.049749 0.042425 166.170 < 2e-16 ***
## t27 0.381177 0.030288 12.585 3.01e-11 ***
## I(t27^2) 0.016214 0.003943 4.112 0.000497 ***
## t27:t27 > 0TRUE -0.063531 0.039426 -1.611 0.122017
## I(t27^2):t27 > 0TRUE -0.026786 0.003776 -7.094 5.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.477 on 21 degrees of freedom
## Multiple R-squared: 0.9968, Adjusted R-squared: 0.9962
## F-statistic: 1652 on 4 and 21 DF, p-value: < 2.2e-16
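The comparison above illustrates a general fact: replicating each row \(n_i\) times reproduces the weighted point estimates exactly (the two least squares criteria are identical), but not the standard errors, since the replicated fit pretends that \(\sum_i n_i\) independent observations were available. A self-contained sketch with simulated (hypothetical) data:

```r
## Hypothetical averages: y_i is an average of n_i measurements
set.seed(2)
d <- data.frame(x = 1:6, n = c(2, 4, 8, 16, 32, 64))
d$y <- 1 + 0.3 * d$x + rnorm(6, sd = 1 / sqrt(d$n))
fit_w <- lm(y ~ x, data = d, weights = n)            # weighted fit on 6 rows
fit_r <- lm(y ~ x, data = d[rep(1:nrow(d), d$n), ])  # each row repeated n_i times
all.equal(coef(fit_w), coef(fit_r))                  # TRUE: identical point estimates
c(weighted = summary(fit_w)$sigma,
  replicated = summary(fit_r)$sigma)                 # but different residual scales / df
```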
For a general heteroscedastic model it is usually
assumed that
\[
Y_i | \boldsymbol{X}_i \sim
\mathsf{N}(\boldsymbol{X}_i^\top\boldsymbol{\beta},
\sigma^2(\boldsymbol{X}_i)),
\]
where \(\sigma^2(\cdot)\) is
some positive (unknown) function of the covariates \(\boldsymbol{X}_i \in \mathbb{R}^p\). Note
that the normality assumption is again only a simplification and
is not required.
In the following, we will consider the dataset that comes from the
project MANA (Maturita na necisto) launched in the Czech Republic in
2005. This project was a part of a bigger project whose goal was to
prepare a new form of graduation exam (“statni maturita”).
This dataset contains the results in English language at grammar
schools.
load(url("http://msekce.karlin.mff.cuni.cz/~vavraj/nmfm334/data/mana.RData"))
The variables in the data are:
- region (1 = Praha; 2 = Stredocesky; 3 = Jihocesky; 4 = Plzensky; 5 = Karlovarsky; 6 = Ustecky; 7 = Liberecky; 8 = Kralovehradecky; 9 = Pardubicky; 10 = Vysocina; 11 = Jihomoravsky; 12 = Olomoucky; 13 = Zlinsky; 14 = Moravskoslezsky),
- specialization (1 = education; 2 = social science; 3 = languages; 4 = law; 5 = math-physics; 6 = engineering; 7 = science; 8 = medicine; 9 = economics; 10 = agriculture; 11 = art),
- gender (0 = female; 1 = male),
- graduation (0 = no; 1 = yes),
- entry.exam (0 = no; 1 = yes),
- score (score in the English language in the MANA test),
- avg.mark (average mark (grade) from all subjects at the last report card), and
- mark.eng (average mark (grade) from the English language at the last 5 report cards).
For better intuition, we can transform the data as follows:
mana <- transform(mana,
fregion = factor(region, levels = 1:14,
labels = c("A", "S", "C", "P", "K", "U", "L",
"H", "E", "J", "B", "M", "Z", "T")),
fspec = factor(specialization, levels = 1:11,
labels = c("Educ", "Social", "Lang", "Law",
"Mat-Phys", "Engin", "Science",
"Medic", "Econom", "Agric", "Art")),
fgender = factor(gender, levels = 0:1, labels = c("Female", "Male")),
fgrad = factor(graduation, levels = 0:1, labels = c("No", "Yes")),
fentry = factor(entry.exam, levels = 0:1, labels = c("No", "Yes")))
summary(mana)
## region specialization gender graduation entry.exam score
## Min. : 1.000 Min. : 1.000 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. : 5.00
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:40.00
## Median : 7.000 Median : 6.000 Median :0.000 Median :1.0000 Median :0.0000 Median :44.00
## Mean : 7.462 Mean : 5.496 Mean :0.415 Mean :0.9585 Mean :0.4536 Mean :43.02
## 3rd Qu.:11.000 3rd Qu.: 8.000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:48.00
## Max. :14.000 Max. :11.000 Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :52.00
##
## avg.mark mark.eng fregion fspec fgender fgrad fentry
## Min. :1.000 Min. :1.000 T : 923 Social : 855 Female:3044 No : 216 No :2843
## 1st Qu.:1.480 1st Qu.:1.400 S : 665 Econom : 846 Male :2159 Yes:4987 Yes:2360
## Median :1.640 Median :2.000 B : 634 Engin : 638
## Mean :1.645 Mean :2.176 A : 599 Medic : 564
## 3rd Qu.:1.810 3rd Qu.:2.800 L : 534 Science: 557
## Max. :2.470 Max. :4.200 C : 474 Educ : 513
## (Other):1374 (Other):1230


For illustration purposes we will fit a simple (additive) model to explain the average mark given the region and the specialization.
mAddit <- lm(avg.mark ~ fregion + fspec, data = mana)
summary(mAddit)
##
## Call:
## lm(formula = avg.mark ~ fregion + fspec, data = mana)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68845 -0.16441 -0.00517 0.15618 0.84024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.77616 0.01432 124.065 < 2e-16 ***
## fregionS 0.00887 0.01334 0.665 0.5063
## fregionC -0.01336 0.01455 -0.918 0.3585
## fregionP 0.02008 0.02197 0.914 0.3607
## fregionK -0.02198 0.02098 -1.048 0.2948
## fregionU 0.05678 0.02049 2.772 0.0056 **
## fregionL 0.02388 0.01411 1.692 0.0907 .
## fregionH 0.04861 0.02104 2.310 0.0209 *
## fregionE 0.07980 0.02049 3.894 9.98e-05 ***
## fregionJ -0.03138 0.01942 -1.616 0.1062
## fregionB 0.03236 0.01348 2.400 0.0164 *
## fregionM 0.02588 0.02159 1.199 0.2306
## fregionZ 0.01653 0.01880 0.879 0.3794
## fregionT 0.00846 0.01250 0.677 0.4984
## fspecSocial -0.15080 0.01325 -11.382 < 2e-16 ***
## fspecLang -0.26387 0.01700 -15.520 < 2e-16 ***
## fspecLaw -0.15020 0.01534 -9.789 < 2e-16 ***
## fspecMat-Phys -0.22192 0.01926 -11.525 < 2e-16 ***
## fspecEngin -0.17752 0.01403 -12.655 < 2e-16 ***
## fspecScience -0.10411 0.01448 -7.191 7.36e-13 ***
## fspecMedic -0.14369 0.01446 -9.938 < 2e-16 ***
## fspecEconom -0.16751 0.01330 -12.591 < 2e-16 ***
## fspecAgric -0.07115 0.03120 -2.281 0.0226 *
## fspecArt -0.15527 0.02030 -7.650 2.38e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.236 on 5179 degrees of freedom
## Multiple R-squared: 0.07124, Adjusted R-squared: 0.06712
## F-statistic: 17.27 on 23 and 5179 DF, p-value: < 2.2e-16
par(mar = c(4,4,1.5,0.5))
plotLM(mAddit)
Could the variability vary with the region or school specialization?
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(sqrt(abs(stdres(mAddit))) ~ mana$fregion)

plot(sqrt(abs(stdres(mAddit))) ~ mana$fspec)

In the diagnostic plots, there are some obvious issues with heteroscedasticity. We will use the sandwich estimate of the variance matrix to account for this heteroscedasticity.
Generally, for the variance matrix of \(\widehat{\boldsymbol{\beta}}\) we have
\[
\mathsf{var} \left[ \left. \widehat{\boldsymbol{\beta}} \right|
\mathbb{X} \right]
=
\mathsf{var} \left[ \left. \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}
\mathbb{X}^\top \mathbf{Y} \right| \mathbb{X} \right]
=
\left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbb{X}^\top
\mathsf{var} \left[ \left. \mathbf{Y} \right| \mathbb{X} \right]
\mathbb{X} \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}.
\] Under homoscedasticity, \(\mathsf{var} \left[ \left. \mathbf{Y} \right|
\mathbb{X} \right] = \sigma^2 \mathbb{I}\) and the variance
estimator for \(\mathsf{var} \left[ \left.
\widehat{\boldsymbol{\beta}} \right| \mathbb{X} \right]\) is
\[
\widehat{\mathsf{var}} \left[ \left. \widehat{\boldsymbol{\beta}}
\right| \mathbb{X} \right]
=
\left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbb{X}^\top
\left[\widehat{\sigma}^2 \mathbb{I} \right]
\mathbb{X} \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}
=
\widehat{\sigma}^2 \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}.
\] Under heteroscedasticity, we consider
Heteroscedasticity-Consistent
covariance matrix estimator, which differs in the estimation of the
meat \(\mathsf{var} \left[ \left.
\mathbf{Y} \right| \mathbb{X} \right] = \mathsf{var} \left[ \left.
\boldsymbol{\varepsilon} \right| \mathbb{X} \right]\). Default
choice of type = "HC3" uses \[
\widehat{\mathsf{var}} \left[ \left. \mathbf{Y} \right| \mathbb{X}
\right]
=
\mathsf{diag} \left(U_i^2 / m_{ii}^2 \right),
\] where \(\mathbf{U} = \mathbf{Y} -
\widehat{\mathbf{Y}} = \mathbf{Y} - \mathbb{X} \left(\mathbb{X}^\top
\mathbb{X}\right)^{-1} \mathbb{X}^\top \mathbf{Y} = (\mathbb{I} -
\mathbb{H}) \mathbf{Y} = \mathbb{M} \mathbf{Y}\) are the
residuals.
library(sandwich)
### Estimate of the variance matrix of the regression coefficients
VHC <- vcovHC(mAddit, type = "HC3") # type changes the "meat"
VHC[1:5,1:5]
## (Intercept) fregionS fregionC fregionP fregionK
## (Intercept) 0.0002035881 -0.0001045279 -0.0001065979 -0.0001052227 -0.0001034781
## fregionS -0.0001045279 0.0001942935 0.0001020741 0.0001023074 0.0001011661
## fregionC -0.0001065979 0.0001020741 0.0002077926 0.0001022329 0.0001006027
## fregionP -0.0001052227 0.0001023074 0.0001022329 0.0004862487 0.0001010982
## fregionK -0.0001034781 0.0001011661 0.0001006027 0.0001010982 0.0004261994
And we can compare the result with the manually reconstructed estimate:
X <- model.matrix(mAddit)
Bread <- solve(t(X) %*% X) %*% t(X)
Meat <- diag(residuals(mAddit)^2 / (1 - hatvalues(mAddit))^2)
(Bread %*% Meat %*% t(Bread))[1:5, 1:5]
## (Intercept) fregionS fregionC fregionP fregionK
## (Intercept) 0.0002035881 -0.0001045279 -0.0001065979 -0.0001052227 -0.0001034781
## fregionS -0.0001045279 0.0001942935 0.0001020741 0.0001023074 0.0001011661
## fregionC -0.0001065979 0.0001020741 0.0002077926 0.0001022329 0.0001006027
## fregionP -0.0001052227 0.0001023074 0.0001022329 0.0004862487 0.0001010982
## fregionK -0.0001034781 0.0001011661 0.0001006027 0.0001010982 0.0004261994
all.equal(Bread %*% Meat %*% t(Bread), VHC)
## [1] TRUE
This can be compared with the classical estimate of the variance-covariance matrix. In the following, we only look at a portion of each matrix, multiplied by 10000:
10000*VHC[1:5, 1:5]
## (Intercept) fregionS fregionC fregionP fregionK
## (Intercept) 2.035881 -1.045279 -1.065979 -1.052227 -1.034781
## fregionS -1.045279 1.942935 1.020741 1.023074 1.011661
## fregionC -1.065979 1.020741 2.077926 1.022329 1.006027
## fregionP -1.052227 1.023074 1.022329 4.862487 1.010982
## fregionK -1.034781 1.011661 1.006027 1.010982 4.261994
10000*vcov(mAddit)[1:5, 1:5]
## (Intercept) fregionS fregionC fregionP fregionK
## (Intercept) 2.0495828 -1.0090753 -0.9704506 -0.9933825 -0.9449827
## fregionS -1.0090753 1.7808011 0.9400668 0.9440175 0.9342499
## fregionC -0.9704506 0.9400668 2.1167114 0.9436525 0.9366535
## fregionP -0.9933825 0.9440175 0.9436525 4.8251687 0.9394418
## fregionK -0.9449827 0.9342499 0.9366535 0.9394418 4.4005768
Note that some variance components are overestimated by the standard
estimate (function vcov()) while others are
underestimated.
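The type argument of vcovHC() changes only the estimate of the meat: HC0 uses the raw squared residuals \(U_i^2\), HC1 rescales them by \(n/(n-p)\), HC2 divides them by \(1 - h_{ii}\), and HC3 by \((1 - h_{ii})^2\). A small self-contained comparison on simulated heteroscedastic data (not the mana dataset):

```r
library("sandwich")
set.seed(3)
x <- rexp(100)
y <- 1 + x + rnorm(100, sd = x)   # error standard deviation grows with x
fit <- lm(y ~ x)
## standard error of the slope under the different variance estimators
sapply(c("const", "HC0", "HC1", "HC2", "HC3"),
       function(tp) sqrt(vcovHC(fit, type = tp)["x", "x"]))
## HC1 is just a degrees-of-freedom rescaling of HC0:
all.equal(vcovHC(fit, type = "HC1"),
          vcovHC(fit, type = "HC0") * 100 / (100 - 2))  # TRUE
```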
The inference then proceeds with the same point estimate \(\widehat{\boldsymbol{\beta}}\), but with the
vcovHC matrix used in place of the standard
vcov(). The classical summary table can then be recalculated in the
following way:
coeftest(mAddit, vcov = VHC) # from library("lmtest")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7761568 0.0142684 124.4816 < 2.2e-16 ***
## fregionS 0.0088705 0.0139389 0.6364 0.5245565
## fregionC -0.0133616 0.0144150 -0.9269 0.3540100
## fregionP 0.0200798 0.0220510 0.9106 0.3625455
## fregionK -0.0219780 0.0206446 -1.0646 0.2871117
## fregionU 0.0567816 0.0213933 2.6542 0.0079746 **
## fregionL 0.0238781 0.0141014 1.6933 0.0904561 .
## fregionH 0.0486074 0.0202311 2.4026 0.0163134 *
## fregionE 0.0798020 0.0216973 3.6780 0.0002375 ***
## fregionJ -0.0313805 0.0196311 -1.5985 0.1099897
## fregionB 0.0323584 0.0139521 2.3192 0.0204203 *
## fregionM 0.0258848 0.0242488 1.0675 0.2858099
## fregionZ 0.0165285 0.0183457 0.9009 0.3676588
## fregionT 0.0084599 0.0126113 0.6708 0.5023660
## fspecSocial -0.1507984 0.0128745 -11.7129 < 2.2e-16 ***
## fspecLang -0.2638676 0.0168792 -15.6327 < 2.2e-16 ***
## fspecLaw -0.1502042 0.0152848 -9.8271 < 2.2e-16 ***
## fspecMat-Phys -0.2219191 0.0188825 -11.7526 < 2.2e-16 ***
## fspecEngin -0.1775146 0.0141988 -12.5021 < 2.2e-16 ***
## fspecScience -0.1041070 0.0143710 -7.2442 4.988e-13 ***
## fspecMedic -0.1436849 0.0141332 -10.1665 < 2.2e-16 ***
## fspecEconom -0.1675115 0.0128221 -13.0643 < 2.2e-16 ***
## fspecAgric -0.0711497 0.0271456 -2.6210 0.0087917 **
## fspecArt -0.1552720 0.0222920 -6.9654 3.684e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unfortunately, the function predict.lm is not flexible enough
to accept a user-supplied variance matrix, so the confidence intervals need to be computed manually:
newdata <- data.frame(fregion = "A",
fspec = levels(mana$fspec),
avg.mark = 0)
# predict no sandwich
pred_vcov <- predict(mAddit, newdata = newdata, interval = "confidence")
# predict with sandwich (manually)
L <- model.matrix(mAddit, data = newdata)
se.fit <- sqrt(diag(L %*% VHC %*% t(L)))
pred_VHC <- matrix(NA, nrow = nrow(L), ncol = 3)
pred_VHC[,1] <- L %*% coef(mAddit)
pred_VHC[,2] <- pred_VHC[,1] - se.fit * qt(0.975, df = mAddit$df.residual)
pred_VHC[,3] <- pred_VHC[,1] + se.fit * qt(0.975, df = mAddit$df.residual)
colnames(pred_VHC) <- c("fit", "lwr", "upr")
rownames(pred_VHC) <- rownames(pred_vcov)
all.equal(pred_vcov[,"fit"], pred_VHC[,"fit"])
## [1] TRUE
all.equal(pred_vcov[,"lwr"], pred_VHC[,"lwr"])
## [1] "Mean relative difference: 0.001029055"
all.equal(pred_vcov[,"upr"], pred_VHC[,"upr"])
## [1] "Mean relative difference: 0.0009898677"
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(x = 1:nlevels(mana$fspec), y = pred_vcov[,"fit"],
xlab = "Specialization", ylab = "Expected average mark, Prague region",
xaxt = "n", ylim = range(pred_vcov, pred_VHC),
col = "red4", pch = 16, cex = 2)
axis(1, at = 1:nlevels(mana$fspec), labels = levels(mana$fspec))
arrows(x0 = 1:nlevels(mana$fspec), y0 = pred_vcov[,"fit"], y1 = pred_vcov[,"lwr"],
angle = 80, lwd = 2, col = "red")
arrows(x0 = 1:nlevels(mana$fspec), y0 = pred_vcov[,"fit"], y1 = pred_vcov[,"upr"],
angle = 80, lwd = 2, col = "red")
arrows(x0 = 1:nlevels(mana$fspec), y0 = pred_vcov[,"fit"], y1 = pred_VHC[,"lwr"],
angle = 80, lwd = 2, col = "navyblue")
arrows(x0 = 1:nlevels(mana$fspec), y0 = pred_vcov[,"fit"], y1 = pred_VHC[,"upr"],
angle = 80, lwd = 2, col = "navyblue")
legend("top", legend = c("vcov()", "vcovHC(type = 'HC3')"),
ncol = 2, bty = "n", col = c("red", "navyblue"), lwd = 2)

Outline of this lab session:
We will use a small dataset (47 observations, 8 variables) containing information about fires that occurred in Chicago in 1970. In particular, we will focus on the locality within Chicago (north vs. south) and the proportion of minority citizens. The dataset is loaded from the website and a brief overview is provided below.
library("mffSM") # plotLM()
library("MASS") # stdres()
library("fBasics") # dagoTest()
library("car") # ncvTest()
library("lmtest") # bptest()
chicago <- read.csv("http://www.karlin.mff.cuni.cz/~vavraj/nmfm334/data/chicago.csv",
header=T, stringsAsFactors = TRUE)
chicago <- transform(chicago,
fside = factor(side, levels = 0:1, labels=c("North", "South")))
head(chicago)
## minor fire theft old insur income side fside
## 1 10.0 6.2 29 60.4 0.0 11.744 0 North
## 2 22.2 9.5 44 76.5 0.1 9.323 0 North
## 3 19.6 10.5 36 73.5 1.2 9.948 0 North
## 4 17.3 7.7 37 66.9 0.5 10.656 0 North
## 5 24.5 8.6 53 81.4 0.7 9.730 0 North
## 6 54.0 34.1 68 52.6 0.3 8.231 0 North
summary(chicago)
## minor fire theft old insur income
## Min. : 1.00 Min. : 2.00 Min. : 3.00 Min. : 2.00 Min. :0.0000 Min. : 5.583
## 1st Qu.: 3.75 1st Qu.: 5.65 1st Qu.: 22.00 1st Qu.:48.60 1st Qu.:0.0000 1st Qu.: 8.447
## Median :24.50 Median :10.40 Median : 29.00 Median :65.00 Median :0.4000 Median :10.694
## Mean :34.99 Mean :12.28 Mean : 32.36 Mean :60.33 Mean :0.6149 Mean :10.696
## 3rd Qu.:57.65 3rd Qu.:16.05 3rd Qu.: 38.00 3rd Qu.:77.30 3rd Qu.:0.9000 3rd Qu.:11.989
## Max. :99.70 Max. :39.70 Max. :147.00 Max. :90.10 Max. :2.2000 Max. :21.480
## side fside
## Min. :0.0000 North:25
## 1st Qu.:0.0000 South:22
## Median :0.0000
## Mean :0.4681
## 3rd Qu.:1.0000
## Max. :1.0000
The data contain information from the Chicago insurance redlining study of 47 districts in 1970, where
- minor stands for the percentage of minority (0-100) in a given district,
- fire stands for the number of fires per 100 households during the reference period,
- theft denotes the number of reported thefts per 1000 inhabitants,
- old gives the percentage of residential units built before 1939,
- insur states the number of policies per 100 households,
- income is the median income (USD 1000) per household,
- side gives the location of the district within Chicago (0 = North, 1 = South).
Use the data on fires in Chicago (the dataset loaded above) and
perform some basic exploratory analysis, focusing on the number of
fires (the dependent variable fire) and two explanatory
variables: the proportion of the minority (minor) and the
location in Chicago (side).
What are your expectations about the form of the regression function \(\mathsf{E} [Y |\, \mathtt{minor}, \mathtt{side}]\) when modeling the conditional expected number of fires given the minority proportion and the locality information?
Are there any issues that you should be concerned about? What about the assumptions of the normal regression model, or of an ordinary regression model (without the assumption of normality)?
What particular graphical tools can be used to visually verify the assumptions imposed on the model?
We start by looking for a model of dependence of the number of fires
(fire) on the percentage of minority (minor).
We are also aware of the fact that the form of this dependence may
differ in north and south parts of Chicago.
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
XLAB <- "Minority (%)"
YLAB <- "Fires (#/100 households)"
COLS <- c(North = "darkgreen", South = "red")
BGS <- c(North = "aquamarine", South = "orange")
PCHS <- c(North = 21, South = 23)
plot(fire ~ minor, data = chicago,
xlab = XLAB, ylab = YLAB,
col = COLS[fside], bg = BGS[fside], pch = PCHS[fside])
legend("topleft", legend = c("North", "South"), bty = "n",
pch = PCHS, col = COLS, pt.bg = BGS)
lines(with(subset(chicago, fside == "North"), lowess(minor, fire)),
col = COLS["North"], lwd = 2)
lines(with(subset(chicago, fside == "South"), lowess(minor, fire)),
col = COLS["South"], lwd = 2)

Let's compare the following three models. Based on the output, describe the effect of the percentage of minority on the number of fires. Describe also the effect of the side.
m <- list()
m[["Line"]] <- lm(fire ~ minor * fside, data = chicago)
m[["Log2"]] <- lm(fire ~ log2(minor) * fside, data = chicago)
m[["Parab"]] <- lm(fire ~ (minor + I(minor^2)) * fside, data = chicago)
summary(m[["Line"]])
##
## Call:
## lm(formula = fire ~ minor * fside, data = chicago)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.262 -3.340 -1.474 3.075 18.303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.19515 1.69212 3.070 0.0037 **
## minor 0.32274 0.04896 6.592 5.03e-08 ***
## fsideSouth 1.37454 3.10252 0.443 0.6600
## minor:fsideSouth -0.20812 0.06589 -3.159 0.0029 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.535 on 43 degrees of freedom
## Multiple R-squared: 0.5387, Adjusted R-squared: 0.5065
## F-statistic: 16.74 on 3 and 43 DF, p-value: 2.376e-07
summary(m[["Log2"]])
##
## Call:
## lm(formula = fire ~ log2(minor) * fside, data = chicago)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5972 -4.6595 -0.8746 2.2849 17.3832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1727 2.2718 -0.076 0.9397
## log2(minor) 3.9807 0.5982 6.655 4.08e-08 ***
## fsideSouth 2.9180 4.1885 0.697 0.4898
## log2(minor):fsideSouth -2.0180 0.8960 -2.252 0.0295 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.441 on 43 degrees of freedom
## Multiple R-squared: 0.5518, Adjusted R-squared: 0.5206
## F-statistic: 17.65 on 3 and 43 DF, p-value: 1.291e-07
summary(m[["Parab"]])
##
## Call:
## lm(formula = fire ~ (minor + I(minor^2)) * fside, data = chicago)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7633 -3.9179 -0.5716 3.0956 14.3219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.021900 1.822544 1.109 0.27373
## minor 0.754990 0.144585 5.222 5.48e-06 ***
## I(minor^2) -0.005287 0.001685 -3.137 0.00315 **
## fsideSouth 1.721106 3.419367 0.503 0.61742
## minor:fsideSouth -0.435921 0.194558 -2.241 0.03053 *
## I(minor^2):fsideSouth 0.003173 0.002118 1.498 0.14180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.855 on 41 degrees of freedom
## Multiple R-squared: 0.6469, Adjusted R-squared: 0.6038
## F-statistic: 15.02 on 5 and 41 DF, p-value: 2.228e-08
And the corresponding diagnostics:
par(mar = c(4,4,2,0.5))
plotLM(m[["Line"]])

plotLM(m[["Log2"]])

plotLM(m[["Parab"]])

For all three models there can be some issues spotted in the diagnostic plots above (heteroscedasticity, conditional mean specification, maybe normality issues).
Some statistical tests of normality applied to the raw residuals, which are heteroscedastic even under a homoscedastic model (recall that \(\mathsf{var}\left[U_i\right] = \sigma^2 (1 - h_{ii})\), with \(h_{ii}\) the \(i\)-th diagonal element of the hat matrix):
shapiro.test(residuals(m[["Line"]]))
##
## Shapiro-Wilk normality test
##
## data: residuals(m[["Line"]])
## W = 0.90361, p-value = 0.0009409
dagoTest(residuals(m[["Line"]]))
##
## Title:
## D'Agostino Normality Test
##
## Test Results:
## STATISTIC:
## Chi2 | Omnibus: 9.4764
## Z3 | Skewness: 2.1817
## Z4 | Kurtosis: 2.1717
## P VALUE:
## Omnibus Test: 0.008754
## Skewness Test: 0.02913
## Kurtosis Test: 0.02987
Some statistical tests of normality applied to the standardized residuals, which in turn are not exactly normally distributed (even under the normal linear model):
shapiro.test(stdres(m[["Line"]]))
##
## Shapiro-Wilk normality test
##
## data: stdres(m[["Line"]])
## W = 0.90005, p-value = 0.0007228
dagoTest(stdres(m[["Line"]]))
##
## Title:
## D'Agostino Normality Test
##
## Test Results:
## STATISTIC:
## Chi2 | Omnibus: 7.9183
## Z3 | Skewness: 1.3467
## Z4 | Kurtosis: 2.4707
## P VALUE:
## Omnibus Test: 0.01908
## Skewness Test: 0.1781
## Kurtosis Test: 0.01348
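The standardized residuals returned by `stdres()` can be computed by hand from the leverages (the diagonal of the hat matrix) and the residual standard error. A minimal sketch on synthetic data (hypothetical numbers, since the model objects above are specific to this lab):

```r
library("MASS")   # stdres()

## A small synthetic fit, for illustration only
set.seed(1)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

h <- hatvalues(fit)             # leverages h_ii
s <- summary(fit)$sigma         # residual standard error
std <- residuals(fit) / (s * sqrt(1 - h))

## Agrees with MASS::stdres(fit)
all.equal(std, stdres(fit), check.attributes = FALSE)
```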
Statistical tests for homoscedasticity (the Breusch-Pagan test and Koenker’s studentized version) with alternatives \(\mathsf{var} \left[ Y_i | \mathbf{X}_i\right] = \sigma^2 \exp\left\{\tau \mathsf{E} \left[Y_i | \mathbf{X}_i\right]\right\}\) or, in general, \(\mathsf{var} \left[ Y_i | \mathbf{X}_i, \mathbf{Z}_i\right] = \sigma^2 \exp\left\{\mathbf{Z}_i^\top \boldsymbol{\tau}\right\}\):
ncvTest(m[["Line"]]) # default is fitted
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 25.2439, Df = 1, p = 5.0519e-07
ncvTest(m[["Line"]], var.formula = ~fitted(m[["Line"]])) # the same as before
## Non-constant Variance Score Test
## Variance formula: ~ fitted(m[["Line"]])
## Chisquare = 25.2439, Df = 1, p = 5.0519e-07
bptest(m[["Line"]], varformula = ~fitted(m[["Line"]])) # Koenker's studentized version (robust to non-normality)
##
## studentized Breusch-Pagan test
##
## data: m[["Line"]]
## BP = 13.692, df = 1, p-value = 0.0002153
bptest(m[["Line"]]) # default is the same as model formula
##
## studentized Breusch-Pagan test
##
## data: m[["Line"]]
## BP = 15.212, df = 3, p-value = 0.001644
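Under the hood, Koenker's studentized version amounts to an auxiliary regression of the squared residuals on the variance covariates, with test statistic \(n R^2\). A rough sketch on synthetic heteroscedastic data (hypothetical, not the fitted model above):

```r
library("lmtest")   # bptest()

set.seed(2)
x <- runif(100, 1, 5)
y <- 1 + x + rnorm(100, sd = x)   # error SD grows with x
fit <- lm(y ~ x)

u2  <- residuals(fit)^2
aux <- lm(u2 ~ fitted(fit))       # auxiliary regression
BP  <- length(u2) * summary(aux)$r.squared
pval <- pchisq(BP, df = 1, lower.tail = FALSE)
c(BP = BP, p.value = pval)
```

The statistic should agree with `bptest(fit, varformula = ~ fitted(fit))`.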
In addition, many points are clustered near the origin, and relatively few observations have a large minority proportion or a large number of fires. This suggests that a log transformation (of both the response and the predictor) could help.
Typically, when the residual variance increases with the conditional expectation of the response, a log transformation of the response often restores homoscedasticity (Assumption A3a from the lecture).
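A quick simulation sketch of why this works: if the errors act multiplicatively, the standard deviation of \(Y\) grows in proportion to its mean, while \(\log(Y)\) has constant variance (hypothetical numbers, purely illustrative):

```r
set.seed(3)
mu <- rep(c(2, 20, 200), each = 2000)        # three mean levels
Y  <- mu * exp(rnorm(length(mu), sd = 0.3))  # multiplicative error

tapply(Y, mu, sd)        # SD grows roughly in proportion to mu
tapply(log(Y), mu, sd)   # roughly constant (close to 0.3)
```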
First, compare the plot below (with the \(y\) axis on the log scale) with the previous plot on the original scale:
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
LYLAB <- "Log(Fires)"
plot(log(fire) ~ minor, data = chicago,
xlab = XLAB, ylab = LYLAB,
col = COLS[fside], bg = BGS[fside], pch = PCHS[fside])
legend("topleft", legend = c("North", "South"), bty = "n",
pch = PCHS, col = COLS, pt.bg = BGS)
lines(with(subset(chicago, fside == "North"), lowess(minor, log(fire))),
col = COLS["North"], lwd = 2)
lines(with(subset(chicago, fside == "South"), lowess(minor, log(fire))),
col = COLS["South"], lwd = 2)

Using the proposed transformation of the response we also change the underlying data. This means that instead of the model \[ \mathsf{E} [Y_i | \boldsymbol{X}_i] = \boldsymbol{X}_i^\top \boldsymbol{\beta} \] fitted to the random sample \(\{(Y_i, \boldsymbol{X}_i);\; i = 1, \dots, n\}\), we fit another model of the form \[ \mathsf{E} [\log(Y_i) | \boldsymbol{X}_i] = \boldsymbol{X}_i^\top \widetilde{\boldsymbol{\beta}}, \] which is based on the different random sample \(\{(\log(Y_i), \boldsymbol{X}_i);\;i = 1, \dots, n\}\). This has some crucial consequences: the logarithmic transformation helps to deal with the heteroscedasticity, but, on the other hand, it introduces new problems regarding model interpretability.
The idea is that the final model should always be interpreted with
respect to the original data, i.e., the number of fires
(fire), not the logarithm of the number of fires
(log(fire)).
To obtain a better interpretation we also apply a log
transformation (base 2, via log2) to the independent variable,
the proportion of minority residents (minor). What is the
interpretation of the following model?
summary(m <- lm(log(fire) ~ log2(minor) * fside, data = chicago))
##
## Call:
## lm(formula = log(fire) ~ log2(minor) * fside, data = chicago)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75178 -0.33949 -0.07901 0.34564 0.84890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.01616 0.14635 6.943 1.55e-08 ***
## log2(minor) 0.35961 0.03853 9.332 6.74e-12 ***
## fsideSouth 0.27963 0.26983 1.036 0.3058
## log2(minor):fsideSouth -0.14411 0.05772 -2.497 0.0164 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4149 on 43 degrees of freedom
## Multiple R-squared: 0.7278, Adjusted R-squared: 0.7088
## F-statistic: 38.33 on 3 and 43 DF, p-value: 3.234e-12
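One way to read the coefficients: since the predictor enters through log2, doubling the minority proportion adds the log2(minor) coefficient to the expected log-count of fires, i.e., it multiplies the (geometric-mean) number of fires by a constant factor. Using the estimates from the summary above:

```r
## Multiplicative effect of doubling `minor` (coefficients from the summary above)
exp(0.35961)              # North side: each doubling of minor multiplies fires by ~1.43
exp(0.35961 - 0.14411)    # South side: the corresponding factor is ~1.24
```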
The diagnostic plots have improved in terms of mean and variance:
par(mar = c(4,4,2,0.5))
plotLM(m)

Recall Jensen's inequality and the fact that \[ \mathsf{E} [\log(Y_i) | \boldsymbol{X}_i] \neq \log(\mathsf{E} [Y_i | \boldsymbol{X}_i]) \qquad \text{but rather} \qquad \mathsf{E} [\log(Y_i) | \boldsymbol{X}_i] < \log(\mathsf{E} [Y_i | \boldsymbol{X}_i]). \] And now compare the two models: \[ \begin{aligned} Y_i &= \mathbf{X}_i^\top \boldsymbol{\beta} + \varepsilon_i, \\ \log Y_i &= \mathbf{X}_i^\top \widetilde{\boldsymbol{\beta}} + \widetilde{\varepsilon}_i \quad\Longleftrightarrow\quad Y_i = \exp\left\{\mathbf{X}_i^\top \widetilde{\boldsymbol{\beta}}\right\} \cdot \exp\left\{ \widetilde{\varepsilon}_i \right\}. \end{aligned} \]
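A small numeric check of this consequence, assuming (for the sketch) normal errors on the log scale with the residual standard error 0.4149 from the fit above: the naive back-transform \(\exp\{\mathbf{X}_i^\top \widetilde{\boldsymbol{\beta}}\}\) underestimates \(\mathsf{E}[Y_i | \mathbf{X}_i]\) by the lognormal factor \(\mathsf{E}[\exp\{\widetilde{\varepsilon}_i\}] = \exp\{\widetilde{\sigma}^2/2\}\):

```r
set.seed(4)
sigma <- 0.4149                  # residual SE of the log-scale model above
eps <- rnorm(1e5, sd = sigma)

mean(exp(eps))       # Monte Carlo estimate of E[exp(eps)], not equal to 1
exp(sigma^2 / 2)     # lognormal mean correction factor, about 1.09
```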