Download this R markdown as: R, Rmd.

Outline of this lab session:

Loading the data and libraries

library("mffSM")         # Dris data and LSest() function
library("car")           # confidenceEllipse() function
library("colorspace")    # rainbow_hcl() colors

We will again work with the datasets Dris and peat from the previous exercises.

peat <- read.csv("http://www.karlin.mff.cuni.cz/~vavraj/nmfm334/data/peat.csv", 
                 header = TRUE, stringsAsFactors = TRUE)
data(Dris, package = "mffSM")

Theory review

As usual, we start with the general notation of the linear regression model \[ \mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \mathbf{X_1}^\top \\ \mathbf{X_2}^\top \\ \vdots \\ \mathbf{X_n}^\top \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}. \] For simplicity, we will assume the following normality assumption for the error terms: \(\boldsymbol{\varepsilon} \sim \mathsf{N}_n(\mathbf{0}, \sigma^2 \mathbb{I}_n)\), which hides several other assumptions:

  • \(\mathsf{E}\,\varepsilon_i = 0\) for all \(i\) (the conditional mean of the response is correctly specified),
  • \(\mathsf{var}\,\varepsilon_i = \sigma^2\) for all \(i\) (homoscedasticity),
  • the error terms are independent (in particular, uncorrelated),
  • each \(\varepsilon_i\) is normally distributed.

Given the regression matrix, the term \(\mathbb{X} \boldsymbol{\beta}\) only shifts the mean. Hence, \(\mathbf{Y} \,|\, \mathbb{X} \sim \mathsf{N}(\mathbb{X} \boldsymbol{\beta}, \sigma^2 \mathbb{I}_n)\). The least-squares estimator \(\widehat{\boldsymbol{\beta}} = \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbb{X}^\top \mathbf{Y}\) is then just a linear transformation of \(\mathbf{Y}\) which preserves normality: \[ \widehat{\boldsymbol{\beta}} \,|\, \mathbb{X} \sim \mathsf{N}_p \left(\boldsymbol{\beta}, \sigma^2 \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\right). \] For any linear combination of coefficients \(\mathbf{c} \in \mathbb{R}^p\) we obtain \[ \mathbf{c}^\top \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \,\Big|\, \mathbb{X} \sim \mathsf{N} \left(\mathbf{0}, \sigma^2 \mathbf{c}^\top\left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{c}\right). \] Moreover, the residual sum of squares is independent of \(\widehat{\boldsymbol{\beta}}\) and satisfies \(\dfrac{\widehat{\sigma}_n^2 (n-p)}{\sigma^2} = \frac{\mathsf{RSS}}{\sigma^2} \sim \chi^2_{n-p}\), which finally gives us \[ \dfrac{\mathbf{c}^\top \left( \widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)}{\sqrt{\widehat{\sigma}_n^2\mathbf{c}^\top\left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{c}}} = \dfrac{ \dfrac{\mathbf{c}^\top \widehat{\boldsymbol{\beta}} - \mathbf{c}^\top\boldsymbol{\beta}}{\sqrt{\sigma^2\mathbf{c}^\top\left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{c}}} }{ \sqrt{\frac{\mathsf{RSS}}{\sigma^2(n-p)}} } \,{\Huge|}\, \mathbb{X} \sim \mathsf{t}_{n-p}. \]
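These formulas can be checked numerically. The following is a minimal self-contained sketch (with simulated toy data, not the lab datasets) comparing the textbook expressions for \(\widehat{\boldsymbol{\beta}}\) and its estimated variance matrix \(\widehat{\sigma}_n^2 \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\) with the output of lm():

```r
set.seed(1)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)   # simulated toy data
fit <- lm(y ~ x)

X <- model.matrix(fit)                          # regression matrix
beta.manual <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'Y
s2 <- sum(resid(fit)^2) / fit$df.residual       # RSS / (n - p)
V.manual <- s2 * solve(t(X) %*% X)              # sigma^2-hat * (X'X)^{-1}

all.equal(drop(beta.manual), coef(fit), check.attributes = FALSE)  # TRUE
all.equal(V.manual, vcov(fit), check.attributes = FALSE)           # TRUE
```

The square roots of the diagonal of vcov(fit) are exactly the standard errors reported in the summary output.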

Let us denote \(\theta = \mathbf{c}^\top \boldsymbol{\beta}\) and \(\widehat{\theta} = \mathbf{c}^\top \widehat{\boldsymbol{\beta}}\); a statistical test of \(\mathrm{H}_0: \theta = \theta_0\) is then often desired. Under the null hypothesis, the test statistic satisfies \[ T = \dfrac{\widehat{\theta} - \theta_0}{\sqrt{\widehat{\sigma}_n^2\mathbf{c}^\top\left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{c}}} \,{\Huge|}\, \mathbb{X} \sim \mathsf{t}_{n-p}, \] which can be used for the construction of two-sided and one-sided tests on significance level \(\alpha\), depending on the choice of the alternative:

  • \(\mathrm{H}_1: \theta \neq \theta_0\): reject when \(|T| \geq \mathsf{t}_{n-p}\left(1-\frac{\alpha}{2}\right)\), with p-value \(2\,\mathsf{P}\left(\mathsf{t}_{n-p} \geq |T|\right)\),
  • \(\mathrm{H}_1: \theta < \theta_0\): reject when \(T \leq -\mathsf{t}_{n-p}(1-\alpha)\), with p-value \(\mathsf{P}\left(\mathsf{t}_{n-p} \leq T\right)\),
  • \(\mathrm{H}_1: \theta > \theta_0\): reject when \(T \geq \mathsf{t}_{n-p}(1-\alpha)\), with p-value \(\mathsf{P}\left(\mathsf{t}_{n-p} \geq T\right)\).
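In R, the three p-values and critical values follow directly from pt() and qt(). A small sketch; the test statistic 2.1 and degrees of freedom 140 are purely illustrative values, not results from any fitted model:

```r
Tstat <- 2.1; df <- 140; alpha <- 0.05             # illustrative values only

p.two     <- 2 * pt(-abs(Tstat), df)               # H1: theta != theta0
p.less    <- pt(Tstat, df)                         # H1: theta <  theta0
p.greater <- pt(Tstat, df, lower.tail = FALSE)     # H1: theta >  theta0

# corresponding critical values for the three rejection regions
c(two = qt(1 - alpha/2, df), less = qt(alpha, df), greater = qt(1 - alpha, df))
```

Note that the one-sided p-values always sum to one and the two-sided p-value is twice the smaller of them.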

Different applications

Within this section we will work with the peat dataset, for which we found in previous exercises that fitting a separate line for the dependence of N on depth in each group is appropriate:

fit3 <- lm(N ~ depth * group, data = peat)

XLAB <- "Depth [cm]"
YLAB <- "Nitrogen concentration [weight %]"
XLIM <- range(peat[, "depth"])
YLIM <- range(peat[, "N"])

PCH <- c(21, 22, 24, 25)
DARK <- rainbow_hcl(4, c = 80, l = 35)
COL <- rainbow_hcl(4, c = 80, l = 50)
BGC <- rainbow_hcl(4, c = 80, l = 70)
TRC <- rainbow_hcl(4, c = 80, l = 70, alpha = 0.3)
names(PCH) <- names(COL) <- names(BGC) <- names(DARK) <- names(TRC) <- levels(peat[, "group"])

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(N ~ jitter(depth), data = peat, 
     pch = PCH[group], col = COL[group], bg = BGC[group],
     xlab = XLAB, ylab = YLAB, xlim = XLIM, ylim = YLIM)
intercepts <- coef(fit3)[1] + c(0, coef(fit3)[3:5])
slopes <- coef(fit3)[2] + c(0, coef(fit3)[6:8])
for(g in 1:4){
  abline(a = intercepts[g], b = slopes[g], col = DARK[g], lwd = 2)
}
legend("topleft", legend = levels(peat$group), ncol = 2, title = "Group",
       col = COL, pch = PCH, pt.bg = BGC, lty = 1, lwd = 2)

Testing \(\beta_j = 0\)

We already know that the choice \(\mathbf{c} = \mathbf{e}_j = (0, \ldots, 0, 1, 0, \ldots, 0)^\top\) with 1 on \(j\)-th position leads to \(\theta = \beta_j\). Two-sided tests for \(\mathrm{H}_0: \theta = 0\) are provided by the summary output:

(sumfit3 <- summary(fit3))
## 
## Call:
## lm(formula = N ~ depth * group, data = peat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36829 -0.08469 -0.00789  0.06112  0.37713 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.846956   0.051193  16.544  < 2e-16 ***
## depth              0.005048   0.005724   0.882    0.379    
## groupCB-VJJ        0.322113   0.064417   5.000 1.68e-06 ***
## groupVJJ          -0.064041   0.075374  -0.850    0.397    
## groupVJJ-CB        0.075906   0.064417   1.178    0.241    
## depth:groupCB-VJJ -0.032600   0.007389  -4.412 2.03e-05 ***
## depth:groupVJJ     0.043942   0.008312   5.287 4.68e-07 ***
## depth:groupVJJ-CB  0.041591   0.007389   5.629 9.56e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1354 on 140 degrees of freedom
## Multiple R-squared:  0.7306, Adjusted R-squared:  0.7171 
## F-statistic: 54.23 on 7 and 140 DF,  p-value: < 2.2e-16

The corresponding confidence intervals can be easily obtained:

confint(fit3)
##                          2.5 %      97.5 %
## (Intercept)        0.745743956  0.94816772
## depth             -0.006267384  0.01636428
## groupCB-VJJ        0.194757616  0.44946860
## groupVJJ          -0.213059130  0.08497760
## groupVJJ-CB       -0.051449561  0.20326143
## depth:groupCB-VJJ -0.047209047 -0.01799169
## depth:groupVJJ     0.027509039  0.06037546
## depth:groupVJJ-CB  0.026982539  0.05619989

How can we adjust the p-values to correspond to one-sided tests?

cbind("less"    = pt(sumfit3$coefficients[,"t value"], df = fit3$df.residual, lower.tail = T),
      "greater" = pt(sumfit3$coefficients[,"t value"], df = fit3$df.residual, lower.tail = F))
##                           less      greater
## (Intercept)       1.000000e+00 4.734139e-35
## depth             8.103674e-01 1.896326e-01
## groupCB-VJJ       9.999992e-01 8.420753e-07
## groupVJJ          1.984878e-01 8.015122e-01
## groupVJJ-CB       8.796731e-01 1.203269e-01
## depth:groupCB-VJJ 1.014731e-05 9.999899e-01
## depth:groupVJJ    9.999998e-01 2.338631e-07
## depth:groupVJJ-CB 1.000000e+00 4.781560e-08

What about the confidence intervals? The function confint can be tricked into giving the finite bound of a one-sided interval. However, one must use the correct level and realize to which test each bound corresponds.

bounds <- confint(fit3, level = 0.9)     # level 0.9: 5% in each tail --> 95% one-sided bounds
bounds <- cbind(rep(-Inf, length(coef(fit3))), bounds[,2], 
                bounds[,1], rep(Inf, length(coef(fit3))))
colnames(bounds) <- c("less-lower", "less-upper", 
                      "greater-lower", "greater-upper")
bounds
##                   less-lower  less-upper greater-lower greater-upper
## (Intercept)             -Inf  0.93172223   0.762189452           Inf
## depth                   -Inf  0.01452562  -0.004428721           Inf
## groupCB-VJJ             -Inf  0.42877514   0.215451078           Inf
## groupVJJ                -Inf  0.06076423  -0.188845757           Inf
## groupVJJ-CB             -Inf  0.18256797  -0.030756098           Inf
## depth:groupCB-VJJ       -Inf -0.02036540  -0.044835344           Inf
## depth:groupVJJ          -Inf  0.05770530   0.030179203           Inf
## depth:groupVJJ-CB       -Inf  0.05382619   0.029356242           Inf
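The same one-sided bounds can also be computed directly from the estimates and standard errors. A self-contained sketch (with simulated toy data, not peat) confirming that the columns of confint(fit, level = 0.9) are exactly \(\widehat{\theta} \mp \mathsf{t}_{n-p}(0.95)\,\mathrm{SE}(\widehat{\theta})\):

```r
set.seed(2)
x <- rnorm(30)
y <- 1 + x + rnorm(30)                # simulated toy data
fit <- lm(y ~ x)

se <- summary(fit)$coefficients[, "Std. Error"]
tq <- qt(0.95, fit$df.residual)
lower <- coef(fit) - tq * se          # finite bound for the "greater" alternative
upper <- coef(fit) + tq * se          # finite bound for the "less" alternative

ci90 <- confint(fit, level = 0.9)
all.equal(unname(cbind(lower, upper)), unname(ci90))  # TRUE
```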

Testing \(\beta_j = \theta_0\)

What if we wish to test against a value \(\theta_0\) other than zero? We need to update the test statistic by subtracting \(\theta_0\) in the numerator. In the following, we again provide tests for all coefficients, now with \(\theta_0 = 1\):

theta0 <- 1
tt <- (sumfit3$coefficients[,"Estimate"] - theta0) / sumfit3$coefficients[,"Std. Error"]
cbind("two-sided" = 2*pt(-abs(tt), df = fit3$df.residual),
      "less"      = pt(tt, df = fit3$df.residual, lower.tail = T),
      "greater"   = pt(tt, df = fit3$df.residual, lower.tail = F))
##                       two-sided          less   greater
## (Intercept)        3.301432e-03  1.650716e-03 0.9983493
## depth             1.989703e-165 9.948513e-166 1.0000000
## groupCB-VJJ        1.926303e-19  9.631515e-20 1.0000000
## groupVJJ           1.073963e-28  5.369813e-29 1.0000000
## groupVJJ-CB        2.821148e-29  1.410574e-29 1.0000000
## depth:groupCB-VJJ 3.117835e-152 1.558917e-152 1.0000000
## depth:groupVJJ    1.698601e-140 8.493007e-141 1.0000000
## depth:groupVJJ-CB 9.833962e-148 4.916981e-148 1.0000000

Comparing different groups

Many coefficients already have an interpretation that compares the groups. For example, the coefficient "groupVJJ" estimates \(\mathsf{E}[N|D=0,G=\text{"VJJ"}] - \mathsf{E}[N|D=0,G=\text{"CB"}]\), the expected difference in nitrogen concentration between groups VJJ and CB at zero depth.

What if we would like to compare different groups at zero depth?

A simple solution would be to change the reference category:

peat$group2 <- relevel(peat$group, ref = "VJJ")
summary(peat$group2) # VJJ became the reference category
##    VJJ     CB CB-VJJ VJJ-CB 
##     33     35     40     40
summary(lm(N ~ depth * group2, data = peat))
## 
## Call:
## lm(formula = N ~ depth * group2, data = peat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36829 -0.08469 -0.00789  0.06112  0.37713 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.782915   0.055322  14.152  < 2e-16 ***
## depth               0.048991   0.006027   8.128 2.08e-13 ***
## group2CB            0.064041   0.075374   0.850   0.3970    
## group2CB-VJJ        0.386154   0.067744   5.700 6.81e-08 ***
## group2VJJ-CB        0.139947   0.067744   2.066   0.0407 *  
## depth:group2CB     -0.043942   0.008312  -5.287 4.68e-07 ***
## depth:group2CB-VJJ -0.076543   0.007627 -10.036  < 2e-16 ***
## depth:group2VJJ-CB -0.002351   0.007627  -0.308   0.7583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1354 on 140 degrees of freedom
## Multiple R-squared:  0.7306, Adjusted R-squared:  0.7171 
## F-statistic: 54.23 on 7 and 140 DF,  p-value: < 2.2e-16

Now the coefficients estimate the differences from the category VJJ. However, this required refitting the model and changed the interpretation of all other coefficients, which is inadvisable if many different comparisons are required.

Let’s go back to fit3 with group CB as the reference category. Then, the difference between VJJ and VJJ-CB at zero depth can be expressed as \[ \mathsf{E}[N|D=0,G=\text{"VJJ"}] - \mathsf{E}[N|D=0,G=\text{"VJJ-CB"}] = (\beta_1 + \beta_4) - (\beta_1 + \beta_5) = \beta_4 - \beta_5. \] Now we just need to realize that this is simply a linear combination of the coefficients: \[ \beta_4 - \beta_5 = (0,0,0,1,-1,0,0,0)^\top \boldsymbol{\beta}, \] which suggests the special choice \(\mathbf{c} = (0,0,0,1,-1,0,0,0)^\top\). The hypothesis \(\mathrm{H}_0: \beta_4 = \beta_5\) can then be alternatively written as \[ \mathrm{H}_0: \theta := \mathbf{c}^\top \boldsymbol{\beta} = \beta_4 - \beta_5 = 0. \]

How do we do that with R?

comb <- c(0,0,0,1,-1,0,0,0)
theta0 <- 0
est.theta <- comb %*% coef(fit3)
var.theta <- t(comb) %*% vcov(fit3) %*% comb
se.theta <- sqrt(diag(var.theta))
t.theta <- (est.theta - theta0)/se.theta

# p-values
c("two-sided" = 2*pt(-abs(t.theta), df = fit3$df.residual),
  "less"      = pt(t.theta, df = fit3$df.residual, lower.tail = T),
  "greater"   = pt(t.theta, df = fit3$df.residual, lower.tail = F))
##  two-sided       less    greater 
## 0.04068962 0.02034481 0.97965519
# confidence interval
c("lower" = est.theta - qt(0.975, df = fit3$df.residual) * se.theta,
  "upper" = est.theta + qt(0.975, df = fit3$df.residual) * se.theta)
##        lower        upper 
## -0.273880280 -0.006013115

Alternatively, we can use the function LSest from the mffSM package, which can handle even multiple linear combinations at the same time:

# the same test as before
LSest(fit3, comb, alternative = "two.sided", conf.level = 0.95, theta0 = 0)
##     Estimate Std. Error   t value P value      Lower        Upper
## 1 -0.1399467 0.06774403 -2.065816 0.04069 -0.2738803 -0.006013115
# multiple tests at once - fill each row of the matrix L with one linear combination
L <- matrix(c(0,0,1,0,-1,0,0,0,
              0,0,0,1,-1,0,0,0,
              0,0,1,-1,0,0,0,0), nrow = 3, byrow = T)
LSest(fit3, L = L)
##     Estimate Std. Error   t value    P value      Lower        Upper
## 1  0.2462072 0.05529507  4.452606 1.7202e-05  0.1368858  0.355528506
## 2 -0.1399467 0.06774403 -2.065816    0.04069 -0.2738803 -0.006013115
## 3  0.3861539 0.06774403  5.700191 6.8138e-08  0.2522203  0.520087456

However, keep in mind that these p-values are not corrected for multiple testing.
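If a family-wise error rate is desired, the raw p-values can be adjusted afterwards, e.g. with p.adjust(). A minimal sketch, reusing the three p-values printed above:

```r
pvals <- c(1.7202e-05, 0.04069, 6.8138e-08)   # raw p-values from the LSest output above
p.adjust(pvals, method = "bonferroni")        # simply 3 * p, capped at 1
p.adjust(pvals, method = "holm")              # never larger than Bonferroni
```

The Holm procedure controls the same family-wise error rate as Bonferroni while being uniformly more powerful.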

Predicting for specific covariate values

The conditional expectation of the response is modeled as a linear combination of covariates. Therefore, its prediction for specific covariate values could be viewed as a specific linear combination of the estimated coefficients: \[ \theta = \mathsf{E}\, \left[Y_{\mathrm{new}} \mid \mathbf{X}_{\mathrm{new}} = \mathbf{x}_{\mathrm{new}} \right] = \mathbf{x}_{\mathrm{new}}^\top \boldsymbol{\beta} \quad\Longrightarrow\quad \mathbf{c} = \mathbf{x}_{\mathrm{new}}. \] Assume we wish to predict the expected Nitrogen concentration:

  • in group CB-VJJ at depth 3,
  • in group VJJ at depth 7.

Using the LSest function requires us to define the linear combinations by hand, where one can easily make a mistake:

coef(fit3)
##       (Intercept)             depth       groupCB-VJJ          groupVJJ       groupVJJ-CB depth:groupCB-VJJ 
##       0.846955840       0.005048449       0.322113110      -0.064040764       0.075905933      -0.032600370 
##    depth:groupVJJ depth:groupVJJ-CB 
##       0.043942252       0.041591216
L <- matrix(c(1, 3, 1, 0, 0, 3, 0, 0,
              1, 7, 0, 1, 0, 0, 7, 0), nrow = 2, byrow = T)
LSest(fit3, L, theta0 = 1)
##   Estimate Std. Error  t value    P value    Lower    Upper
## 1 1.086413 0.02842649 3.039882  0.0028253 1.030212 1.142614
## 2 1.125850 0.02485157 5.064065 1.2716e-06 1.076717 1.174983

It is much more convenient to use the predict function in this situation; we only need to supply a data.frame containing all the needed covariates:

newdata <- data.frame(depth = c(3,7), group = c("CB-VJJ", "VJJ"))
predict(fit3, newdata = newdata, level = 0.95, interval = "confidence")
##        fit      lwr      upr
## 1 1.086413 1.030212 1.142614
## 2 1.125850 1.076717 1.174983

However, it only provides the point estimator and the confidence interval, but no p-value. The function is not designed to perform statistical testing, only predictions.
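If a p-value for \(\mathrm{H}_0: \mathbf{x}_{\mathrm{new}}^\top\boldsymbol{\beta} = \theta_0\) is still wanted, it can be assembled from the se.fit component returned by predict. A self-contained sketch with simulated toy data (the choice theta0 = 1 is arbitrary):

```r
set.seed(3)
d <- data.frame(x = rnorm(40))
d$y <- 1 + 2 * d$x + rnorm(40)                 # simulated toy data
fit <- lm(y ~ x, data = d)

pr <- predict(fit, newdata = data.frame(x = 1), se.fit = TRUE)
theta0 <- 1
tval <- (pr$fit - theta0) / pr$se.fit          # the usual t-statistic
pval <- 2 * pt(-abs(tval), df = pr$df)         # two-sided p-value

# the same statistic via the linear-combination approach with c = x_new = (1, 1)
comb <- c(1, 1)
tval2 <- (sum(comb * coef(fit)) - theta0) /
         sqrt(t(comb) %*% vcov(fit) %*% comb)
all.equal(unname(tval), drop(tval2))  # TRUE
```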

Comparing expectations for two different settings

Our task is to perform statistical test of hypothesis \[ \mathrm{H}_0: \mathsf{E} \left[Y \mid \mathbf{X} = \mathbf{x}_1 \right] = \mathsf{E} \left[Y \mid \mathbf{X} = \mathbf{x}_2 \right], \] where \(\mathbf{x}_1\) and \(\mathbf{x}_2\) are different covariate values. The null hypothesis could be rewritten as \[ \mathrm{H}_0: \mathsf{E} \left[Y \mid \mathbf{X} = \mathbf{x}_1 \right] - \mathsf{E} \left[Y \mid \mathbf{X} = \mathbf{x}_2 \right] = \mathbf{x}_1^\top \boldsymbol{\beta} - \mathbf{x}_2^\top \boldsymbol{\beta} = \left(\mathbf{x}_1- \mathbf{x}_2\right)^\top \boldsymbol{\beta} \qquad = 0. \] Then, all we need to do is to set \(\mathbf{c} = \mathbf{x}_1- \mathbf{x}_2\).

The test can be again performed by LSest or by manual calculations. However, we need to carefully compute the vector determining the linear combinations:

# Manually, where an error can easily occur...
L <- matrix(c(1, 3, 1, 0, 0, 3, 0, 0,
              1, 7, 0, 1, 0, 0, 7, 0), nrow = 2, byrow = T)
(comb <- L[1,] - L[2,])
## [1]  0 -4  1 -1  0  3 -7  0
# Clever use of model.matrix()
newdata <- data.frame(depth = c(3,7), group = c("CB-VJJ", "VJJ"), 
                      N = c(0,0)) # requires some arbitrary response values, otherwise error
L <- model.matrix(fit3, data = newdata)
(comb <- L[1,] - L[2,])
##       (Intercept)             depth       groupCB-VJJ          groupVJJ       groupVJJ-CB depth:groupCB-VJJ 
##                 0                -4                 1                -1                 0                 3 
##    depth:groupVJJ depth:groupVJJ-CB 
##                -7                 0
LSest(fit3, comb)
##      Estimate Std. Error   t value P value      Lower      Upper
## 1 -0.03943679 0.03775799 -1.044462 0.29807 -0.1140864 0.03521279

Prediction intervals

Now, we are interested in predicting a new observation \(Y_{\mathrm{new}} = \mathbf{x}_{\mathrm{new}}^\top \boldsymbol{\beta} + \varepsilon_\mathrm{new}\) given a set of new covariates \(\mathbf{X}_{\mathrm{new}} = \mathbf{x}_{\mathrm{new}}\), i.e., the actual value of \(Y_{\mathrm{new}}\), not just \(\mathsf{E}\, \left[Y_{\mathrm{new}} \mid \mathbf{X}_{\mathrm{new}} = \mathbf{x}_{\mathrm{new}}\right] = \mathbf{x}_{\mathrm{new}}^\top \boldsymbol{\beta}\). The point estimator remains unchanged since the expected value of \(\varepsilon_{\mathrm{new}}\) is zero. A substantial change, however, occurs in the width of the prediction intervals compared to the confidence intervals.

Since the training data \(\mathbf{Y}, \mathbb{X}\) and \(\varepsilon_{\mathrm{new}}\) are assumed to be independent, we can combine the two random variables together and obtain: \[ \left. \mathbf{x}_{\mathrm{new}}^\top \widehat{\boldsymbol{\beta}} - \varepsilon_{\mathrm{new}} \,\right|\, \mathbb{X} \sim \mathsf{N}\left( \mathbf{x}_{\mathrm{new}}^\top\boldsymbol{\beta} - 0,\quad \sigma^2 \mathbf{x}_{\mathrm{new}}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbf{x}_{\mathrm{new}} \;+\; \sigma^2 \right) \] and consequently \[ \left. \frac{ \mathbf{x}_{\mathrm{new}}^\top \widehat{\boldsymbol{\beta}} - Y_{\mathrm{new}} }{ \sqrt{ \widehat{\sigma}^2 \left(1 + \mathbf{x}_{\mathrm{new}}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbf{x}_{\mathrm{new}}\right)} } = \frac{ \mathbf{x}_{\mathrm{new}}^\top \widehat{\boldsymbol{\beta}} - \varepsilon_{\mathrm{new}} - \mathbf{x}_\mathrm{new}^\top\boldsymbol{\beta} }{ \sqrt{ \frac{\mathsf{RSS}}{n-p} \frac{\sigma^2}{\sigma^2} \left(1 + \mathbf{x}_{\mathrm{new}}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbf{x}_{\mathrm{new}}\right)} } \,\right|\, \mathbb{X} \sim \mathsf{t}_{n-p}. \] This result is used to construct the prediction intervals: \[ \mathsf{P} \left( \mathbf{x}_{\mathrm{new}}^\top \widehat{\boldsymbol{\beta}} \pm \mathsf{t}_{n-p}\left(1-\frac{\alpha}{2}\right)\sqrt{\widehat{\sigma}^2\left[1 + \mathbf{x}_{\mathrm{new}}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbf{x}_{\mathrm{new}}\right]} \;\ni Y_{\mathrm{new}} \right) = 1-\alpha. \] The prediction intervals are provided by the predict function by setting the interval argument to "prediction":
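The formula above can be verified against predict by hand: the prediction standard error combines residual.scale (\(\widehat{\sigma}\)) with se.fit. A minimal self-contained sketch with simulated toy data:

```r
set.seed(4)
x <- runif(25)
y <- 2 + 3 * x + rnorm(25, sd = 0.3)  # simulated toy data
fit <- lm(y ~ x)
nd <- data.frame(x = 0.5)

pr <- predict(fit, nd, interval = "prediction", level = 0.95)
cf <- predict(fit, nd, se.fit = TRUE)
half <- qt(0.975, fit$df.residual) *
        sqrt(cf$residual.scale^2 + cf$se.fit^2)  # sigma^2-hat * (1 + x'(X'X)^{-1}x)
manual <- c(cf$fit - half, cf$fit + half)

all.equal(unname(pr[1, c("lwr", "upr")]), unname(manual))  # TRUE
```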

newdata <- data.frame(depth = c(3,7), group = c("CB-VJJ", "VJJ"))
predict(fit3, newdata = newdata, level = 0.95, interval = "confidence")
##        fit      lwr      upr
## 1 1.086413 1.030212 1.142614
## 2 1.125850 1.076717 1.174983
predict(fit3, newdata = newdata, level = 0.95, interval = "prediction")
##        fit       lwr      upr
## 1 1.086413 0.8127977 1.360029
## 2 1.125850 0.8535983 1.398102

The prediction intervals can be computed for a wide range of \(\mathbf{x}_{\mathrm{new}}\) values. The resulting prediction bands should cover \(100(1-\alpha)\) % of the observations. First, let us visualize this for one of the groups:

g <- "CB-VJJ"
xgrid <- seq(0, 14, length.out = 101)
conf_g <- predict(fit3, newdata = data.frame(group = g, depth = xgrid),
                  interval = "confidence", level = 0.95)
pred_g <- predict(fit3, newdata = data.frame(group = g, depth = xgrid),
                  interval = "prediction", level = 0.95)

par(mfrow = c(1,1), mar = c(4,4,2,1))
plot(N ~ depth, data = peat[peat$group == g,],
     main = paste("Group", g),
     pch = PCH[g], col = COL[g], bg = BGC[g], 
     xlab = XLAB, ylab = YLAB, xlim = XLIM, ylim = YLIM) 
lines(xgrid, pred_g[,"fit"], col = COL[g], lwd = 2, lty = 1)
lines(xgrid, pred_g[,"lwr"], col = DARK[g], lwd = 2, lty = 2)
lines(xgrid, pred_g[,"upr"], col = DARK[g], lwd = 2, lty = 2)
lines(xgrid, conf_g[,"lwr"], col = "blue", lwd = 2, lty = 3)
lines(xgrid, conf_g[,"upr"], col = "blue", lwd = 2, lty = 3)
# Where is the band narrowest? 
abline(v = mean(peat$depth[peat$group == g]), col = "grey")
abline(h = mean(peat$N[peat$group == g]), col = "grey")

legend("topright", legend = c("Point estimator", "Prediction CI", "Confidence CI"),
       col = c(COL[g], DARK[g], "blue"), lwd = 2, lty = 1:3, bty = "n", ncol = 3)

And now with all groups together:

xgrid <- seq(0, 14, length.out = 101)
par(mfrow = c(1,1), mar = c(4,4,2,1))
plot(N ~ depth, data = peat,
     main = "Prediction bands for all groups",
     pch = PCH[group], col = COL[group], bg = BGC[group], 
     xlab = XLAB, ylab = YLAB, xlim = XLIM, ylim = YLIM) 
for(g in levels(peat$group)){
  conf_g <- predict(fit3, newdata = data.frame(group = g, depth = xgrid),
                    interval = "confidence", level = 0.95)
  pred_g <- predict(fit3, newdata = data.frame(group = g, depth = xgrid),
                    interval = "prediction", level = 0.95)
  
  polygon(c(xgrid, rev(xgrid)), c(pred_g[,"lwr"], rev(pred_g[,"upr"])),
          border = DARK[g], col = TRC[g], lty = 2)
  lines(xgrid, pred_g[,"fit"], col = COL[g], lwd = 2, lty = 1)
  
}

legend("topleft", legend = levels(peat$group),
       col = COL, lwd = 2, lty = 2, bty = "n", ncol = 2)

Simultaneous confidence region

Let us recall the distribution of \(\widehat{\boldsymbol{\beta}}\) under the normality assumption in the full-rank model: \[ \left. \dfrac{\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}}{\sqrt{\sigma^2}} \,\right|\, \mathbb{X} \sim \mathsf{N}_p\left(\mathbf{0}_p, \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\right). \] Since the variance matrix is symmetric and invertible, we can apply Cholesky decomposition to get \(\left(\mathbb{X}^\top \mathbb{X}\right)^{-1} = \left(\mathbb{X}^\top \mathbb{X}\right)^{-\frac{1}{2}} \left(\left(\mathbb{X}^\top \mathbb{X}\right)^{-\frac{1}{2}}\right)^\top\) and find the inverse \(\left(\mathbb{X}^\top \mathbb{X}\right)^{\frac{1}{2}}\). Then, \[ \left. \dfrac{1}{\sqrt{\sigma^2}}\left(\mathbb{X}^\top \mathbb{X}\right)^{\frac{1}{2}} \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \,\right|\, \mathbb{X} \sim \mathsf{N}_p\left(\mathbf{0}_p, \left(\mathbb{X}^\top \mathbb{X}\right)^{\frac{1}{2}} \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \left(\left(\mathbb{X}^\top \mathbb{X}\right)^{\frac{1}{2}}\right)^\top\right) = \mathsf{N}_p\left(\mathbf{0}_p, \mathbb{I}_p\right) . \] Therefore, summing its squares yields a \(\chi^2\) distribution: \[ \left. \frac{1}{\sigma^2} \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^\top \left(\mathbb{X}^\top \mathbb{X}\right) \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \,\right|\, \mathbb{X} \sim \chi_p^2, \] which is still independent of \(\mathsf{RSS}\). Hence, \[ \left.
\dfrac{ \dfrac{ \frac{1}{\sigma^2} \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^\top \left(\mathbb{X}^\top \mathbb{X}\right) \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)\sim \chi_p^2 }{ p } }{ \dfrac{ \frac{1}{\sigma^2}\mathsf{RSS} \sim \chi_{n-p}^2 }{ (n-p) } } = \frac{1}{p} \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^\top \left[\frac{1}{\widehat{\sigma}^2} \mathbb{X}^\top \mathbb{X}\right] \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \,\right|\, \mathbb{X} \quad \sim \quad \mathsf{F}_{p,n-p}, \] which can be used for constructing confidence regions for simultaneous inference: \[ \mathsf{CR}(1-\alpha) = \left\{ \boldsymbol{\beta} \in \mathbb{R}^p: \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right)^\top \left[\frac{1}{\widehat{\sigma}^2} \mathbb{X}^\top \mathbb{X}\right] \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\right) \leq p \,\mathsf{F}_{p,n-p}(1-\alpha) \right\}, \] which satisfies \(\mathsf{P}\left(\mathsf{CR}(1-\alpha) \ni \boldsymbol{\beta}_0\right) = 1-\alpha\) for the true value \(\boldsymbol{\beta}_0\). Clearly, \(\mathsf{CR}(1-\alpha)\) is an ellipsoid centered at \(\widehat{\boldsymbol{\beta}}\), with axes determined by \(\mathbb{X}^\top \mathbb{X}\) and overall size driven by \(\widehat{\sigma}^2\) and the quantile \(\mathsf{F}_{p,n-p}(1-\alpha)\).
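Whether a given candidate \(\boldsymbol{\beta}_0\) lies in \(\mathsf{CR}(1-\alpha)\) can be checked directly from the quadratic form. A minimal self-contained sketch with simulated toy data; the candidate beta0 chosen here is the true parameter vector of the simulation:

```r
set.seed(5)
x <- rnorm(60)
y <- 1 + 2 * x + rnorm(60)                  # simulated toy data, true beta = (1, 2)
fit <- lm(y ~ x)

beta0 <- c(1, 2)                            # candidate value to be tested
X  <- model.matrix(fit)
s2 <- sum(resid(fit)^2) / fit$df.residual   # sigma^2-hat
d  <- coef(fit) - beta0
Q  <- drop(t(d) %*% (t(X) %*% X) %*% d) / s2   # the quadratic form above

p <- fit$rank
Q <= p * qf(0.95, df1 = p, df2 = fit$df.residual)  # TRUE iff beta0 lies in CR(0.95)
```

Note that Q / p is exactly the F-statistic for testing \(\mathrm{H}_0: \boldsymbol{\beta} = \boldsymbol{\beta}_0\), so the confidence region consists of all values not rejected by this test.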

In R we can use the confidenceEllipse function (see ?ellipse for other usages) from the package car, which automatically extracts the required parameters and plots the ellipse. However, only a 2D ellipse can be plotted, so only two coefficients can be chosen at a time. Let’s compare the confidence regions (blue) with the rectangles formed by the marginal confidence intervals (red).

CIs <- confint(fit3)
par(mfrow = c(2,2), mar = c(4,4,0.5,0.5))

confidenceEllipse(fit3, which.coef = 1:2)
abline(v = CIs[1,], col = "red", lty = 2)
abline(h = CIs[2,], col = "red", lty = 2)

confidenceEllipse(fit3, which.coef = 3:4)
abline(v = CIs[3,], col = "red", lty = 2)
abline(h = CIs[4,], col = "red", lty = 2)

confidenceEllipse(fit3, which.coef = 5:6)
abline(v = CIs[5,], col = "red", lty = 2)
abline(h = CIs[6,], col = "red", lty = 2)

confidenceEllipse(fit3, which.coef = 7:8)
abline(v = CIs[7,], col = "red", lty = 2)
abline(h = CIs[8,], col = "red", lty = 2)

Confidence band FOR the regression function

The result from the previous part can be extended even further. We know that \[ \mathsf{P}\left[ \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\right)^\top \left[\frac{1}{\widehat{\sigma}^2} \mathbb{X}^\top \mathbb{X}\right] \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\right) \leq p \,\mathsf{F}_{p,n-p}(1-\alpha) \right] = 1-\alpha \] if \(\boldsymbol{\beta}_0\) is the true value of the parameter \(\boldsymbol{\beta}\). We wish to extend this result to all linear combinations \(\mathbf{x}^\top\boldsymbol{\beta}\), \(\mathbf{x} \in \mathbb{R}^p\), simultaneously. For \(\mathbf{x} = \mathbf{0}\) the corresponding inequality holds trivially. For \(\mathbf{x} \neq \mathbf{0}\), the Cauchy–Schwarz inequality applied to the vectors \(\left(\mathbb{X}^\top \mathbb{X}\right)^{\frac{1}{2}} \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\right)\) and \(\left(\mathbb{X}^\top \mathbb{X}\right)^{-\frac{1}{2}} \mathbf{x}\) (with a symmetric square root) gives \[ \left[ \mathbf{x}^\top\widehat{\boldsymbol{\beta}} - \mathbf{x}^\top\boldsymbol{\beta}_0 \right]^2 \leq \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\right)^\top \mathbb{X}^\top \mathbb{X} \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\right) \cdot \mathbf{x}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{x}, \] with equality when \(\mathbf{x}\) is proportional to \(\mathbb{X}^\top \mathbb{X}\left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\right)\). Since the bound is attained, \[ \sup_{\mathbf{x} \neq \mathbf{0}} \frac{\left[ \mathbf{x}^\top\widehat{\boldsymbol{\beta}} - \mathbf{x}^\top\boldsymbol{\beta}_0 \right]^2}{\mathbf{x}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{x}} = \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\right)^\top \mathbb{X}^\top \mathbb{X} \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\right), \] so the ellipsoid event above coincides exactly with the event that the pointwise inequality holds for every \(\mathbf{x}\). Therefore, \[\begin{align*} \mathsf{P}\left[ \forall \mathbf{x}\in\mathbb{R}^p: \left[ \mathbf{x}^\top\widehat{\boldsymbol{\beta}} - \mathbf{x}^\top\boldsymbol{\beta}_0 \right]^2 \leq p \,\mathsf{F}_{p,n-p}(1-\alpha) \, \widehat{\sigma}^2 \mathbf{x}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{x} \right] &= 1-\alpha \\ \mathsf{P}\left[ \forall \mathbf{x}\in\mathbb{R}^p: \left| \mathbf{x}^\top\widehat{\boldsymbol{\beta}} - \mathbf{x}^\top\boldsymbol{\beta}_0 \right| \leq \sqrt{p \,\mathsf{F}_{p,n-p}(1-\alpha) \, \widehat{\sigma}^2 \mathbf{x}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{x}} \right] &= 1-\alpha \\ \mathsf{P}\left[ \forall \mathbf{x}\in\mathbb{R}^p: \mathbf{x}^\top\widehat{\boldsymbol{\beta}} \pm \sqrt{p \,\mathsf{F}_{p,n-p}(1-\alpha) \, \widehat{\sigma}^2 \mathbf{x}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{x}} \quad \ni \mathbf{x}^\top\boldsymbol{\beta}_0 \right] &= 1-\alpha. \end{align*}\] This will be called the confidence band FOR the regression function.

On the other hand, we have the confidence band AROUND the regression function, which is based on the pointwise coverage provided by the result for any \(\theta = \mathbf{c}^\top \boldsymbol{\beta}\): \[ \mathsf{P}\left[ \mathbf{c}^\top\widehat{\boldsymbol{\beta}} \pm \mathsf{t}_{n-p}\left(1-\frac{\alpha}{2}\right) \, \sqrt{\widehat{\sigma}^2 \mathbf{c}^\top \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}\mathbf{c}} \quad \ni \mathbf{c}^\top\boldsymbol{\beta}_0 \right] = 1-\alpha. \] Since \(\mathsf{t}_{n-p}^2 = \mathsf{F}_{1, n-p}\), the difference lies in the quantiles used: \[ \mathrm{AROUND}: \mathsf{t}_{n-p}\left(1-\frac{\alpha}{2}\right) =\sqrt{\mathsf{F}_{1, n-p}(1-\alpha)} \qquad \text{vs.} \qquad \mathrm{FOR}: \sqrt{p\,\mathsf{F}_{p, n-p}(1-\alpha)}, \] where the second one is larger whenever \(p > 1\). In our case we have

(qaround <- qt(0.975, df = fit3$df.residual))
## [1] 1.977054
sqrt(qf(0.95, df1 = 1, df2 = fit3$df.residual))
## [1] 1.977054
(qfor <- sqrt(fit3$rank * qf(0.95, df1 = fit3$rank, df2 = fit3$df.residual)))
## [1] 4.005126

Let us compute the bands for one of the groups first:

g <- "CB-VJJ"
xgrid <- seq(0, 14, length.out = 101)
band_around_g <- predict(fit3, newdata = data.frame(group = g, depth = xgrid),
                         interval = "confidence", level = 0.95, se.fit = TRUE)
pred_g <- predict(fit3, newdata = data.frame(group = g, depth = xgrid),
                  interval = "prediction", level = 0.95)
# Manual computations: confidence band FOR regression function (not provided by predict!)
band_for_g <- data.frame(fit = band_around_g$fit[,"fit"],
                         lwr = band_around_g$fit[,"fit"] - qfor * band_around_g$se.fit,
                         upr = band_around_g$fit[,"fit"] + qfor * band_around_g$se.fit)

# Plot the bands:
par(mfrow = c(1,1), mar = c(4,4,2,1))
plot(N ~ depth, data = peat[peat$group == g,],
     main = paste("Group", g),
     pch = PCH[g], col = COL[g], bg = BGC[g], 
     xlab = XLAB, ylab = YLAB, xlim = XLIM, ylim = YLIM) 
lines(xgrid, pred_g[,"fit"], col = COL[g], lwd = 2, lty = 1)
lines(xgrid, pred_g[,"lwr"], col = DARK[g], lwd = 2, lty = 2)
lines(xgrid, pred_g[,"upr"], col = DARK[g], lwd = 2, lty = 2)
lines(xgrid, band_around_g$fit[,"lwr"], col = "blue", lwd = 2, lty = 3)
lines(xgrid, band_around_g$fit[,"upr"], col = "blue", lwd = 2, lty = 3)
lines(xgrid, band_for_g[,"lwr"], col = "red", lwd = 2, lty = 4)
lines(xgrid, band_for_g[,"upr"], col = "red", lwd = 2, lty = 4)
# Where is the band narrowest? 
abline(v = mean(peat$depth[peat$group == g]), col = "grey")
abline(h = mean(peat$N[peat$group == g]), col = "grey")

legend("topright", legend = c("Point estimator", "Prediction band", "Band AROUND", "Band FOR"),
       col = c(COL[g], DARK[g], "blue", "red"), lwd = 2, lty = 1:4, bty = "n", ncol = 2)

However, the band FOR the regression function holds simultaneously for all groups. The same model gives us several regression lines, each with its own set of bands:

xgrid <- seq(0, 14, length.out = 101)
band_around <- pred <- band_for <- list()
par(mfrow = c(2,2), mar = c(4,4,2,1))
for(g in levels(peat$group)){
  band_around[[g]] <- predict(fit3, newdata = data.frame(group = g, depth = xgrid),
                              interval = "confidence", level = 0.95, se.fit = TRUE)
  pred[[g]] <- predict(fit3, newdata = data.frame(group = g, depth = xgrid),
                       interval = "prediction", level = 0.95)
  # Manual computations: confidence band FOR regression function (not provided by predict!)
  band_for[[g]] <- data.frame(fit = band_around[[g]]$fit[,"fit"],
                              lwr = band_around[[g]]$fit[,"fit"] - qfor * band_around[[g]]$se.fit,
                              upr = band_around[[g]]$fit[,"fit"] + qfor * band_around[[g]]$se.fit)
  
  # Plot the bands:
  plot(N ~ depth, data = peat[peat$group == g,],
       main = paste("Group", g),
       pch = PCH[g], col = COL[g], bg = BGC[g], 
       xlab = XLAB, ylab = YLAB, xlim = XLIM, ylim = YLIM) 
  lines(xgrid, pred[[g]][,"fit"], col = COL[g], lwd = 2, lty = 1)
  lines(xgrid, pred[[g]][,"lwr"], col = DARK[g], lwd = 2, lty = 2)
  lines(xgrid, pred[[g]][,"upr"], col = DARK[g], lwd = 2, lty = 2)
  lines(xgrid, band_around[[g]]$fit[,"lwr"], col = "blue", lwd = 2, lty = 3)
  lines(xgrid, band_around[[g]]$fit[,"upr"], col = "blue", lwd = 2, lty = 3)
  lines(xgrid, band_for[[g]][,"lwr"], col = "red", lwd = 2, lty = 4)
  lines(xgrid, band_for[[g]][,"upr"], col = "red", lwd = 2, lty = 4)
}