Matrix algebra background of linear regression
Everything in this tutorial will be illustrated using pseudodata that we now create.
y <- c(-9, -11, 1, 19)
x <- c(-3, -1, 1, 3)
x2 <- c(9, 1, 1, 9)
x2 <- x^2 ### the same
X <- cbind(1, x, x2)
colnames(X) <- c("intcpt", "x", "x^2")
print(X)
## intcpt x x^2
## [1,] 1 -3 9
## [2,] 1 -1 1
## [3,] 1 1 1
## [4,] 1 3 9
Data <- data.frame(y = y, x = x, x2 = x2)
print(Data)
## y x x2
## 1 -9 -3 9
## 2 -11 -1 1
## 3 1 1 1
## 4 19 3 9
We will be fitting the following linear model: \[ \mathsf{E}Y = \beta_0 + \beta_1\,x + \beta_2\,x^2. \]
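For later comparison (a quick cross-check added here, not needed for the matrix derivations that follow), the same model can also be fitted by the built-in lm function; its coefficients should coincide with the solution \(\widehat{\beta}\) computed by hand below.
fit0 <- lm(y ~ x + x2, data = Data) ## reference fit of the quadratic model
coef(fit0) ## should agree with the hand-computed solution b below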
First, we calculate the solution to the normal equations directly from the theoretically derived expression, i.e., as \(\widehat{\beta} = \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}\mathbf{X}^\top\mathbf{Y}\).
XtX <- t(X) %*% X
XtX <- crossprod(X) ## the same
print(XtX)
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
invXtX <- solve(XtX)
print(invXtX)
## intcpt x x^2
## intcpt 0.640625 0.00 -0.078125
## x 0.000000 0.05 0.000000
## x^2 -0.078125 0.00 0.015625
b <- invXtX %*% t(X) %*% y
print(b)
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
solve(XtX, t(X) %*% y)
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
solve(crossprod(X), crossprod(X, y)) ## the same
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
Calculating the solution to the normal equations via the above “raw” approach is not numerically optimal. In practice, numerically more stable and much better methods are used, which are based on decompositions of the model matrix \(\mathbf{X}\) known from linear algebra. An additional advantage of these methods is that once the decomposition of the model matrix \(\mathbf{X}\) is available, calculation of everything else (fitted values, residuals, solution to normal equations, …) is trivial and computationally fast. This is particularly useful when least squares solutions are needed for a series of “response vectors” \(\mathbf{Y}\) while the model matrix \(\mathbf{X}\) is kept fixed.
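As a small illustration of the last point (the second response vector y2 below is hypothetical, made up just for this example), the normal equations can be solved for several response vectors in a single call; the decompositions discussed next make such repeated solves even cheaper.
y2 <- c(-8, -12, 2, 18) ## hypothetical second response observed on the same design
Ymat <- cbind(y, y2) ## response vectors stored in the columns
solve(crossprod(X), crossprod(X, Ymat)) ## LS coefficients for each column of Ymat at once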
The most often used decompositions are the QR decomposition (used, among others, inside the lm function in R) and the singular value decomposition (SVD). In the following, these two decompositions will be explored in more detail. Nevertheless, we first explore two other decompositions of the symmetric positive semidefinite matrix \(\mathbf{X}^\top\mathbf{X}\) that can also be used to solve the normal equations \(\mathbf{X}^\top\mathbf{X}\,\mathbf{b}= \mathbf{X}^\top\mathbf{Y}\) and possibly for some other tasks. These will be the spectral (eigenvalue) decomposition and the Cholesky decomposition.
Both the spectral decomposition and the Cholesky decomposition should be known from the bachelor courses Linear algebra and Fundamentals of Numerical Mathematics, the Cholesky decomposition perhaps in the more general form of the LU decomposition.
The spectral (eigenvalue) decomposition of the matrix \(\mathbf{X}^\top\mathbf{X}\) takes the form \[ \mathbf{X}^\top\mathbf{X} = \mathbf{V}\,\Lambda\,\mathbf{V}^\top = \sum_{j=1}^p \lambda_j\,\mathbf{v}_j\,\mathbf{v}_j^\top, \qquad \lambda_j \geq 0,\;\;j=1,\ldots,p, \] where \(\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)\) contains the eigenvalues and the columns \(\mathbf{v}_1,\ldots,\mathbf{v}_p\) of \(\mathbf{V}\) are the corresponding orthonormal eigenvectors.
Additionally, the following holds: \[ \mathbf{V}\,\mathbf{V}^\top = \sum_{j=1}^p \mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{I}_p, \] \[ \mathbf{V}^\top\,\mathbf{V} = \bigl(\mathbf{v}_j^\top\,\mathbf{v}_l\bigr) = \mathbf{I}_p. \] That is, \[ \mathbf{V}^{-1} = \mathbf{V}^\top. \] If \(\lambda_j > 0\) for all \(j=1,\ldots,p\), then \[ \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1} = \sum_{j=1}^p \lambda_j^{-1}\,\mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{V}\,\Lambda^{-1}\,\mathbf{V}^\top. \]
eigen(XtX)
## $values
## [1] 166.462113 20.000000 1.537887
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.1221833 0 0.9925076
## [2,] 0.0000000 -1 0.0000000
## [3,] -0.9925076 0 -0.1221833
Calculation using the eigen function:
e <- eigen(XtX)
names(e)
## [1] "values" "vectors"
V <- e$vectors
Lambda <- diag(e$values)
V %*% Lambda %*% t(V)
## [,1] [,2] [,3]
## [1,] 4 0 20
## [2,] 0 20 0
## [3,] 20 0 164
round(V %*% t(V), 13) ## only numerically identity matrix
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(t(V) %*% V, 13) ## only numerically identity matrix
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(XtX %*% V - V %*% Lambda, 13) ## X'X * V = V * Lambda
## [,1] [,2] [,3]
## intcpt 0 0 0
## x 0 0 0
## x^2 0 0 0
invLambda <- diag(1 / e$values)
V %*% invLambda %*% t(V)
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
round(V %*% invLambda %*% t(V) - invXtX, 13) ## V * Lambda^(-1) * V' = (X'X)^(-1)
## intcpt x x^2
## intcpt 0 0 0
## x 0 0 0
## x^2 0 0 0
The Cholesky decomposition of \(\mathbf{X}^\top\mathbf{X}\) takes the form \[ \mathbf{X}^\top\mathbf{X} = \mathbf{L}\,\mathbf{L}^\top, \] where \(\mathbf{L}\) is a lower triangular matrix.
This is a special case of the so-called LU decomposition, where for a symmetric positive semidefinite matrix \(\mathbf{U} = \mathbf{L}^\top\) (or \(\mathbf{L} = \mathbf{U}^\top\)).
In statistics, symmetric positive semidefinite matrices are encountered quite often since all covariance matrices are of this type. The Cholesky decomposition is in fact a generalization of the square root to matrices (as will be shown shortly). The Cholesky decomposition of a covariance matrix is then in fact a multivariate analogue of the standard deviation. Having it available therefore simplifies many problems.
If \(\mathbf{D}\) is a symmetric positive semidefinite matrix (e.g., a covariance matrix or the matrix \(\mathbf{X}^\top\mathbf{X}\) in a regression context), then it can be decomposed as \(\mathbf{D} = \mathbf{L}\,\mathbf{L}^\top\), where \(\mathbf{L}\) is a lower triangular matrix. Among other things, the following tasks become practically trivial once \(\mathbf{L}\) is available.
Linear systems \(\mathbf{D}\,\mathbf{b} = \mathbf{c}\) are easily solved by forward and backward substitution: first \(\mathbf{L}\,\mathbf{v} = \mathbf{c}\) is solved for \(\mathbf{v}\) by forward substitution, then \(\mathbf{L}^\top\mathbf{b} = \mathbf{v}\) is solved for \(\mathbf{b}\) by backward substitution.
This is especially useful when the linear system \(\mathbf{D}\,\mathbf{b} = \mathbf{c}\) is to be solved for (many) different right-hand-side vectors \(\mathbf{c}\).
\(\mathsf{det}(\mathbf{D})\) \(= \mathsf{det}\bigl(\mathbf{L}\,\mathbf{L}^\top\bigr)\) \(= \mathsf{det}(\mathbf{L})\,\mathsf{det}(\mathbf{L})\) \(= (\mbox{product of diagonal elements of }\mathbf{L})^2\).
\(\mathbf{D}^{-1}\) \(= \bigl(\mathbf{L}\,\mathbf{L}^\top\bigr)^{-1}\) \(= \bigl(\mathbf{L}^{-1}\bigr)^\top\,\mathbf{L}^{-1}\). That is, to calculate \(\mathbf{D}^{-1}\), it is only needed to invert a triangular matrix (which is easy).
If a random vector \(\mathbf{Z}\) has a distribution with a unit covariance matrix, then \(\mathbf{L}\,\mathbf{Z}\) has a distribution with covariance matrix \(\mathbf{D}\). This is useful for random number generation.
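A minimal sketch of the last point (the covariance matrix Sigma below is hypothetical, chosen only for this illustration):
Sigma <- matrix(c(2, 1, 1, 3), nrow = 2) ## hypothetical positive definite covariance matrix
L <- t(chol(Sigma)) ## chol() returns U = L', hence the transpose
set.seed(20221005)
Z <- matrix(rnorm(2 * 10000), nrow = 2) ## 10000 draws (in columns) with unit covariance matrix
W <- L %*% Z ## 10000 draws (in columns) with covariance matrix Sigma
round(cov(t(W)), 2) ## empirical covariance matrix, should be close to Sigma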
The R function chol returns the upper triangular matrix \(\mathbf{U} = \mathbf{L}^\top\).
chol(XtX)
## intcpt x x^2
## intcpt 2 0.000000 10
## x 0 4.472136 0
## x^2 0 0.000000 8
U <- chol(XtX)
t(U) %*% U
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
chol2inv(U)
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
invXtX2 <- chol2inv(U)
round(invXtX2 - invXtX, 13) ## compare with previously calculated (X'X)^(-1), numerically the same
## intcpt x x^2
## intcpt 0 0 0
## x 0 0 0
## x^2 0 0 0
prod(diag(U))^2
## [1] 5120
det(XtX) ## R function to calculate determinant
## [1] 5120
We need to solve \(\mathbf{X}^\top\mathbf{X}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y}\). Since \(\mathbf{X}^\top\mathbf{X} = \mathbf{U}^\top\mathbf{U}\), we have to solve \(\mathbf{U}^\top\mathbf{U}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y}\).
(v <- forwardsolve(t(U), crossprod(X, y))) ## solution to U' * v = X'Y
## [,1]
## [1,] 0.00000
## [2,] 21.46625
## [3,] 10.00000
(bchol <- backsolve(U, v)) ## solution to U * b = v
## [,1]
## [1,] -6.25
## [2,] 4.80
## [3,] 1.25
print(b) ## compare to previously calculated solution
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
Suppose that \(n = \mathrm{nrow}(\mathbf{X})\), \(k = \mathrm{ncol}(\mathbf{X})\), \(r = \mathrm{rank}(\mathbf{X}) \leq k \leq n\). The singular value decomposition (SVD) of the model matrix \(\mathbf{X}\) takes the following form: \[ \mathbf{X} = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^\top, \] where \(\mathbf{U}\) is an \(n\times k\) matrix whose columns \(\mathbf{u}_1,\ldots,\mathbf{u}_k\) are orthonormal, \(\mathbf{D} = \mathrm{diag}(d_1,\ldots,d_k)\) is a \(k\times k\) diagonal matrix with \(d_1 \geq \cdots \geq d_k \geq 0\), and \(\mathbf{V}\) is a \(k\times k\) orthogonal matrix with columns \(\mathbf{v}_1,\ldots,\mathbf{v}_k\).
The numbers \(d_1,\ldots,d_k\) are called the singular values of the matrix \(\mathbf{X}\). It holds that exactly \(r\) of them are positive and that \(d_1^2,\ldots,d_k^2\) are the eigenvalues of \(\mathbf{X}^\top\mathbf{X}\).
Further,
\(\mathbf{V}\,\mathbf{V}^\top = \sum_{j=1}^k\mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{I}_{k\times k}\).
\(\mathbf{X} = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^\top = \sum_{j=1}^k d_j\,\mathbf{u}_j\,\mathbf{v}_j^\top\);
\(\mathbf{X}^\top\,\mathbf{X} = \mathbf{V}\,\mathbf{D}\,\mathbf{D}\,\mathbf{V}^\top = \sum_{j=1}^k d_j^2\,\mathbf{v}_j\,\mathbf{v}_j^\top\).
If \(d_j > 0\) for all \(j=1,\ldots,k\), the matrix \(\mathbf{X}^\top\,\mathbf{X}\) is invertible and \[ \bigl(\mathbf{X}^\top\,\mathbf{X}\bigr)^{-1} = \mathbf{V}\,\mathbf{D}^{-2}\,\mathbf{V}^\top = \sum_{j=1}^k \frac{1}{d_j^2}\,\mathbf{v}_j\,\mathbf{v}_j^\top, \] where \(\mathbf{D}^{-2}\) is a diagonal matrix with \(d_1^{-2},\ldots,d_k^{-2}\) on the diagonal.
Since matrix \(\mathbf{U}\) contains an orthonormal basis of \(M(\mathbf{X})\) in its columns, the projection matrix into \(M(\mathbf{X})\) is \[ \mathbf{H} = \mathbf{U}\,\mathbf{U}^\top. \]
The SVD can also be used to calculate the Moore–Penrose pseudoinverse of \(\mathbf{X}\): \[ \mathbf{X}^- = \tilde{\mathbf{V}}\,{\tilde{\mathbf{D}}}^{-1}\,{\tilde{\mathbf{U}}}^\top, \] where the matrices \(\tilde{\mathbf{V}}_{k\times r},\,\tilde{\mathbf{D}}_{r\times r},\,\tilde{\mathbf{U}}_{n\times r}\) correspond only to the positive singular values (i.e., they keep only those columns of \(\mathbf{V}\) and \(\mathbf{U}\), and those diagonal elements of \(\mathbf{D}\), for which \(d_j > 0\)).
svd(X)
## $d
## [1] 12.902020 4.472136 1.240116
##
## $u
## [,1] [,2] [,3]
## [1,] 0.70180882 0.6708204 -0.08639661
## [2,] 0.08639661 0.2236068 0.70180882
## [3,] 0.08639661 -0.2236068 0.70180882
## [4,] 0.70180882 -0.6708204 -0.08639661
##
## $v
## [,1] [,2] [,3]
## [1,] 0.1221833 0 0.9925076
## [2,] 0.0000000 -1 0.0000000
## [3,] 0.9925076 0 -0.1221833
Calculation using the svd function:
svdX <- svd(X)
names(svdX)
## [1] "d" "u" "v"
(U <- svdX$u)
## [,1] [,2] [,3]
## [1,] 0.70180882 0.6708204 -0.08639661
## [2,] 0.08639661 0.2236068 0.70180882
## [3,] 0.08639661 -0.2236068 0.70180882
## [4,] 0.70180882 -0.6708204 -0.08639661
(D <- diag(svdX$d))
## [,1] [,2] [,3]
## [1,] 12.90202 0.000000 0.000000
## [2,] 0.00000 4.472136 0.000000
## [3,] 0.00000 0.000000 1.240116
(V <- svdX$v)
## [,1] [,2] [,3]
## [1,] 0.1221833 0 0.9925076
## [2,] 0.0000000 -1 0.0000000
## [3,] 0.9925076 0 -0.1221833
(D2 <- diag(svdX$d^2)) ## = D * D
## [,1] [,2] [,3]
## [1,] 166.4621 0 0.000000
## [2,] 0.0000 20 0.000000
## [3,] 0.0000 0 1.537887
(iD <- diag(1 / svdX$d)) ## = D^(-1)
## [,1] [,2] [,3]
## [1,] 0.07750724 0.0000000 0.0000000
## [2,] 0.00000000 0.2236068 0.0000000
## [3,] 0.00000000 0.0000000 0.8063762
(iD2 <- diag(1 / svdX$d^2)) ## = (D * D)^(-1)
## [,1] [,2] [,3]
## [1,] 0.006007373 0.00 0.0000000
## [2,] 0.000000000 0.05 0.0000000
## [3,] 0.000000000 0.00 0.6502426
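A quick check of the relationship mentioned above (using the eigendecomposition object e computed earlier in this tutorial): the squared singular values of \(\mathbf{X}\) equal the eigenvalues of \(\mathbf{X}^\top\mathbf{X}\).
round(svdX$d^2 - e$values, 13) ## zero (numerically): d_j^2 are the eigenvalues of X'X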
U %*% D %*% t(V) ### --> X
## [,1] [,2] [,3]
## [1,] 1 -3 9
## [2,] 1 -1 1
## [3,] 1 1 1
## [4,] 1 3 9
V %*% D %*% t(D) %*% t(V) ### --> X'X
## [,1] [,2] [,3]
## [1,] 4 0 20
## [2,] 0 20 0
## [3,] 20 0 164
round(t(U) %*% U, 13) ### --> I (numerically)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(t(V) %*% V, 13) ### --> I (numerically)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(V %*% t(V), 13) ### --> I (numerically)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
Xmin <- V %*% diag(1/svdX$d) %*% t(U)
print(Xmin)
## [,1] [,2] [,3] [,4]
## [1,] -0.0625 0.5625 0.5625 -0.0625
## [2,] -0.1500 -0.0500 0.0500 0.1500
## [3,] 0.0625 -0.0625 -0.0625 0.0625
round(X %*% Xmin %*% X - X, 13) ## X * X^- * X = X
## intcpt x x^2
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
round(Xmin %*% X %*% Xmin - Xmin, 13) ## X^- * X * X^- = X^-
## [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
round(t(X %*% Xmin) - X %*% Xmin, 13) ## (X * X^-)' = X * X^-
## [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [4,] 0 0 0 0
round(t(Xmin %*% X) - Xmin %*% X, 13) ## (X^- * X)' = X^- * X
## [,1] [,2] [,3]
## intcpt 0 0 0
## x 0 0 0
## x^2 0 0 0
(H <- U %*% t(U))
## [,1] [,2] [,3] [,4]
## [1,] 0.95 0.15 -0.15 0.05
## [2,] 0.15 0.55 0.45 -0.15
## [3,] -0.15 0.45 0.55 0.15
## [4,] 0.05 -0.15 0.15 0.95
(M <- diag(nrow(X)) - H)
## [,1] [,2] [,3] [,4]
## [1,] 0.05 -0.15 0.15 -0.05
## [2,] -0.15 0.45 -0.45 0.15
## [3,] 0.15 -0.45 0.45 -0.15
## [4,] -0.05 0.15 -0.15 0.05
(yhat <- H %*% y)
## [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4
(u <- M %*% y)
## [,1]
## [1,] 0.4
## [2,] -1.2
## [3,] 1.2
## [4,] -0.4
(yhat_from_b <- X %*% b)
## [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4
(u_from_b <- y - yhat_from_b)
## [,1]
## [1,] 0.4
## [2,] -1.2
## [3,] 1.2
## [4,] -0.4
V %*% D2 %*% t(V) ## X'X
## [,1] [,2] [,3]
## [1,] 4 0 20
## [2,] 0 20 0
## [3,] 20 0 164
crossprod(D %*% t(V)) ## X'X calculated in an even better way
## [,1] [,2] [,3]
## [1,] 4 0 20
## [2,] 0 20 0
## [3,] 20 0 164
print(XtX) ## previously calculated X'X
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
V %*% iD2 %*% t(V) ## (X'X)^(-1)
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
crossprod(iD %*% t(V)) ## (X'X)^(-1) calculated in an even better way
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
print(invXtX) ## previously calculated (X'X)^(-1)
## intcpt x x^2
## intcpt 0.640625 0.00 -0.078125
## x 0.000000 0.05 0.000000
## x^2 -0.078125 0.00 0.015625
Still suppose that \(n = \mathrm{nrow}(\mathbf{X})\), \(k = \mathrm{ncol}(\mathbf{X})\), \(r = \mathrm{rank}(\mathbf{X}) \leq k \leq n\). The QR decomposition of the model matrix \(\mathbf{X}\) takes the following form: \[ \mathbf{X} = \mathbf{Q}\,\mathbf{R}, \] where \(\mathbf{Q}\) is an \(n\times k\) matrix with orthonormal columns and \(\mathbf{R}\) is a \(k\times k\) upper triangular matrix.
Further, \[ \mathbf{X}^\top\mathbf{X} = \mathbf{R}^\top\,\mathbf{Q}^\top\mathbf{Q}\,\mathbf{R} = \mathbf{R}^\top\mathbf{R}, \] since the columns of the matrix \(\mathbf{Q}\) form an orthonormal basis and hence \(\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_k\).
The system of normal equations is then easily solvable by a single backward substitution since \[ \mathbf{X}^\top\mathbf{X}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y} \] \[ \mathbf{R}^\top\mathbf{R}\,\mathbf{b} = \mathbf{R}^\top\mathbf{Q}^\top\,\mathbf{Y} \] \[ \mathbf{R}\,\mathbf{b} = \mathbf{Q}^\top\,\mathbf{Y}, \] where the last step cancels \(\mathbf{R}^\top\) (invertible whenever \(r = k\)). That is, it only remains to solve one linear system with an upper triangular matrix.
If \(r = k\) then \(\mathbf{X}^\top\mathbf{X}\) is invertible and \[ \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1} = \bigl(\mathbf{R}^\top\mathbf{R}\bigr)^{-1} = \mathbf{R}^{-1}\bigl(\mathbf{R}^{-1}\bigr)^\top. \] That is, it is only needed to invert a triangular matrix (which is easy).
qr(X)
## $qr
## intcpt x x^2
## [1,] -2.0 0.0000000 -10.0000000
## [2,] 0.5 -4.4721360 0.0000000
## [3,] 0.5 0.4472136 8.0000000
## [4,] 0.5 0.8944272 -0.9296181
##
## $rank
## [1] 3
##
## $qraux
## [1] 1.500000 1.000000 1.368524
##
## $pivot
## [1] 1 2 3
##
## attr(,"class")
## [1] "qr"
Note that results are stored and shown in a compact form.
Calculation using the qr function:
qrX <- qr(X)
names(qrX)
## [1] "qr" "rank" "qraux" "pivot"
qrX$rank
## [1] 3
r <- qrX$rank
n <- nrow(X)
In our case, \(r = k = 3\), so all columns of \(\mathbf{Q}\) form an orthonormal basis of \(M(\mathbf{X})\).
qr.Q(qrX)
## [,1] [,2] [,3]
## [1,] -0.5 0.6708204 0.5
## [2,] -0.5 0.2236068 -0.5
## [3,] -0.5 -0.2236068 -0.5
## [4,] -0.5 -0.6708204 0.5
Q <- qr.Q(qrX)
qr.R(qrX)
## intcpt x x^2
## [1,] -2 0.000000 -10
## [2,] 0 -4.472136 0
## [3,] 0 0.000000 8
R <- qr.R(qrX)
Q %*% R - X ## not really zeros
## intcpt x x^2
## [1,] 0 0 0.000000e+00
## [2,] 0 0 0.000000e+00
## [3,] 0 0 8.881784e-16
## [4,] 0 0 0.000000e+00
round(Q %*% R - X, 14) ## numerically yes
## intcpt x x^2
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
An orthonormal basis of \(\mathbb{R}^n\), whose first \(r\) vectors are the columns of \(\mathbf{Q}\), will be stored in a matrix \(\mathbf{P}\).
(P <- qr.Q(qrX, complete = TRUE))
## [,1] [,2] [,3] [,4]
## [1,] -0.5 0.6708204 0.5 0.2236068
## [2,] -0.5 0.2236068 -0.5 -0.6708204
## [3,] -0.5 -0.2236068 -0.5 0.6708204
## [4,] -0.5 -0.6708204 0.5 -0.2236068
The last \(n-r\) columns of the matrix \(\mathbf{P}\) form an orthonormal basis of the residual space. We store this basis in a matrix \(\mathbf{N}\).
(N <- P[, (r + 1):n])
## [1] 0.2236068 -0.6708204 0.6708204 -0.2236068
(H <- Q %*% t(Q))
## [,1] [,2] [,3] [,4]
## [1,] 0.95 0.15 -0.15 0.05
## [2,] 0.15 0.55 0.45 -0.15
## [3,] -0.15 0.45 0.55 0.15
## [4,] 0.05 -0.15 0.15 0.95
(M <- N %*% t(N))
## [,1] [,2] [,3] [,4]
## [1,] 0.05 -0.15 0.15 -0.05
## [2,] -0.15 0.45 -0.45 0.15
## [3,] 0.15 -0.45 0.45 -0.15
## [4,] -0.05 0.15 -0.15 0.05
(yhat <- H %*% y)
## [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4
(u <- M %*% y)
## [,1]
## [1,] 0.4
## [2,] -1.2
## [3,] 1.2
## [4,] -0.4
(yhat_from_b <- X %*% b)
## [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4
(u_from_b <- y - yhat_from_b)
## [,1]
## [1,] 0.4
## [2,] -1.2
## [3,] 1.2
## [4,] -0.4
crossprod(R) ## = R'R = X'X
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
print(XtX) ## previously calculated X'X
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
chol2inv(R) ## inverse of X'X calculated from its Cholesky factor R (chol2inv exploits the fact that R is upper triangular)
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
print(invXtX) ## compare to previously calculated (X'X)^(-1)
## intcpt x x^2
## intcpt 0.640625 0.00 -0.078125
## x 0.000000 0.05 0.000000
## x^2 -0.078125 0.00 0.015625
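The relation \(\bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1} = \mathbf{R}^{-1}\bigl(\mathbf{R}^{-1}\bigr)^\top\) can also be verified literally (a small sketch added here; backsolve with an identity right-hand side yields \(\mathbf{R}^{-1}\)):
Rinv <- backsolve(R, diag(ncol(R))) ## R^(-1) by back substitution with the identity matrix as RHS
round(Rinv %*% t(Rinv) - invXtX, 13) ## R^(-1) * (R^(-1))' = (X'X)^(-1) (numerically)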
(b_from_QR <- backsolve(R, crossprod(Q, y)))
## [,1]
## [1,] -6.25
## [2,] 4.80
## [3,] 1.25
print(b) ## previously calculated b for comparison
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
The residual sum of squares is also easily calculated, using the \(\mathbf{N}\) matrix (orthonormal basis of the residual space) since \[ \mathrm{SS}_e = \bigl(\mathbf{M}\mathbf{Y}\bigr)^\top\,\mathbf{M}\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{M}^\top\mathbf{M}\,\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{N}\mathbf{N}^\top\mathbf{N}\mathbf{N}^\top\,\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{N}\mathbf{N}^\top\,\mathbf{Y} = \bigl\|\mathbf{N}^\top\,\mathbf{Y}\bigr\|^2. \]
(NtY <- crossprod(N, y))
## [,1]
## [1,] 1.788854
(SSe <- as.numeric(crossprod(NtY)))
## [1] 3.2
It should now be clear that when dealing with least squares, it is not necessary to calculate the matrix \(\mathbf{X}^\top\mathbf{X}\) explicitly, let alone its inverse directly from \(\mathbf{X}^\top\mathbf{X}\). Everything can be obtained from a suitable decomposition of the model matrix \(\mathbf{X}\), and once this decomposition is available, all the rest is easy (also with respect to computational demands).
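As a final cross-check (a small sketch that goes beyond the text above), the base R workhorse lm.fit, which is based on a QR decomposition of \(\mathbf{X}\), reproduces the quantities computed above in a single call.
fitQR <- lm.fit(X, y) ## least squares via the QR decomposition of X
fitQR$coefficients ## should equal b = (-6.25, 4.80, 1.25)
fitQR$fitted.values ## should equal yhat computed above
sum(fitQR$residuals^2) ## should equal SSe = 3.2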