Matrix algebra background of linear regression
Everything in this tutorial will be illustrated using pseudodata that we now create.
y <- c(-9, -11, 1, 19)
x <- c(-3, -1, 1, 3)
x2 <- c(9, 1, 1, 9)
x2 <- x^2 ### the same
X <- cbind(1, x, x2)
colnames(X) <- c("intcpt", "x", "x^2")
print(X)
## intcpt x x^2
## [1,] 1 -3 9
## [2,] 1 -1 1
## [3,] 1 1 1
## [4,] 1 3 9
Data <- data.frame(y = y, x = x, x2 = x2)
print(Data)
## y x x2
## 1 -9 -3 9
## 2 -11 -1 1
## 3 1 1 1
## 4 19 3 9
We will be fitting the following linear model: \[ \mathsf{E}Y = \beta_0 + \beta_1\,x + \beta_2\,x^2. \]
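For later comparison (a quick cross-check added here, not needed for the matrix derivations that follow), the same model can also be fitted by the built-in lm function; its coefficients should coincide with the solution \(\widehat{\beta}\) computed by hand below.
fit0 <- lm(y ~ x + x2, data = Data) ## reference fit of the quadratic model
coef(fit0) ## should agree with the hand-computed solution b below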
First, we calculate the solution to the normal equations directly from the theoretically derived expression, i.e., as \(\widehat{\beta} = \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1}\mathbf{X}^\top\mathbf{Y}\).
XtX <- t(X) %*% X
XtX <- crossprod(X) ## the same
print(XtX)
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
invXtX <- solve(XtX)
print(invXtX)
## intcpt x x^2
## intcpt 0.640625 0.00 -0.078125
## x 0.000000 0.05 0.000000
## x^2 -0.078125 0.00 0.015625
b <- invXtX %*% t(X) %*% y
print(b)
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
solve(XtX, t(X) %*% y)
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
solve(crossprod(X), crossprod(X, y)) ## the same
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
Calculating the solution to the normal equations via the above “raw” approach is not numerically optimal. In practice, numerically more stable and much better methods are used, which are based on decompositions of the model matrix \(\mathbf{X}\) known from linear algebra. An additional advantage of these methods is that once the decomposition of the model matrix \(\mathbf{X}\) is available, calculation of everything else (fitted values, residuals, solution to normal equations, …) is trivial and computationally fast. This is particularly useful when least squares solutions are needed for a series of “response vectors” \(\mathbf{Y}\) while the model matrix \(\mathbf{X}\) is kept fixed.
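As a small illustration of the last point (the second response vector y2 below is hypothetical, made up just for this example), the normal equations can be solved for several response vectors in a single call; the decompositions discussed next make such repeated solves even cheaper.
y2 <- c(-8, -12, 2, 18) ## hypothetical second response observed on the same design
Ymat <- cbind(y, y2) ## response vectors stored in the columns
solve(crossprod(X), crossprod(X, Ymat)) ## LS coefficients for each column of Ymat at once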
The most often used decompositions are the QR decomposition (used, among others, inside the lm function in R) and the singular value decomposition (SVD). In the following, these two decompositions will be explored in more detail. Nevertheless, we first explore two other decompositions of the symmetric positive semidefinite matrix \(\mathbf{X}^\top\mathbf{X}\) that can also be used to solve the normal equations \(\mathbf{X}^\top\mathbf{X}\,\mathbf{b}= \mathbf{X}^\top\mathbf{Y}\) and possibly for some other tasks. These will be the spectral (eigenvalue) decomposition and the Cholesky decomposition.
Both the spectral decomposition and the Cholesky decomposition should be known from the bachelor courses Linear algebra and Fundamentals of Numerical Mathematics, the Cholesky decomposition perhaps in the more general form of the LU decomposition.
The spectral (eigenvalue) decomposition of the matrix \(\mathbf{X}^\top\mathbf{X}\) takes the form \[ \mathbf{X}^\top\mathbf{X} = \mathbf{V}\,\Lambda\,\mathbf{V}^\top = \sum_{j=1}^p \lambda_j\,\mathbf{v}_j\,\mathbf{v}_j^\top, \qquad \lambda_j \geq 0,\;\;j=1,\ldots,p, \] where \(\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)\) contains the eigenvalues and the columns \(\mathbf{v}_1,\ldots,\mathbf{v}_p\) of \(\mathbf{V}\) are the corresponding orthonormal eigenvectors.
Additionally, the following holds: \[ \mathbf{V}\,\mathbf{V}^\top = \sum_{j=1}^p \mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{I}_p, \] \[ \mathbf{V}^\top\,\mathbf{V} = \bigl(\mathbf{v}_j^\top\,\mathbf{v}_l\bigr) = \mathbf{I}_p. \] That is, \[ \mathbf{V}^{-1} = \mathbf{V}^\top. \] If \(\lambda_j > 0\) for all \(j=1,\ldots,p\), then \[ \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1} = \sum_{j=1}^p \lambda_j^{-1}\,\mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{V}\,\Lambda^{-1}\,\mathbf{V}^\top. \]
eigen(XtX)
## $values
## [1] 166.462113 20.000000 1.537887
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.1221833 0 0.9925076
## [2,] 0.0000000 -1 0.0000000
## [3,] -0.9925076 0 -0.1221833
Calculation using the eigen function:
e <- eigen(XtX)
names(e)
## [1] "values" "vectors"
V <- e$vectors
Lambda <- diag(e$values)
V %*% Lambda %*% t(V)
## [,1] [,2] [,3]
## [1,] 4 0 20
## [2,] 0 20 0
## [3,] 20 0 164
round(V %*% t(V), 13) ## only numerically identity matrix
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(t(V) %*% V, 13) ## only numerically identity matrix
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(XtX %*% V - V %*% Lambda, 13) ## X'X * V = V * Lambda
## [,1] [,2] [,3]
## intcpt 0 0 0
## x 0 0 0
## x^2 0 0 0
invLambda <- diag(1 / e$values)
V %*% invLambda %*% t(V)
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
round(V %*% invLambda %*% t(V) - invXtX, 13) ## V * Lambda^(-1) * V' = (X'X)^(-1)
## intcpt x x^2
## intcpt 0 0 0
## x 0 0 0
## x^2 0 0 0
The Cholesky decomposition of \(\mathbf{X}^\top\mathbf{X}\) takes the form \[ \mathbf{X}^\top\mathbf{X} = \mathbf{L}\,\mathbf{L}^\top, \] where \(\mathbf{L}\) is a lower triangular matrix.
This is a special case of the so-called LU decomposition, where for a symmetric positive semidefinite matrix \(\mathbf{U} = \mathbf{L}^\top\) (or \(\mathbf{L} = \mathbf{U}^\top\)).
In statistics, symmetric positive semidefinite matrices are encountered quite often since all covariance matrices are of this type. The Cholesky decomposition is in fact a generalization of the square root to matrices (as will be shown shortly). The Cholesky decomposition of a covariance matrix is then in fact a multivariate analogue of the standard deviation. Having it available therefore simplifies many problems.
If \(\mathbf{D}\) is a symmetric positive semidefinite matrix (e.g., a covariance matrix or the matrix \(\mathbf{X}^\top\mathbf{X}\) in a regression context), then it can be decomposed as \(\mathbf{D} = \mathbf{L}\,\mathbf{L}^\top\), where \(\mathbf{L}\) is a lower triangular matrix. Among other things, the following tasks become practically trivial once \(\mathbf{L}\) is available.
Linear systems \(\mathbf{D}\,\mathbf{b} = \mathbf{c}\) are easily solved by forward and backward substitution: first \(\mathbf{L}\,\mathbf{v} = \mathbf{c}\) is solved for \(\mathbf{v}\) by forward substitution, then \(\mathbf{L}^\top\mathbf{b} = \mathbf{v}\) is solved for \(\mathbf{b}\) by backward substitution.
This is especially useful when the linear system \(\mathbf{D}\,\mathbf{b} = \mathbf{c}\) is to be solved for (many) different right-hand-side vectors \(\mathbf{c}\).
\(\mathsf{det}(\mathbf{D})\) \(= \mathsf{det}\bigl(\mathbf{L}\,\mathbf{L}^\top\bigr)\) \(= \mathsf{det}(\mathbf{L})\,\mathsf{det}(\mathbf{L})\) \(= (\mbox{product of diagonal elements of }\mathbf{L})^2\).
\(\mathbf{D}^{-1}\) \(= \bigl(\mathbf{L}\,\mathbf{L}^\top\bigr)^{-1}\) \(= \bigl(\mathbf{L}^{-1}\bigr)^\top\,\mathbf{L}^{-1}\). That is, to calculate \(\mathbf{D}^{-1}\), it is only needed to invert a triangular matrix (which is easy).
If a random vector \(\mathbf{Z}\) has a distribution with a unit covariance matrix, then \(\mathbf{L}\,\mathbf{Z}\) has a distribution with covariance matrix \(\mathbf{D}\). This is useful for random number generation.
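A minimal sketch of the last point (the covariance matrix Sigma below is hypothetical, chosen only for this illustration):
Sigma <- matrix(c(2, 1, 1, 3), nrow = 2) ## hypothetical positive definite covariance matrix
L <- t(chol(Sigma)) ## chol() returns U = L', hence the transpose
set.seed(20221005)
Z <- matrix(rnorm(2 * 10000), nrow = 2) ## 10000 draws (in columns) with unit covariance matrix
W <- L %*% Z ## 10000 draws (in columns) with covariance matrix Sigma
round(cov(t(W)), 2) ## empirical covariance matrix, should be close to Sigma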
The R function chol returns the upper triangular matrix \(\mathbf{U} = \mathbf{L}^\top\).
chol(XtX)
## intcpt x x^2
## intcpt 2 0.000000 10
## x 0 4.472136 0
## x^2 0 0.000000 8
U <- chol(XtX)
t(U) %*% U
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
chol2inv(U)
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
invXtX2 <- chol2inv(U)
round(invXtX2 - invXtX, 13) ## compare with previously calculated (X'X)^(-1), numerically the same
## intcpt x x^2
## intcpt 0 0 0
## x 0 0 0
## x^2 0 0 0
prod(diag(U))^2
## [1] 5120
det(XtX) ## R function to calculate determinant
## [1] 5120
We need to solve \(\mathbf{X}^\top\mathbf{X}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y}\). Since \(\mathbf{X}^\top\mathbf{X} = \mathbf{U}^\top\mathbf{U}\), we have to solve \(\mathbf{U}^\top\mathbf{U}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y}\).
(v <- forwardsolve(t(U), crossprod(X, y))) ## solution to U' * v = X'Y
## [,1]
## [1,] 0.00000
## [2,] 21.46625
## [3,] 10.00000
(bchol <- backsolve(U, v)) ## solution to U * b = v
## [,1]
## [1,] -6.25
## [2,] 4.80
## [3,] 1.25
print(b) ## compare to previously calculated solution
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
Suppose that \(n = \mathrm{nrow}(\mathbf{X})\), \(k = \mathrm{ncol}(\mathbf{X})\), \(r = \mathrm{rank}(\mathbf{X}) \leq k \leq n\). The singular value decomposition (SVD) of the model matrix \(\mathbf{X}\) takes the following form: \[ \mathbf{X} = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^\top, \] where \(\mathbf{U}\) is an \(n\times k\) matrix whose columns \(\mathbf{u}_1,\ldots,\mathbf{u}_k\) are orthonormal, \(\mathbf{D} = \mathrm{diag}(d_1,\ldots,d_k)\) is a \(k\times k\) diagonal matrix with \(d_1 \geq \cdots \geq d_k \geq 0\), and \(\mathbf{V}\) is a \(k\times k\) orthogonal matrix with columns \(\mathbf{v}_1,\ldots,\mathbf{v}_k\).
The numbers \(d_1,\ldots,d_k\) are called the singular values of the matrix \(\mathbf{X}\). It holds that exactly \(r\) of them are positive and that \(d_1^2,\ldots,d_k^2\) are the eigenvalues of \(\mathbf{X}^\top\mathbf{X}\).
Further,
\(\mathbf{V}\,\mathbf{V}^\top = \sum_{j=1}^k\mathbf{v}_j\,\mathbf{v}_j^\top = \mathbf{I}_{k\times k}\).
\(\mathbf{X} = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^\top = \sum_{j=1}^k d_j\,\mathbf{u}_j\,\mathbf{v}_j^\top\);
\(\mathbf{X}^\top\,\mathbf{X} = \mathbf{V}\,\mathbf{D}\,\mathbf{D}\,\mathbf{V}^\top = \sum_{j=1}^k d_j^2\,\mathbf{v}_j\,\mathbf{v}_j^\top\).
If \(d_j > 0\) for all \(j=1,\ldots,k\), the matrix \(\mathbf{X}^\top\,\mathbf{X}\) is invertible and \[ \bigl(\mathbf{X}^\top\,\mathbf{X}\bigr)^{-1} = \mathbf{V}\,\mathbf{D}^{-2}\,\mathbf{V}^\top = \sum_{j=1}^k \frac{1}{d_j^2}\,\mathbf{v}_j\,\mathbf{v}_j^\top, \] where \(\mathbf{D}^{-2}\) is a diagonal matrix with \(d_1^{-2},\ldots,d_k^{-2}\) on the diagonal.
Since matrix \(\mathbf{U}\) contains an orthonormal basis of \(M(\mathbf{X})\) in its columns, the projection matrix into \(M(\mathbf{X})\) is \[ \mathbf{H} = \mathbf{U}\,\mathbf{U}^\top. \]
The SVD can also be used to calculate the Moore–Penrose pseudoinverse of \(\mathbf{X}\): \[ \mathbf{X}^- = \tilde{\mathbf{V}}\,{\tilde{\mathbf{D}}}^{-1}\,{\tilde{\mathbf{U}}}^\top, \] where the matrices \(\tilde{\mathbf{V}}_{k\times r},\,\tilde{\mathbf{D}}_{r\times r},\,\tilde{\mathbf{U}}_{n\times r}\) correspond only to the positive singular values (i.e., they keep only those columns of \(\mathbf{V}\) and \(\mathbf{U}\), and those diagonal elements of \(\mathbf{D}\), for which \(d_j > 0\)).
svd(X)
## $d
## [1] 12.902020 4.472136 1.240116
##
## $u
## [,1] [,2] [,3]
## [1,] 0.70180882 0.6708204 -0.08639661
## [2,] 0.08639661 0.2236068 0.70180882
## [3,] 0.08639661 -0.2236068 0.70180882
## [4,] 0.70180882 -0.6708204 -0.08639661
##
## $v
## [,1] [,2] [,3]
## [1,] 0.1221833 0 0.9925076
## [2,] 0.0000000 -1 0.0000000
## [3,] 0.9925076 0 -0.1221833
Calculation using the svd function:
svdX <- svd(X)
names(svdX)
## [1] "d" "u" "v"
(U <- svdX$u)
## [,1] [,2] [,3]
## [1,] 0.70180882 0.6708204 -0.08639661
## [2,] 0.08639661 0.2236068 0.70180882
## [3,] 0.08639661 -0.2236068 0.70180882
## [4,] 0.70180882 -0.6708204 -0.08639661
(D <- diag(svdX$d))
## [,1] [,2] [,3]
## [1,] 12.90202 0.000000 0.000000
## [2,] 0.00000 4.472136 0.000000
## [3,] 0.00000 0.000000 1.240116
(V <- svdX$v)
## [,1] [,2] [,3]
## [1,] 0.1221833 0 0.9925076
## [2,] 0.0000000 -1 0.0000000
## [3,] 0.9925076 0 -0.1221833
(D2 <- diag(svdX$d^2)) ## = D * D
## [,1] [,2] [,3]
## [1,] 166.4621 0 0.000000
## [2,] 0.0000 20 0.000000
## [3,] 0.0000 0 1.537887
(iD <- diag(1 / svdX$d)) ## = D^(-1)
## [,1] [,2] [,3]
## [1,] 0.07750724 0.0000000 0.0000000
## [2,] 0.00000000 0.2236068 0.0000000
## [3,] 0.00000000 0.0000000 0.8063762
(iD2 <- diag(1 / svdX$d^2)) ## = (D * D)^(-1)
## [,1] [,2] [,3]
## [1,] 0.006007373 0.00 0.0000000
## [2,] 0.000000000 0.05 0.0000000
## [3,] 0.000000000 0.00 0.6502426
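A quick check of the relationship mentioned above (using the eigendecomposition object e computed earlier in this tutorial): the squared singular values of \(\mathbf{X}\) equal the eigenvalues of \(\mathbf{X}^\top\mathbf{X}\).
round(svdX$d^2 - e$values, 13) ## zero (numerically): d_j^2 are the eigenvalues of X'X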
U %*% D %*% t(V) ### --> X
## [,1] [,2] [,3]
## [1,] 1 -3 9
## [2,] 1 -1 1
## [3,] 1 1 1
## [4,] 1 3 9
V %*% D %*% t(D) %*% t(V) ### --> X'X
## [,1] [,2] [,3]
## [1,] 4 0 20
## [2,] 0 20 0
## [3,] 20 0 164
round(t(U) %*% U, 13) ### --> I (numerically)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(t(V) %*% V, 13) ### --> I (numerically)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(V %*% t(V), 13) ### --> I (numerically)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
Xmin <- V %*% diag(1/svdX$d) %*% t(U)
print(Xmin)
## [,1] [,2] [,3] [,4]
## [1,] -0.0625 0.5625 0.5625 -0.0625
## [2,] -0.1500 -0.0500 0.0500 0.1500
## [3,] 0.0625 -0.0625 -0.0625 0.0625
round(X %*% Xmin %*% X - X, 13) ## X * X^- * X = X
## intcpt x x^2
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
round(Xmin %*% X %*% Xmin - Xmin, 13) ## X^- * X * X^- = X^-
## [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
round(t(X %*% Xmin) - X %*% Xmin, 13) ## (X * X^-)' = X * X^-
## [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [4,] 0 0 0 0
round(t(Xmin %*% X) - Xmin %*% X, 13) ## (X^- * X)' = X^- * X
## [,1] [,2] [,3]
## intcpt 0 0 0
## x 0 0 0
## x^2 0 0 0
(H <- U %*% t(U))
## [,1] [,2] [,3] [,4]
## [1,] 0.95 0.15 -0.15 0.05
## [2,] 0.15 0.55 0.45 -0.15
## [3,] -0.15 0.45 0.55 0.15
## [4,] 0.05 -0.15 0.15 0.95
(M <- diag(nrow(X)) - H)
## [,1] [,2] [,3] [,4]
## [1,] 0.05 -0.15 0.15 -0.05
## [2,] -0.15 0.45 -0.45 0.15
## [3,] 0.15 -0.45 0.45 -0.15
## [4,] -0.05 0.15 -0.15 0.05
(yhat <- H %*% y)
## [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4
(u <- M %*% y)
## [,1]
## [1,] 0.4
## [2,] -1.2
## [3,] 1.2
## [4,] -0.4
(yhat_from_b <- X %*% b)
## [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4
(u_from_b <- y - yhat_from_b)
## [,1]
## [1,] 0.4
## [2,] -1.2
## [3,] 1.2
## [4,] -0.4
V %*% D2 %*% t(V) ## X'X
## [,1] [,2] [,3]
## [1,] 4 0 20
## [2,] 0 20 0
## [3,] 20 0 164
crossprod(D %*% t(V)) ## X'X calculated in an even better way
## [,1] [,2] [,3]
## [1,] 4 0 20
## [2,] 0 20 0
## [3,] 20 0 164
print(XtX) ## previously calculated X'X
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
V %*% iD2 %*% t(V) ## (X'X)^(-1)
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
crossprod(iD %*% t(V)) ## (X'X)^(-1) calculated in an even better way
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
print(invXtX) ## previously calculated (X'X)^(-1)
## intcpt x x^2
## intcpt 0.640625 0.00 -0.078125
## x 0.000000 0.05 0.000000
## x^2 -0.078125 0.00 0.015625
Still suppose that \(n = \mathrm{nrow}(\mathbf{X})\), \(k = \mathrm{ncol}(\mathbf{X})\), \(r = \mathrm{rank}(\mathbf{X}) \leq k \leq n\). The QR decomposition of the model matrix \(\mathbf{X}\) takes the following form: \[ \mathbf{X} = \mathbf{Q}\,\mathbf{R}, \] where \(\mathbf{Q}\) is an \(n\times k\) matrix with orthonormal columns and \(\mathbf{R}\) is a \(k\times k\) upper triangular matrix.
Further, \[ \mathbf{X}^\top\mathbf{X} = \mathbf{R}^\top\,\mathbf{Q}^\top\mathbf{Q}\,\mathbf{R} = \mathbf{R}^\top\mathbf{R}, \] since the columns of the matrix \(\mathbf{Q}\) form an orthonormal basis and hence \(\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}_k\).
The system of normal equations is then easily solvable by a single backward substitution since \[ \mathbf{X}^\top\mathbf{X}\,\mathbf{b} = \mathbf{X}^\top\mathbf{Y} \] \[ \mathbf{R}^\top\mathbf{R}\,\mathbf{b} = \mathbf{R}^\top\mathbf{Q}^\top\,\mathbf{Y} \] \[ \mathbf{R}\,\mathbf{b} = \mathbf{Q}^\top\,\mathbf{Y}, \] where the last step cancels \(\mathbf{R}^\top\) (invertible whenever \(r = k\)). That is, it only remains to solve one linear system with an upper triangular matrix.
If \(r = k\) then \(\mathbf{X}^\top\mathbf{X}\) is invertible and \[ \bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1} = \bigl(\mathbf{R}^\top\mathbf{R}\bigr)^{-1} = \mathbf{R}^{-1}\bigl(\mathbf{R}^{-1}\bigr)^\top. \] That is, it is only needed to invert a triangular matrix (which is easy).
qr(X)
## $qr
## intcpt x x^2
## [1,] -2.0 0.0000000 -10.0000000
## [2,] 0.5 -4.4721360 0.0000000
## [3,] 0.5 0.4472136 8.0000000
## [4,] 0.5 0.8944272 -0.9296181
##
## $rank
## [1] 3
##
## $qraux
## [1] 1.500000 1.000000 1.368524
##
## $pivot
## [1] 1 2 3
##
## attr(,"class")
## [1] "qr"
Note that results are stored and shown in a compact form.
Calculation using the qr function:
qrX <- qr(X)
names(qrX)
## [1] "qr" "rank" "qraux" "pivot"
qrX$rank
## [1] 3
r <- qrX$rank
n <- nrow(X)
In our case, \(r = k = 3\), so all columns of \(\mathbf{Q}\) form an orthonormal basis of \(M(\mathbf{X})\).
qr.Q(qrX)
## [,1] [,2] [,3]
## [1,] -0.5 0.6708204 0.5
## [2,] -0.5 0.2236068 -0.5
## [3,] -0.5 -0.2236068 -0.5
## [4,] -0.5 -0.6708204 0.5
Q <- qr.Q(qrX)
qr.R(qrX)
## intcpt x x^2
## [1,] -2 0.000000 -10
## [2,] 0 -4.472136 0
## [3,] 0 0.000000 8
R <- qr.R(qrX)
Q %*% R - X ## not really zeros
## intcpt x x^2
## [1,] 0 0 0.000000e+00
## [2,] 0 0 0.000000e+00
## [3,] 0 0 8.881784e-16
## [4,] 0 0 0.000000e+00
round(Q %*% R - X, 14) ## numerically yes
## intcpt x x^2
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
An orthonormal basis of \(\mathbb{R}^n\), whose first \(r\) vectors are the columns of \(\mathbf{Q}\), will be stored in a matrix \(\mathbf{P}\).
(P <- qr.Q(qrX, complete = TRUE))
## [,1] [,2] [,3] [,4]
## [1,] -0.5 0.6708204 0.5 0.2236068
## [2,] -0.5 0.2236068 -0.5 -0.6708204
## [3,] -0.5 -0.2236068 -0.5 0.6708204
## [4,] -0.5 -0.6708204 0.5 -0.2236068
The last \(n-r\) columns of the matrix \(\mathbf{P}\) form an orthonormal basis of the residual space. We store this basis in a matrix \(\mathbf{N}\).
(N <- P[, (r + 1):n])
## [1] 0.2236068 -0.6708204 0.6708204 -0.2236068
(H <- Q %*% t(Q))
## [,1] [,2] [,3] [,4]
## [1,] 0.95 0.15 -0.15 0.05
## [2,] 0.15 0.55 0.45 -0.15
## [3,] -0.15 0.45 0.55 0.15
## [4,] 0.05 -0.15 0.15 0.95
(M <- N %*% t(N))
## [,1] [,2] [,3] [,4]
## [1,] 0.05 -0.15 0.15 -0.05
## [2,] -0.15 0.45 -0.45 0.15
## [3,] 0.15 -0.45 0.45 -0.15
## [4,] -0.05 0.15 -0.15 0.05
(yhat <- H %*% y)
## [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4
(u <- M %*% y)
## [,1]
## [1,] 0.4
## [2,] -1.2
## [3,] 1.2
## [4,] -0.4
(yhat_from_b <- X %*% b)
## [,1]
## [1,] -9.4
## [2,] -9.8
## [3,] -0.2
## [4,] 19.4
(u_from_b <- y - yhat_from_b)
## [,1]
## [1,] 0.4
## [2,] -1.2
## [3,] 1.2
## [4,] -0.4
crossprod(R) ## = R'R = X'X
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
print(XtX) ## previously calculated X'X
## intcpt x x^2
## intcpt 4 0 20
## x 0 20 0
## x^2 20 0 164
chol2inv(R) ## inverse of X'X calculated from its Cholesky factor R (chol2inv exploits the fact that R is upper triangular)
## [,1] [,2] [,3]
## [1,] 0.640625 0.00 -0.078125
## [2,] 0.000000 0.05 0.000000
## [3,] -0.078125 0.00 0.015625
print(invXtX) ## compare to previously calculated (X'X)^(-1)
## intcpt x x^2
## intcpt 0.640625 0.00 -0.078125
## x 0.000000 0.05 0.000000
## x^2 -0.078125 0.00 0.015625
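The relation \(\bigl(\mathbf{X}^\top\mathbf{X}\bigr)^{-1} = \mathbf{R}^{-1}\bigl(\mathbf{R}^{-1}\bigr)^\top\) can also be verified literally (a small sketch added here; backsolve with an identity right-hand side yields \(\mathbf{R}^{-1}\)):
Rinv <- backsolve(R, diag(ncol(R))) ## R^(-1) by back substitution with the identity matrix as RHS
round(Rinv %*% t(Rinv) - invXtX, 13) ## R^(-1) * (R^(-1))' = (X'X)^(-1) (numerically)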
(b_from_QR <- backsolve(R, crossprod(Q, y)))
## [,1]
## [1,] -6.25
## [2,] 4.80
## [3,] 1.25
print(b) ## previously calculated b for comparison
## [,1]
## intcpt -6.25
## x 4.80
## x^2 1.25
The residual sum of squares is also easily calculated, using the \(\mathbf{N}\) matrix (orthonormal basis of the residual space) since \[ \mathrm{SS}_e = \bigl(\mathbf{M}\mathbf{Y}\bigr)^\top\,\mathbf{M}\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{M}^\top\mathbf{M}\,\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{N}\mathbf{N}^\top\mathbf{N}\mathbf{N}^\top\,\mathbf{Y} = \mathbf{Y}^\top\,\mathbf{N}\mathbf{N}^\top\,\mathbf{Y} = \bigl\|\mathbf{N}^\top\,\mathbf{Y}\bigr\|^2. \]
(NtY <- crossprod(N, y))
## [,1]
## [1,] 1.788854
(SSe <- as.numeric(crossprod(NtY)))
## [1] 3.2
It should now be clear that when dealing with least squares, it is not necessary to calculate the matrix \(\mathbf{X}^\top\mathbf{X}\) explicitly, let alone its inverse directly from \(\mathbf{X}^\top\mathbf{X}\). Everything can be obtained from a suitable decomposition of the model matrix \(\mathbf{X}\), and once this decomposition is available, all the rest is easy (also with respect to computational demands).
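As a final cross-check (a small sketch that goes beyond the text above), the base R workhorse lm.fit, which is based on a QR decomposition of \(\mathbf{X}\), reproduces the quantities computed above in a single call.
fitQR <- lm.fit(X, y) ## least squares via the QR decomposition of X
fitQR$coefficients ## should equal b = (-6.25, 4.80, 1.25)
fitQR$fitted.values ## should equal yhat computed above
sum(fitQR$residuals^2) ## should equal SSe = 3.2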