Letný semester 2024 | Cvičenie 6 | 02.04.2024
HTML Markdown – Rmd source file (UTF encoding)
Firstly, some necessary R libraries and the working dataset:
library("mffSM")
library("splines") # B-splines modelling
library("MASS") # function stdres for standardized residuals
### Working data
data(Dris, package = "mffSM")
help("Dris")
head(Dris)
## yield N P K Ca Mg
## 1 5.47 470 47 320 47 11
## 2 5.63 530 48 357 60 16
## 3 5.63 530 48 310 63 16
## 4 4.84 482 47 357 47 13
## 5 4.84 506 48 294 52 15
## 6 4.21 500 45 283 61 14
summary(Dris)
## yield N P K
## Min. :1.920 Min. :268.0 Min. :32.00 Min. :200.0
## 1st Qu.:4.040 1st Qu.:427.0 1st Qu.:43.00 1st Qu.:338.0
## Median :4.840 Median :473.0 Median :49.00 Median :377.0
## Mean :4.864 Mean :469.9 Mean :48.65 Mean :375.4
## 3rd Qu.:5.558 3rd Qu.:518.0 3rd Qu.:54.00 3rd Qu.:407.0
## Max. :8.792 Max. :657.0 Max. :72.00 Max. :580.0
## Ca Mg
## Min. :29.00 Min. : 8.00
## 1st Qu.:41.75 1st Qu.:10.00
## Median :49.00 Median :11.00
## Mean :51.45 Mean :11.64
## 3rd Qu.:59.00 3rd Qu.:13.00
## Max. :95.00 Max. :22.00
We will use the following model at the beginning:
aLogMg <- lm(yield ~ log2(Mg), data = Dris)
summary(aLogMg)
##
## Call:
## lm(formula = yield ~ log2(Mg), data = Dris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1194 -0.7412 -0.0741 0.7451 3.9841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4851 0.7790 1.907 0.0574 .
## log2(Mg) 0.9605 0.2208 4.349 1.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 366 degrees of freedom
## Multiple R-squared: 0.04915, Adjusted R-squared: 0.04655
## F-statistic: 18.92 on 1 and 366 DF, p-value: 1.772e-05
We will use the model above and we will closely inspect the model residuals. Three specific plots will be of particular interest:
plotLM(aLogMg)
First plot:
plot(aLogMg, which = 1)
### manual computation
points(fitted(aLogMg),residuals(aLogMg),lwd=2,col="blue4",bg="skyblue",pch=21)
lines(lowess(fitted(aLogMg),residuals(aLogMg)),col=2,lwd=3)
Second plot:
plot(aLogMg, which = 2, ask = FALSE)
### standardized residuals
H = qr.Q(aLogMg[["qr"]]) %*% t(qr.Q(aLogMg[["qr"]]))
M = diag(nrow(H))-H
stdres = residuals(aLogMg)/sqrt(deviance(aLogMg)/aLogMg$df.residual*diag(M))
qq = qqnorm(stdres,plot.it=FALSE)
points(qq$x, qq$y,lwd=2,col="blue4",bg="skyblue",pch=21)
all.equal(qq$y,stdres)
## [1] TRUE
qqline(stdres,lwd=2,lty=3)
Third plot
plot(aLogMg, which = 3, ask = FALSE)
### manual computation
points(fitted(aLogMg),sqrt(abs(stdres)),lwd=2,col="blue4",bg="skyblue",pch=21)
lines(lowess(fitted(aLogMg),sqrt(abs(stdres))),col=2,lwd=3)
Note, that there are actually infinitely many possibilities how to check that \(E(u_i| \boldsymbol{X}_i) = 0\) (a scatterplot of fitted values vs. residuals is only one of them, other common possibility is to plot the covariate values vs. the residuals.
plot(log2(Dris$Mg), residuals(aLogMg), xlab=expression(log[2](Mg)),
ylab="Residuals", main="Residuals vs. covariate", pch = 21, col = "blue4",
bg = "skyblue");
lines(lowess(residuals(aLogMg)~log2(Dris$Mg)), col="red")
Also check the folowing and explain the results:
sum(aLogMg$residuals)
## [1] -5.08274e-14
sum(aLogMg$residuals*log2(Dris$Mg))
## [1] -2.429376e-13
For some example of possible model improvements, we will consider a different dataset now and we will fit a series of different model that we will mutually compare.
A dataset providing a series of (indepeendent) measurements of head acceleration in a simulated motorcycle accident (used to test crash helmets) is given in the datafile Motocycle (within the R packages mffSM).
data(Motorcycle, package = "mffSM")
help(Motorcycle)
head(Motorcycle)
## time haccel
## 1 2.4 0.0
## 2 2.6 -1.3
## 3 3.2 -2.7
## 4 3.6 0.0
## 5 4.0 -2.7
## 6 6.2 -2.7
dim(Motorcycle)
## [1] 133 2
summary(Motorcycle)
## time haccel
## Min. : 2.40 Min. :-134.00
## 1st Qu.:15.80 1st Qu.: -54.90
## Median :24.00 Median : -13.30
## Mean :25.55 Mean : -25.55
## 3rd Qu.:35.20 3rd Qu.: 0.00
## Max. :65.60 Max. : 75.00
We are primarily interested in how the head acceleration (covariate haccel
) depends on the time after the accident (given in miliseconds – the timecovaraite).
BTY <- "o";
par(mfrow = c(1, 1), bty = BTY, mar = c(4, 4, 1, 1) + 0.1)
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
xlab = "Time [ms]", ylab = "Head acceleration [g]")
fit0 = lm(haccel~time, data=Motorcycle)
abline(fit0)
Using the diagnostic plots, we can clearly see that the model denoted as fit0 is hardly appropriate:
plotLM(fit0)
As far as there is a quite flexible shape observed in the plot we need some more flexibile model to capture the structure of the underlying conditional expectation of the acceleraton given the time. We will use polynomials for fiting the model (while still staying within the linear regression framework).
### Fitting polynomial of various degrees
fit6 <- lm(haccel~poly(time, 6, raw=TRUE), data=Motorcycle)
fit12 <- lm(haccel~poly(time, 12, raw=TRUE), data=Motorcycle)
The model can be visualized as follows:
xgrid <- seq(0, 60, length = 500)
par(mfrow = c(1, 1), bty = BTY, mar = c(4, 4, 1, 1) + 0.1)
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
xlab = "Time [ms]", ylab = "Head acceleration [g]")
lines(xgrid, predict(fit6, newdata=data.frame(time=xgrid)), lwd=2, col="blue")
lines(xgrid, predict(fit12, newdata=data.frame(time=xgrid)), lwd=2,
col="darkgreen")
legend(37,-70, lty=rep(1, 2), col=c("blue", "darkgreen"),
legend=paste("p=", c(6,12),sep=""), lwd=c(2,2), cex=1.2)
Classical polynomials are, however, not computationally very effective. Especially for higher order of the polynomial approximation the solution tends to be unstable.
One possible solution how to introduce some reasonable stability into the model where some higher order polynomials are used, are splines.
### Choice of knots
knots <- c(0, 11, 12, 13, 20, 30, 32, 34, 40, 50, 60)
(inner <- knots[-c(1, length(knots))]) # inner knots
## [1] 11 12 13 20 30 32 34 40 50
(bound <- knots[c(1, length(knots))]) # boundary knots
## [1] 0 60
### B-splines of degree 3
DEGREE <- 3; # piecewise cubic
Bgrid <- bs(xgrid, knots = inner, Boundary.knots = bound, degree = DEGREE,
intercept = TRUE)
dim(Bgrid) # 13-dimensional linear space of piecewise polynomials
## [1] 500 13
COL <- rainbow_hcl(ncol(Bgrid), c = 80)
par(mfrow = c(1, 1), bty = BTY, mar = c(4, 4, 1, 1) + 0.1)
plot(xgrid, Bgrid[,1], type = "l", col = COL[1], lwd = 2, xlab = "x",
ylab = "B(x)", xlim = range(knots), ylim = range(Bgrid, na.rm = TRUE))
for (j in 2:ncol(Bgrid)) lines(xgrid, Bgrid[,j], col = COL[j], lwd = 2)
points(knots, rep(0, length(knots)), pch = 21, bg = "blue", col = "skyblue",
cex = 1.5)
Thus, we can used B-splines to fit the model.
More details about the B-splines can be found, for instance, here http://www.chebfun.org/examples/approx/BSplineConv.html
m1 <- lm(haccel ~ bs(time, knots = inner, Boundary.knots = bound,
degree = DEGREE, intercept = TRUE) - 1, data = Motorcycle)
names(m1$coefficients) <- paste("B", 1:13)
summary(m1)
##
## Call:
## lm(formula = haccel ~ bs(time, knots = inner, Boundary.knots = bound,
## degree = DEGREE, intercept = TRUE) - 1, data = Motorcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.857 -11.740 -0.268 10.356 55.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## B 1 -11.614 83.252 -0.140 0.889
## B 2 12.450 81.282 0.153 0.879
## B 3 -13.988 45.345 -0.308 0.758
## B 4 2.984 18.894 0.158 0.875
## B 5 6.115 12.139 0.504 0.615
## B 6 -237.283 18.850 -12.588 < 2e-16 ***
## B 7 17.346 14.094 1.231 0.221
## B 8 53.248 12.635 4.214 4.88e-05 ***
## B 9 5.102 13.051 0.391 0.697
## B 10 12.609 17.146 0.735 0.464
## B 11 -21.738 23.825 -0.912 0.363
## B 12 10.879 16.284 0.668 0.505
## B 13 7.885 17.624 0.447 0.655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.61 on 120 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8278
## F-statistic: 50.19 on 13 and 120 DF, p-value: < 2.2e-16
Note, that as the intercept is not formally included in in the model (because of the -1 in the model formula), neither ‘Multiple R-squared’ nor the ‘F-statistic’ quantity in the output are meaningful numbers.
### Indidividual work Can you explain why the values reported for \(R^2\) and \(R_{adj}^2\) are not appropriate? What is the right meaning of this numbers? Can you manually reconstruct them?
The \(R^2\) value from the model is
summary(m1)$r.squared
## [1] 0.8446597
and the manual calculations will give
1 - deviance(m1)/sum(Motorcycle$haccel^2)
## [1] 0.8446597
sum(fitted(m1)^2)/sum(Motorcycle$haccel^2)
## [1] 0.8446597
Interpretation the F-test from the summary output above is a test of a submodel
summary(m1)$fstatistic
## value numdf dendf
## 50.19211 13.00000 120.00000
# the usual F-test is the following
m2 <- lm(haccel ~ 1, data = Motorcycle);
anova(m2,m1)
## Analysis of Variance Table
##
## Model 1: haccel ~ 1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE,
## intercept = TRUE) - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 132 308223
## 2 120 61362 12 246861 40.23 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# in a model without intercept it is this one
(m0 <- lm(haccel ~ -1, data=Motorcycle))
##
## Call:
## lm(formula = haccel ~ -1, data = Motorcycle)
##
## No coefficients
anova(m0,m1)
## Analysis of Variance Table
##
## Model 1: haccel ~ -1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE,
## intercept = TRUE) - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 133 395017
## 2 120 61362 13 333655 50.192 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nevertheless, as the intercept paramter is in the linear span of the columns of $, it makes sense to calculcate the (usual) multiple \(R^2\) value. This can do it ‘manually’ as follows:
var(fitted(m1))/var(Motorcycle$haccel)
## [1] 0.8009163
cor(Motorcycle$haccel, fitted(m1))^2
## [1] 0.8009163
1 - deviance(m1)/deviance(m2)
## [1] 0.8009163
Alternatively one can use B-splines (the basis without intercept) with an intercept in the model to get some meaningful numbers in the output
m0 <- lm(haccel ~ bs(time, knots = inner, Boundary.knots = bound,
degree = DEGREE, intercept = FALSE), data = Motorcycle)
names(m0$coefficients) <- c("(Intercept)", paste("B", 1:12))
summary(m0)
##
## Call:
## lm(formula = haccel ~ bs(time, knots = inner, Boundary.knots = bound,
## degree = DEGREE, intercept = FALSE), data = Motorcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.857 -11.740 -0.268 10.356 55.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.614 83.252 -0.140 0.8893
## B 1 24.064 161.970 0.149 0.8821
## B 2 -2.374 57.835 -0.041 0.9673
## B 3 14.598 93.327 0.156 0.8760
## B 4 17.729 80.496 0.220 0.8261
## B 5 -225.669 88.698 -2.544 0.0122 *
## B 6 28.960 82.780 0.350 0.7271
## B 7 64.862 84.773 0.765 0.4457
## B 8 16.716 83.997 0.199 0.8426
## B 9 24.223 85.172 0.284 0.7766
## B 10 -10.124 86.455 -0.117 0.9070
## B 11 22.493 84.873 0.265 0.7915
## B 12 19.498 85.159 0.229 0.8193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.61 on 120 degrees of freedom
## Multiple R-squared: 0.8009, Adjusted R-squared: 0.781
## F-statistic: 40.23 on 12 and 120 DF, p-value: < 2.2e-16
anova(m2,m0)
## Analysis of Variance Table
##
## Model 1: haccel ~ 1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE,
## intercept = FALSE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 132 308223
## 2 120 61362 12 246861 40.23 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What about the interpretation of the classical diagnostic plots for the model fitted with B-slines?
plotLM(m1)