The R-software is available for download from the official website: https://www.r-project.org
A user-friendly interface (one of many): RStudio
Some useful manuals and tutorials for R (in Czech or English):
We have already discussed some clustering techniques where the available observations are assigned to groups with respect to their mutual similarities (depending on the similarity measure or the distance used to assess the differences among objects).
However, a fundamental model for clustering – a mixture model – is more complex than that. In general, it is standardly assumed that the observations are randomly drawn from K different populations, but instead of providing just the corresponding labels for the observations (labeling observations according to their assignment to the groups), one estimates the whole population distribution with the mixture model. Thus, much more complex information is obtained as a result. On the other hand, more assumptions are needed as well.
Let us consider the data on the fuel consumption of automobiles on the United States market in the 1970s (the dataset ‘mtcars’ available in R). The data consist of 32 observations (32 different cars available on the US market at that time) and 11 covariates (some of them continuous, some categorical). For additional information on this dataset use the help session in R (type ?mtcars).
rm(list = ls())
attach(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Using the given data, we can a priori assume three different populations – for instance, the populations determined by the number of cylinders (four, six, and eight cylinder cars). Each population has its own specific characteristics – such as the mean, variance, distribution, or density. Considering, for instance, the car consumption (miles per gallon), we can estimate the proportion of each population in the given sample and, imposing the normality assumption with unit variance, we can estimate the overall mixture model by estimating the corresponding mean values and putting everything together. The estimated mixture distribution is the following:
### population mean estimates
mean1 <- mean(mpg[cyl == 4])
mean2 <- mean(mpg[cyl == 6])
mean3 <- mean(mpg[cyl == 8])
### estimated proportions for each population
prop <- table(cyl)/dim(mtcars)[1]
### the overall mixture distribution
mixture <- function(x, prop, means){
  mixValue <- 0
  for (i in 1:length(prop)){
    mixValue <- mixValue + prop[i] * dnorm(x, means[i])
  }
  return(mixValue)
}
xSeq <- seq(10, 34, length = 100)
density <- mixture(xSeq, prop, c(mean1, mean2, mean3))
plot(density ~ xSeq, col = "red", type = "l", xlab = "Miles per Gallon", ylab = "Density")
The mixture is composed of three components (normal distributions with unit variance and the corresponding mean parameters).
plot(density ~ xSeq, col = "red", type = "l", lwd = 3, xlab = "Miles per Gallon", ylab = "Density")
d1 <- density(rnorm(10000, mean1, 1))
d2 <- density(rnorm(10000, mean2, 1))
d3 <- density(rnorm(10000, mean3, 1))
lines(d1$x, prop[1] * d1$y, col = "black", lwd = 2)
lines(d2$x, prop[2] * d2$y, col = "violet", lwd = 2)
lines(d3$x, prop[3] * d3$y, col = "gray", lwd = 2)
The joint density (mixture distribution) has multiple modes – one for each cluster mean \(\mu_k\), for \(k = 1, 2, 3\). However, for two centroids which are close enough, it is possible that only one mode will be present in the final mixture.
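To see this, one can compare (using base R only) a mixture of two unit-variance normals with close means against one with well separated means – the small sketch below is illustrative only and the means are chosen arbitrarily:
zSeq <- seq(-5, 9, length = 300)
mixClose <- 0.5 * dnorm(zSeq, 0) + 0.5 * dnorm(zSeq, 1.5)  ### means 1.5 apart: one mode only
mixFar   <- 0.5 * dnorm(zSeq, 0) + 0.5 * dnorm(zSeq, 4)    ### means 4 apart: two modes
plot(mixFar ~ zSeq, type = "l", col = "blue", xlab = "x", ylab = "Density")
lines(mixClose ~ zSeq, col = "red")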
In practical problems the mixture estimation becomes considerably more complex. It is common that the number of populations included in the mixture is not known a priori, or that the variance differs across populations.
Maximum likelihood methods are typically used to estimate the overall mixture model, but some regularity conditions (restrictions) are required to obtain a reasonable solution.
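For instance, a normal mixture can be fitted by maximum likelihood (via the EM algorithm) using the R package ‘mclust’, which is not part of the standard installation. The short sketch below (three components with a common variance, applied to the mpg variable) is meant as an illustration only:
library(mclust)    ### install.packages("mclust") if not available
fitE <- Mclust(mpg, G = 3, modelNames = "E")   ### three components, common variance
summary(fitE)
fitE$parameters$mean    ### estimated component means
fitE$parameters$pro     ### estimated mixing proportions
fitBIC <- Mclust(mpg)   ### number of components selected automatically via BIC
summary(fitBIC)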
What will happen if we take the same approach as above (a normal mixture model with unit variance), however, with no restriction on the number of the underlying populations? The maximization problem can be formulated as \[ \max \prod_{k = 1}^{K} \prod_{i = 1}^{n_k} f_k(X_{k i}) \equiv \max \prod_{k = 1}^{K} \prod_{i = 1}^{n_k} \frac{1}{\sqrt{2 \pi}} \cdot \exp \left\{-\frac{1}{2} (X_{k i} - \mu_k)^2\right\}, \] where the maximization takes place with respect to \(K \in \{1, \dots, n\}\) and \(\mu_1, \dots, \mu_K \in \mathbb{R}\) (with \(n \in \mathbb{N}\) being the overall number of observations and \(n_k \in \mathbb{N}\) the number of subjects from the \(k\)-th population – observations \(X_{k 1}, \dots, X_{k n_k}\)). Without further restrictions the likelihood is maximized by taking \(K = n\) and placing one component mean at each observed value, so the final mixture model will look like this:
plot(density ~ xSeq, col = "gray", type = "l", lwd = 2, xlab = "Miles per Gallon", ylab = "Density", lty = 2, ylim = c(0,0.5))
for (i in 1:length(mpg)){
  lines(rep(mpg[i], 2), c(dnorm(0), 0), col = "red", lwd = 1, lty = 3)
  points(mpg[i], dnorm(0), pch = 21, cex = 0.5, bg = "red")
}
Mixture models are also used in regression modelling. As an illustration, consider a GLM model (with a logarithmic link) for Poisson counts (for instance, modelling the number of positive COVID tests in some subpopulation units). Due to low prevalence it can be assumed that, very likely, no positive cases will be recorded while, occasionally, some occurrences will be observed. The overall model can be, for instance, expressed as
\[
Y_i | \boldsymbol{x}_i \sim \left\{
\begin{array}{ll}
0 & \textrm{with probability } p \in (0,1);\\
Poiss(\lambda_i) & \textrm{with probability } 1 - p.
\end{array} \right.
\] Here, the subpopulation characteristics (independent covariates) are given as \(\boldsymbol{x}_i\). In other words, the probability of observing no cases in a given subpopulation is \(P[Y_i = 0 | \boldsymbol{x}_i] = p + (1 - p) \cdot e^{- \lambda_i}\) and the probability of observing \(k > 0\) cases can be expressed as \(P[Y_i = k | \boldsymbol{x}_i] = (1 - p) \cdot \frac{\lambda_i^k}{k!}e^{- \lambda_i}\), where we assume a standard Poisson (GLM) regression with the logarithmic link to model the mean parameter of the Poisson part, \(\log \lambda_i = \boldsymbol{x}_i^\top \boldsymbol{\beta}\), for some unknown vector of parameters \(\boldsymbol{\beta}\). The final model is obtained by maximizing the corresponding likelihood.
Such mixture models are very common in practical applications. The model above is usually referred to as a zero-inflated Poisson regression model.
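Such a model can be fitted, for instance, with the function zeroinfl() from the R package ‘pscl’ (not used elsewhere in this text). The sketch below is purely illustrative – it simulates data with one artificial covariate x and a constant zero-inflation probability and then recovers the model by maximum likelihood:
library(pscl)   ### install.packages("pscl") if not available
set.seed(1)
n <- 500
x <- runif(n)                                 ### one artificial covariate
lambda <- exp(0.5 + 1.2 * x)                  ### Poisson mean under the logarithmic link
structuralZero <- rbinom(n, 1, 0.3)           ### structural zeros with probability p = 0.3
y <- ifelse(structuralZero == 1, 0, rpois(n, lambda))
fitZIP <- zeroinfl(y ~ x | 1, dist = "poisson")   ### constant zero-inflation part
summary(fitZIP)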
The canonical correlation analysis (CCA) is a multidimensional exploratory statistical method which (roughly) operates on the same principle as the principal component analysis (PCA). However, unlike the PCA, where one tries to explore the common variability within the given data set, the canonical correlation is used to explore the mutual variability (correlations) between two different sets of quantitative variables observed on the same experimental units. Thus, the PCA method deals with one dataset only and tries to reduce the overall dimensionality of the dataset using some linear combinations of the initial variables with the maximum variance possible, while the CCA approach searches for some linear combinations of the initial variables in both datasets in order to achieve the maximum correlation among them instead.
For instance, for two multivariate random vectors \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) we assume that the common variance-covariance matrix of \((\boldsymbol{X}, \boldsymbol{Y})^\top\) is of the form \[ Var \left(\begin{array}{c}\boldsymbol{X}\\ \boldsymbol{Y}\end{array}\right) = \left(\begin{array}{cc} \Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY} \end{array}\right). \]
If \(\boldsymbol{X}\) is a \(p\)-dimensional random vector and \(\boldsymbol{Y}\) is a \(q\)-dimensional random vector, then the correlation of any pair of linear combinations \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\), for \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\), can be expressed as \[ Cor(\boldsymbol{a}^\top \boldsymbol{X}, \boldsymbol{b}^\top\boldsymbol{Y}) = \frac{\boldsymbol{a}^\top \Sigma_{XY} \boldsymbol{b}}{\sqrt{\boldsymbol{a}^\top \Sigma_{XX} \boldsymbol{a}}\,\sqrt{\boldsymbol{b}^\top \Sigma_{YY} \boldsymbol{b}}}. \]
Therefore, the canonical correlation analysis searches for such vectors \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\) that the correlation above is as large as possible. However, the canonical correlation analysis does not search for just one pair of canonical variables \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\) but, instead, it defines all pairs (up to \(min(p, q)\)) such that the canonical variables within each dataset are uncorrelated with each other.
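For illustration only (simulated data, base R functions): the canonical correlations can be obtained as the square roots of the eigenvalues of the matrix \(\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\), which can be verified against the built-in function cancor():
set.seed(10)
Xsim <- matrix(rnorm(200 * 3), ncol = 3)
Ysim <- Xsim %*% matrix(rnorm(3 * 2), ncol = 2) + matrix(rnorm(200 * 2), ncol = 2)
Sxx <- cov(Xsim); Syy <- cov(Ysim); Sxy <- cov(Xsim, Ysim)
M <- solve(Sxx) %*% Sxy %*% solve(Syy) %*% t(Sxy)
sqrt(eigen(M)$values)   ### canonical correlations from the covariance blocks
cancor(Xsim, Ysim)$cor  ### the same values (up to numerical precision)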
In the R statistical software the function cancor() is available in the standard installation. Some additional packages can be downloaded and installed as well (mainly the R packages ‘CCA’ and ‘vegan’).
In the following we will consider the same dataset we have already worked with before. The data consist of two datasets where each represents measurements at some specific (64-65) river localities in the Czech Republic. The first dataset contains measurements of biological metrics for each locality (17 different metrics and taxa) and the second dataset contains measurements of chemical concentrations and values at the same localities (7 covariates recorded).
The idea is to somehow relate both datasets in order to explain which bio metrics correlate with which chemical concentrations. One way of answering this is to apply the canonical correlation approach as explained above.
rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)
head(bioData)
## Kod_Canoco SaprInd Lital RETI EPTAbu Marg Metaritr JepAbu
## 1 BecvChor 2.57620 31.96916 0.37453 26.90136 7.10911 16.25647 14.89330
## 2 BecvOsek 2.43686 28.12037 0.27790 23.32888 5.80443 14.14831 11.06343
## 3 BecvTrou 2.10235 21.30691 0.31422 36.49575 6.14622 11.21903 30.23011
## 4 BelaBosk 1.53297 37.12914 0.48181 38.49802 9.36742 25.32146 10.11182
## 5 BilyPoto 1.79499 27.21495 0.37100 23.26918 8.62000 17.88813 5.95577
## 6 BlatTova 2.90209 16.04854 0.30351 8.64187 5.51661 7.98030 4.71039
## Epiritral Hyporitral Ntaxon Nceled Bindex EPTTax PosAbu Spasaci
## 1 7.68869 22.52242 18 40 0.64865 23 0.33939 28.29352
## 2 6.96659 19.74992 12 25 0.37838 16 0.00000 20.84029
## 3 5.92213 18.63735 9 25 0.16216 12 1.47017 20.88415
## 4 20.37168 18.17237 16 40 0.72000 30 8.07504 28.62651
## 5 12.08359 19.09771 21 36 0.54167 21 0.54567 22.02324
## 6 5.23533 14.60371 15 27 0.09091 5 0.00000 16.78539
## MDS_1 MDS_2
## 1 -0.69831500 -0.7040517
## 2 -0.61323430 -0.3796887
## 3 -0.08536265 1.6051440
## 4 0.55758620 -0.4424091
## 5 0.40888510 -0.2172988
## 6 -0.76699300 0.5244099
head(chemData)
## Kod_Canoco Tepl_max X.O2 BSK5 Kond N.NH4 N.NO3 Pcelk
## 1 BecvChor 20.9 107 1.3 33.7 0.09 1.55 0.077
## 2 BecvOsek 21.8 105 1.2 37.0 0.06 2.32 0.063
## 3 BecvTrou 22.1 106 1.4 43.9 0.07 2.49 0.080
## 4 BelaBosk 16.4 104 1.1 22.1 0.02 2.05 0.029
## 5 BilyPoto 18.9 104 1.9 30.4 0.10 4.48 0.131
## 6 BrezJaro 18.4 84 2.5 83.8 0.30 3.63 0.237
Again, there is one locality which is missing in the dataset of chemical measurements. We identify this locality and do not consider it in the further analysis. Both datasets are aligned with respect to the locality names – the first column in each dataset.
ind <- match(chemData[,1], bioData[,1])   ### localities present in both datasets
data <- data.frame(bioData[ind, ], chemData[, 2:8])
X <- data[,2:9]    ### the first eight biological metrics
Y <- data[,19:25]  ### the seven chemical covariates
A nice graphical tool to view both datasets with respect to their overall (within and between) correlation structure is available within the R package ‘CCA’ (Canonical Correlation Analysis). The package needs to be installed first (install.packages("CCA")) and later one can use the functions matcor() and img.matcor() to graphically display the correlation matrix
\[ cor(\mathcal{X}, \mathcal{Y}) = \left(\begin{array}{cc}\Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY}\end{array}\right). \]
library("CCA")
correl <- matcor(X, Y )
img.matcor(correl, type = 2)
Next we can directly use the cancor() function to obtain canonical correlations for the datasets \(\mathcal{X}\) and \(\mathcal{Y}\). Another option is to use the function cc() (from the R package ‘CCA’). Note that the results are equivalent, although not identical.
cc1 <- cancor(X, Y) ### function from the standard R installation
cc2 <- cc(X, Y)     ### function from the R package 'CCA'
The canonical correlations are the same for both approaches (both R commands used):
cc1$cor ### function from the standard R installation
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277
cc2$cor ### function from the R package 'CCA'
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277
The canonical variables can be easily reconstructed as follows:
CanX <- as.matrix(X) %*% cc1$xcoef
CanY <- as.matrix(Y) %*% cc1$ycoef
cor(CanX, CanY)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8.749611e-01 -5.641494e-17 -2.837332e-16 2.876720e-16 -2.755136e-16
## [2,] 1.969775e-18 7.049659e-01 6.045359e-17 3.672712e-16 2.833397e-16
## [3,] 2.055352e-15 -6.475529e-16 6.457969e-01 -8.881386e-16 -4.425705e-17
## [4,] 5.598600e-16 -4.649601e-16 -7.193021e-16 4.875075e-01 -1.535789e-16
## [5,] -6.529418e-16 3.505207e-16 1.774706e-16 -7.011448e-18 2.420530e-01
## [6,] 1.389060e-16 -3.532212e-17 -3.754766e-16 -4.455948e-16 5.118790e-16
## [7,] -8.693158e-16 4.868860e-16 -2.229543e-16 2.685094e-19 3.028397e-17
## [8,] -1.861723e-16 3.088850e-16 -3.129291e-16 -4.233271e-16 -4.250917e-16
## [,6] [,7]
## [1,] 2.598782e-17 8.001412e-17
## [2,] -4.062772e-17 -1.190124e-16
## [3,] 5.531278e-16 1.034287e-16
## [4,] -2.964996e-16 -6.147935e-17
## [5,] -2.659521e-16 1.510269e-16
## [6,] 1.156383e-01 8.425013e-18
## [7,] -4.480597e-16 4.481277e-02
## [8,] 6.107525e-16 2.362837e-16
This corresponds to the output provided by the functions cancor() and cc() above (see also the figure below for a visual comparison).
par(mfrow = c(1,2))
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")
barplot(cc2$cor, main = "Canonical correlations for 'cc()'", col = "gray")
However, the actual estimated linear combinations (coefficients) of the original covariates in both datasets \(\mathcal{X}\) and \(\mathcal{Y}\) are different – they differ by a component-wise scaling only, as the two functions use different normalization conventions. Compare the following:
cc1$xcoef ### function from the standard R installation
## [,1] [,2] [,3] [,4] [,5]
## SaprInd 0.162796345 0.089714234 -0.161737589 -0.491927540 -0.186875530
## Lital -0.003800701 0.001797172 -0.030965801 -0.007214761 0.017517637
## RETI 0.324209881 -0.055178843 0.693084411 -1.283570658 -2.507634813
## EPTAbu -0.002149837 -0.005077822 -0.009027554 0.008345945 0.005867073
## Marg -0.011912887 -0.037901448 -0.009666364 -0.065884327 -0.025639954
## Metaritr -0.012221662 0.001551030 0.043003790 0.012178243 -0.003013795
## JepAbu 0.005168862 0.006308244 0.004048909 -0.009547646 -0.019794148
## Epiritral 0.010432924 0.023391927 -0.013976612 -0.021306313 0.018722542
## [,6] [,7] [,8]
## SaprInd -0.098864028 0.06897063 -0.355850803
## Lital 0.004465308 -0.01327305 0.008943729
## RETI 3.047758363 1.14371548 0.285767761
## EPTAbu -0.005401486 0.01277878 -0.026098676
## Marg -0.029969261 0.01102501 -0.002459819
## Metaritr -0.027620185 -0.04963561 -0.056954743
## JepAbu -0.005346595 -0.00253416 0.030139906
## Epiritral -0.022760565 0.04740075 0.064087475
cc2$xcoef ### function from the R package 'CCA'
## [,1] [,2] [,3] [,4] [,5]
## SaprInd -1.29215593 -0.71208465 -1.28375231 -3.90455380 -1.48327854
## Lital 0.03016713 -0.01426461 -0.24578343 -0.05726539 0.13904193
## RETI -2.57333615 0.43796849 5.50118696 -10.18802625 -19.90373428
## EPTAbu 0.01706380 0.04030396 -0.07165399 0.06624389 0.04656845
## Marg 0.09455561 0.30083342 -0.07672439 -0.52294063 -0.20351082
## Metaritr 0.09700644 -0.01231092 0.34133200 0.09666181 -0.02392126
## JepAbu -0.04102657 -0.05007014 0.03213722 -0.07578209 -0.15711118
## Epiritral -0.08280877 -0.18566767 -0.11093592 -0.16911362 0.14860557
## [,6] [,7]
## SaprInd -0.78470889 0.54743742
## Lital 0.03544229 -0.10535158
## RETI 24.19083206 9.07796019
## EPTAbu -0.04287297 0.10142843
## Marg -0.23787363 0.08750829
## Metaritr -0.21922842 -0.39397046
## JepAbu -0.04243728 -0.02011427
## Epiritral -0.18065638 0.37623175
cc1$ycoef
## [,1] [,2] [,3] [,4] [,5]
## Tepl_max 0.002728344 -0.031587883 -0.022171030 0.015570969 -0.026678176
## X.O2 0.001145406 -0.004145546 0.005890339 -0.012452124 -0.009542703
## BSK5 0.015042715 0.054146231 -0.027996459 0.124516411 -0.026809781
## Kond 0.002545019 0.001425917 -0.004281998 0.001343352 0.001807199
## N.NH4 0.307216652 0.432469799 1.131659010 0.087843642 -0.818023849
## N.NO3 -0.010468525 -0.030934549 0.053453456 0.031622860 0.022842863
## Pcelk 0.412696634 -0.979367658 0.196048060 -3.071986907 0.807813337
## [,6] [,7]
## Tepl_max 0.024294508 0.0072354037
## X.O2 -0.021080517 -0.0001162876
## BSK5 -0.122653042 -0.1252271666
## Kond -0.003536074 0.0038599055
## N.NH4 0.123847014 0.5657683896
## N.NO3 -0.011571285 0.0045994691
## Pcelk 1.447735216 -1.1452167044
cc2$ycoef
## [,1] [,2] [,3] [,4] [,5]
## Tepl_max -0.021655562 0.25072104 -0.17597710 0.12359074 -0.21175146
## X.O2 -0.009091381 0.03290426 0.04675311 -0.09883567 -0.07574285
## BSK5 -0.119397849 -0.42977239 -0.22221500 0.98831837 -0.21279604
## Kond -0.020200461 -0.01131786 -0.03398730 0.01066252 0.01434420
## N.NH4 -2.438456583 -3.43262261 8.98226493 0.69723729 -6.49286301
## N.NO3 0.083091342 0.24553537 0.42427365 0.25099867 0.18130960
## Pcelk -3.275677981 7.77348979 1.55608323 -24.38314016 6.41181959
## [,6] [,7]
## Tepl_max 0.19283168 0.0574292361
## X.O2 -0.16732142 -0.0009230045
## BSK5 -0.97352834 -0.9939598209
## Kond -0.02806672 0.0306370503
## N.NH4 0.98300520 4.4906473759
## N.NO3 -0.09184423 0.0365071542
## Pcelk 11.49104203 -9.0898757913
Using the ‘CCA’ package one can also take advantage of nice graphical tools which are available for the canonical correlation analysis. The idea is to display the maximized correlations between the transformed variables of the dataset \(\mathcal{X}\) and the dataset \(\mathcal{Y}\).
plt.cc(cc2, var.label = TRUE, ind.names = data[,1])
A useful extension of the classical canonical correlation approach is the regularized version – the so-called regularized canonical correlation analysis. It is meant for scenarios where the number of parameters \(p \in \mathbb{N}\) is greater than the sample size \(n \in \mathbb{N}\) (usually \(p \gg n\)). The R function rcc() is available in the R package ‘CCA’.
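A brief sketch of its usage on the data above (the regularization parameters below are chosen arbitrarily for illustration only – in practice they should be selected, e.g., by cross-validation using estim.regul() from the same package):
rcc1 <- rcc(as.matrix(X), as.matrix(Y), lambda1 = 0.1, lambda2 = 0.1)
rcc1$cor                         ### regularized canonical correlations
plt.cc(rcc1, var.label = TRUE)   ### the same type of graphical output as before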
Another option for performing canonical correlation analysis in R is to install some additional R packages – for instance, the package ‘vegan’ (use install.packages("vegan") for installation) and to apply the function cca(). Use the help session in R for more details about the function itself or the corresponding package respectively.
A related method, similar to canonical correlation analysis but not symmetric, uses linear combinations of one dataset to maximize the correlation with the original covariates in the second dataset. Thus, the result depends on which dataset is used as the original one.
Finally, we shortly return to the principal component analysis and briefly discuss some popular alternatives. The PCA approach is mostly meant as an exploratory tool. But what can it be used for after the data exploration is performed? One possible direction is to use the recovered principal components to explain a covariate of interest – a regression on the principal components. This is a classical regression technique; however, instead of using the underlying covariates, one includes the principal components in the model.
Sometimes it is also convenient to use both – the principal components for dimensionality reduction and the original covariates for better interpretability. Indeed, regressing the dependent covariate on the principal components poses a considerable challenge in terms of interpretation. Therefore, some alternative techniques for variable selection and dimensionality reduction have recently been considered in statistics.
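As a small illustration of a regression on principal components (using base R functions only; the response Pcelk and the choice of three components are arbitrary and serve for illustration only):
pcX <- prcomp(X, scale. = TRUE)           ### principal components of the bio metrics in X
pcReg <- lm(data$Pcelk ~ pcX$x[, 1:3])    ### regression on the first three component scores
summary(pcReg)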
A significant dimensionality reduction and, at the same time, a variable selection technique is provided by regularized regression based on the LASSO penalization. LASSO stands for the Least Absolute Shrinkage and Selection Operator and it was introduced by Tibshirani (1996). The main idea of the LASSO regularized (penalized) regression is the following: instead of minimizing
\[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2}, \]
with respect to the vector of regression coefficients \(\boldsymbol{\beta} \in \mathbb{R}^{p}\) for \(p < n\), one considers an alternative minimization defined as
\[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2} + \lambda \|\boldsymbol{\beta}\|_{1}, \]
where \(\lambda > 0\) is a regularization parameter and \(\|\cdot\|_{1}\) stands for the classical \(L_{1}\) norm. It is obvious that the larger the value of the regularization parameter \(\lambda > 0\), the more dominant the second term in the minimization problem above. Thus, the elements of \(\boldsymbol{\beta}\) are shrunk towards zero. Actually, for large values of \(\lambda\) (as \(\lambda \to \infty\)) most elements of the vector of regression parameters will be set to zero exactly. Such a solution is called a sparse solution (the vector of parameter estimates is sparse – it contains only a few non-zero elements).
On the other hand, for \(\lambda \to 0\) the first term in the minimization problem becomes dominant and, for \(\lambda = 0\), the whole problem reduces to classical least squares regression.
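The shrinkage effect can be seen explicitly in the special case of an orthonormal design matrix (\(\mathbb{X}^\top \mathbb{X} = \mathbb{I}_p\)): the LASSO solution of the penalized problem above is the soft-thresholded least squares estimate
\[ \widehat{\beta}_j^{LASSO} = \mathrm{sign}\big(\widehat{\beta}_j^{OLS}\big)\left(\big|\widehat{\beta}_j^{OLS}\big| - \frac{\lambda}{2}\right)_{+}, \qquad j = 1, \dots, p, \]
where \((\cdot)_{+}\) denotes the positive part – coefficients smaller (in absolute value) than \(\lambda/2\) are set exactly to zero, while the remaining ones are shifted towards zero by \(\lambda/2\).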
The main advantage of the LASSO regularized regression approach is that it also allows for heavily overparametrized models (situations where \(p \gg n\)). Such problems commonly occur in genetic data, sociology surveys, econometrics and basically all other areas…
Unlike the classical least squares problem, where the vector of parameter estimates is given explicitly as the solution of the normal equations, no explicit solution is generally available for the LASSO regression. But…
A very elegant solution to the LASSO fitting problem was obtained once it was realized that the solution paths are piecewise linear. Given this, one only needs to calculate the knot (breakpoint) positions, and the rest of the path is obtained by linear interpolation.
Using the LARS-LASSO algorithm proposed by Efron et al. (2004) we can do that in a very effective way. Actually in \(p\) linear steps only!!!
The whole algorithm is based on a geometric interpretation of the correlations between the current residual (the response minus its current estimate) and the covariates not yet included in the model.
source("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/larsLASSO.r")
In the source code loaded above there are two R functions: solutionPath() and solutionPathPlot().
solutionPath(X, Y, noJumps) – calculates the complete solution paths for the first ‘noJumps’ parameters entering the model. The value of ‘noJumps’ is an integer between one and the second dimension of \(X\) (the total number of predictors). The matrix \(X\) stands for a classical regression matrix of the type \(n \times p\); however, even situations with \(p \gg n\) are allowed. Finally, \(\boldsymbol{Y}\) is the response vector. As a result one obtains a list with two elements: ‘betaVector’ with the augmented parameter estimates in each row and ‘corrMatrix’, which is a matrix with the actual correlations between the current response estimate and the independent covariates.
solutionPathPlot(solutionVector, corrMatrix, xSeq) – a function which creates two plots: the complete solution path until ‘noJumps’ non-zero predictors enter the model and a plot of the actual correlation decrease at each step.
Y <- data$Pcelk                    ### the chemical covariate Pcelk used as the response
X <- as.matrix(data[,5:18])        ### fourteen biological covariates used as predictors
LARSsolution <- solutionPath(X, Y, 10)
betaVector <- LARSsolution[[1]]
corrMatrix <- LARSsolution[[2]]
And the graphical output can be obtained using this code:
par(mfrow = c(2,1))
solutionPathPlot(betaVector, corrMatrix, Y, data[,1])
There are a few useful libraries in R which implement the LASSO shrinkage method and the LARS algorithm. In addition, various modifications of the LASSO estimation approach have been proposed in the literature. Even though the LASSO is a very effective variable selection tool, it still suffers from some issues and disadvantages. Most of the proposed modifications aim to improve the LASSO performance, especially with respect to the selection performance and consistency (for instance, to achieve the oracle properties or sign consistency).
The most convenient package, which offers, besides the LASSO regression modelling, also ridge regression and the elastic net, is the library glmnet (use the command install.packages("glmnet") for installation and library("glmnet") for initialization in the R environment).
library("glmnet")
## Loaded glmnet 3.0-2
fit1 <- glmnet(X, Y)
plot(fit1)
coef(fit1, s = 0.01)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.037990e-01
## EPTAbu -5.509023e-05
## Marg .
## Metaritr -1.814878e-03
## JepAbu .
## Epiritral .
## Hyporitral -1.167139e-03
## Ntaxon .
## Nceled .
## Bindex .
## EPTTax -1.954037e-03
## PosAbu .
## Spasaci .
## MDS_1 -2.188195e-02
## MDS_2 .
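The value s = 0.01 used above was chosen only for illustration. The regularization parameter is usually selected by cross-validation, for which the same package offers the function cv.glmnet() – a short sketch:
cvfit <- cv.glmnet(X, Y)           ### ten-fold cross-validation by default
plot(cvfit)                        ### cross-validated error as a function of lambda
cvfit$lambda.min                   ### lambda minimizing the cross-validated error
coef(cvfit, s = "lambda.min")      ### corresponding (sparse) coefficient estimates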
Another library, which offers the LARS fitting approach for various models (for instance, LASSO, Least Angle Regression, Forward Stagewise, or stepwise regression), is the package lars (run the command install.packages("lars") to install the library and library("lars") to initialize it in the R environment).
See the next examples for a brief illustration:
library("lars")
object1 <- lars(X, Y, type = "lasso")
plot(object1)
object2 <- lars(X, Y, type = "lar")
plot(object2)
object3 <- lars(X, Y, type = "forward.stagewise")
plot(object3)
object4 <- lars(X, Y, type = "stepwise")
plot(object4)
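Also here the amount of shrinkage can be selected by cross-validation, using the function cv.lars() from the same package, for instance:
cvLars <- cv.lars(X, Y, K = 10, type = "lasso")   ### ten-fold cross-validation for the LASSO path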
Use the command below to install the SMSdata package with various multivariate datasets:
install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")
Choose one dataset of your choice from the list of all available datasets in the package:
data(package = "SMSdata")
There are 21 different datasets and you can load each of them by typing and running data(dataset_name)
, for instance:
data(plasma)
The final PDF document with all partial tasks assigned during the term should be finalized and submitted by email to maciak [AT] karlin.mff.cuni.cz by Friday, May 22nd, 2020. This only applies to students who intend to obtain the course credit.