The R-software is available for download from the official website: https://www.r-project.org
A user-friendly interface (one of many): RStudio
Some useful manuals and tutorials for R (in Czech or English):
We have already discussed some clustering techniques where the available observations are assigned to groups with respect to their mutual similarities (depending on the similarity measure or the distance used to assess the differences among objects).
However, a fundamental model for clustering – a mixture model – is more complex than that. In general, it is standardly assumed that the observations are randomly drawn from K different populations, but instead of providing just the corresponding labels for the observations (labeling observations according to their assignment to the groups), one estimates the whole population distribution with the mixture model. Thus, much more complex information is obtained as a result. On the other hand, more assumptions are needed as well.
Let us consider the data on the fuel consumption of automobiles on the United States market in the 1970s (the dataset ‘mtcars’ available in R). The data consist of 32 observations (32 different cars available on the US market at that time) and 11 covariates (some of them continuous, some categorical). For additional information on this dataset use the help session in R (type ?mtcars).
rm(list = ls())
attach(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Using the given data, we can a priori assume three different populations – for instance, the populations determined by the number of cylinders (four, six, and eight cylinder cars). Each population has its own specific characteristics – such as the mean, variance, distribution, or density. Considering, for instance, the car consumption (miles per gallon), we can estimate the proportion of each population in the given sample and, imposing the normality assumption with unit variance, we can estimate the overall mixture model by estimating the corresponding mean values and putting everything together. The estimated mixture distribution is the following:
### population mean estimates
mean1 <- mean(mpg[cyl == 4])
mean2 <- mean(mpg[cyl == 6])
mean3 <- mean(mpg[cyl == 8])
### estimated proportions for each population
prop <- table(cyl)/dim(mtcars)[1]
### the overall mixture distribution
mixture <- function(x, prop, means){
  mixValue <- 0
  for (i in 1:length(prop)){
    mixValue <- mixValue + prop[i] * dnorm(x, means[i])
  }
  return(mixValue)
}
xSeq <- seq(10, 34, length = 100)
density <- mixture(xSeq, prop, c(mean1, mean2, mean3))
plot(density ~ xSeq, col = "red", type = "l", xlab = "Miles per Gallon", ylab = "Density")
The mixture is composed of three components (normal distributions with unit variance and the corresponding mean parameters).
plot(density ~ xSeq, col = "red", type = "l", lwd = 3, xlab = "Miles per Gallon", ylab = "Density")
d1 <- density(rnorm(10000, mean1, 1))
d2 <- density(rnorm(10000, mean2, 1))
d3 <- density(rnorm(10000, mean3, 1))
lines(d1$x, prop[1] * d1$y, col = "black", lwd = 2)
lines(d2$x, prop[2] * d2$y, col = "violet", lwd = 2)
lines(d3$x, prop[3] * d3$y, col = "gray", lwd = 2)
The joint density (mixture distribution) has multiple modes – one for each cluster mean \(\mu_k\), for \(k = 1, 2, 3\). However, for two centroids which are close enough, it is possible that only one mode will be present in the final mixture.
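To see this, one can compare (using base R only) a mixture of two unit-variance normals with close means against one with well separated means – the small sketch below is illustrative only and the means are chosen arbitrarily:
zSeq <- seq(-5, 9, length = 300)
mixClose <- 0.5 * dnorm(zSeq, 0) + 0.5 * dnorm(zSeq, 1.5)  ### means 1.5 apart: one mode only
mixFar   <- 0.5 * dnorm(zSeq, 0) + 0.5 * dnorm(zSeq, 4)    ### means 4 apart: two modes
plot(mixFar ~ zSeq, type = "l", col = "blue", xlab = "x", ylab = "Density")
lines(mixClose ~ zSeq, col = "red")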
In practical problems the mixture estimation becomes considerably more complex. It is common that the number of populations included in the mixture is not known a priori, or that the variance differs across populations.
Maximum likelihood methods are typically used to estimate the overall mixture model, but some regularity conditions (restrictions) are required to obtain a reasonable solution.
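For instance, a normal mixture can be fitted by maximum likelihood (via the EM algorithm) using the R package ‘mclust’, which is not part of the standard installation. The short sketch below (three components with a common variance, applied to the mpg variable) is meant as an illustration only:
library(mclust)    ### install.packages("mclust") if not available
fitE <- Mclust(mpg, G = 3, modelNames = "E")   ### three components, common variance
summary(fitE)
fitE$parameters$mean    ### estimated component means
fitE$parameters$pro     ### estimated mixing proportions
fitBIC <- Mclust(mpg)   ### number of components selected automatically via BIC
summary(fitBIC)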
What will happen if we take the same approach as above (a normal mixture model with unit variance), however, with no restriction on the number of the underlying populations? The maximization problem can be formulated as \[ \max \prod_{k = 1}^{K} \prod_{i = 1}^{n_k} f_k(X_{k i}) \equiv \max \prod_{k = 1}^{K} \prod_{i = 1}^{n_k} \frac{1}{\sqrt{2 \pi}} \cdot \exp \left\{-\frac{1}{2} (X_{k i} - \mu_k)^2\right\}, \] where the maximization takes place with respect to \(K \in \{1, \dots, n\}\) and \(\mu_1, \dots, \mu_K \in \mathbb{R}\) (with \(n \in \mathbb{N}\) being the overall number of observations and \(n_k \in \mathbb{N}\) the number of subjects from the \(k\)-th population – observations \(X_{k 1}, \dots, X_{k n_k}\)). Without further restrictions the likelihood is maximized by taking \(K = n\) and placing one component mean at each observed value, so the final mixture model will look like this:
plot(density ~ xSeq, col = "gray", type = "l", lwd = 2, xlab = "Miles per Gallon", ylab = "Density", lty = 2, ylim = c(0,0.5))
for (i in 1:length(mpg)){
  lines(rep(mpg[i], 2), c(dnorm(0), 0), col = "red", lwd = 1, lty = 3)
  points(mpg[i], dnorm(0), pch = 21, cex = 0.5, bg = "red")
}
Mixture models are also used in regression modelling. As an illustration, consider a GLM model (with a logarithmic link) for Poisson counts (for instance, modelling the number of positive COVID tests in some subpopulation units). Due to low prevalence it can be assumed that, very likely, no positive cases will be recorded while, occasionally, some occurrences will be observed. The overall model can be, for instance, expressed as
\[
Y_i | \boldsymbol{x}_i \sim \left\{
\begin{array}{ll}
0 & \textrm{with probability } p \in (0,1);\\
Poiss(\lambda_i) & \textrm{with probability } 1 - p.
\end{array} \right.
\] Here, the subpopulation characteristics (independent covariates) are given as \(\boldsymbol{x}_i\). In other words, the probability of observing no cases in a given subpopulation is \(P[Y_i = 0 | \boldsymbol{x}_i] = p + (1 - p) \cdot e^{- \lambda_i}\) and the probability of observing \(k > 0\) cases can be expressed as \(P[Y_i = k | \boldsymbol{x}_i] = (1 - p) \cdot \frac{\lambda_i^k}{k!}e^{- \lambda_i}\), where we assume a standard Poisson (GLM) regression with the logarithmic link to model the mean parameter of the Poisson part, \(\log \lambda_i = \boldsymbol{x}_i^\top \boldsymbol{\beta}\), for some unknown vector of parameters \(\boldsymbol{\beta}\). The final model is obtained by maximizing the corresponding likelihood.
Such mixture models are very common in practical applications. The model above is usually referred to as a zero-inflated Poisson regression model.
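Such a model can be fitted, for instance, with the function zeroinfl() from the R package ‘pscl’ (not used elsewhere in this text). The sketch below is purely illustrative – it simulates data with one artificial covariate x and a constant zero-inflation probability and then recovers the model by maximum likelihood:
library(pscl)   ### install.packages("pscl") if not available
set.seed(1)
n <- 500
x <- runif(n)                                 ### one artificial covariate
lambda <- exp(0.5 + 1.2 * x)                  ### Poisson mean under the logarithmic link
structuralZero <- rbinom(n, 1, 0.3)           ### structural zeros with probability p = 0.3
y <- ifelse(structuralZero == 1, 0, rpois(n, lambda))
fitZIP <- zeroinfl(y ~ x | 1, dist = "poisson")   ### constant zero-inflation part
summary(fitZIP)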
The canonical correlation analysis (CCA) is a multidimensional exploratory statistical method which (roughly) operates on the same principle as the principal component analysis (PCA). However, unlike the PCA, where one tries to explore the common variability within the given data set, the canonical correlation is used to explore the mutual variability (correlations) between two different sets of quantitative variables observed on the same experimental units. Thus, the PCA method deals with one dataset only and tries to reduce the overall dimensionality of the dataset using some linear combinations of the initial variables with the maximum variance possible, while the CCA approach searches for some linear combinations of the initial variables in both datasets in order to achieve the maximum correlation among them instead.
For instance, for two multivariate random vectors \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) we assume that the common variance-covariance matrix of \((\boldsymbol{X}, \boldsymbol{Y})^\top\) is of the form \[ Var \left(\begin{array}{c}\boldsymbol{X}\\ \boldsymbol{Y}\end{array}\right) = \left(\begin{array}{cc} \Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY} \end{array}\right). \]
If \(\boldsymbol{X}\) is a \(p\)-dimensional random vector and \(\boldsymbol{Y}\) is a \(q\)-dimensional random vector, then the correlation of any pair of linear combinations \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\), for \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\), can be expressed as \[ Cor(\boldsymbol{a}^\top \boldsymbol{X}, \boldsymbol{b}^\top\boldsymbol{Y}) = \frac{\boldsymbol{a}^\top \Sigma_{XY} \boldsymbol{b}}{\sqrt{\boldsymbol{a}^\top \Sigma_{XX} \boldsymbol{a}}\,\sqrt{\boldsymbol{b}^\top \Sigma_{YY} \boldsymbol{b}}}. \]
Therefore, the canonical correlation analysis searches for such vectors \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\) that the correlation above is as large as possible. However, the canonical correlation analysis does not search for just one pair of canonical variables \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\) but, instead, it defines all pairs (up to \(min(p, q)\)) such that the canonical variables within each dataset are uncorrelated with each other.
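For illustration only (simulated data, base R functions): the canonical correlations can be obtained as the square roots of the eigenvalues of the matrix \(\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\), which can be verified against the built-in function cancor():
set.seed(10)
Xsim <- matrix(rnorm(200 * 3), ncol = 3)
Ysim <- Xsim %*% matrix(rnorm(3 * 2), ncol = 2) + matrix(rnorm(200 * 2), ncol = 2)
Sxx <- cov(Xsim); Syy <- cov(Ysim); Sxy <- cov(Xsim, Ysim)
M <- solve(Sxx) %*% Sxy %*% solve(Syy) %*% t(Sxy)
sqrt(eigen(M)$values)   ### canonical correlations from the covariance blocks
cancor(Xsim, Ysim)$cor  ### the same values (up to numerical precision)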
In the R statistical software the function cancor() is available in the standard installation. Some additional packages can be downloaded and installed as well (mainly the R packages ‘CCA’ and ‘vegan’).
In the following we will consider the same dataset we have already worked with before. The data consist of two datasets where each represents measurements at some specific (64-65) river localities in the Czech Republic. The first dataset contains measurements of biological metrics for each locality (17 different metrics and taxa) and the second dataset contains measurements of chemical concentrations and values at the same localities (7 covariates recorded).
The idea is to somehow relate both datasets in order to explain which bio metrics correlate with which chemical concentrations. One way of answering this is to apply the canonical correlation approach as explained above.
rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)
head(bioData)
## Kod_Canoco SaprInd Lital RETI EPTAbu Marg Metaritr JepAbu
## 1 BecvChor 2.57620 31.96916 0.37453 26.90136 7.10911 16.25647 14.89330
## 2 BecvOsek 2.43686 28.12037 0.27790 23.32888 5.80443 14.14831 11.06343
## 3 BecvTrou 2.10235 21.30691 0.31422 36.49575 6.14622 11.21903 30.23011
## 4 BelaBosk 1.53297 37.12914 0.48181 38.49802 9.36742 25.32146 10.11182
## 5 BilyPoto 1.79499 27.21495 0.37100 23.26918 8.62000 17.88813 5.95577
## 6 BlatTova 2.90209 16.04854 0.30351 8.64187 5.51661 7.98030 4.71039
## Epiritral Hyporitral Ntaxon Nceled Bindex EPTTax PosAbu Spasaci
## 1 7.68869 22.52242 18 40 0.64865 23 0.33939 28.29352
## 2 6.96659 19.74992 12 25 0.37838 16 0.00000 20.84029
## 3 5.92213 18.63735 9 25 0.16216 12 1.47017 20.88415
## 4 20.37168 18.17237 16 40 0.72000 30 8.07504 28.62651
## 5 12.08359 19.09771 21 36 0.54167 21 0.54567 22.02324
## 6 5.23533 14.60371 15 27 0.09091 5 0.00000 16.78539
## MDS_1 MDS_2
## 1 -0.69831500 -0.7040517
## 2 -0.61323430 -0.3796887
## 3 -0.08536265 1.6051440
## 4 0.55758620 -0.4424091
## 5 0.40888510 -0.2172988
## 6 -0.76699300 0.5244099
head(chemData)
## Kod_Canoco Tepl_max X.O2 BSK5 Kond N.NH4 N.NO3 Pcelk
## 1 BecvChor 20.9 107 1.3 33.7 0.09 1.55 0.077
## 2 BecvOsek 21.8 105 1.2 37.0 0.06 2.32 0.063
## 3 BecvTrou 22.1 106 1.4 43.9 0.07 2.49 0.080
## 4 BelaBosk 16.4 104 1.1 22.1 0.02 2.05 0.029
## 5 BilyPoto 18.9 104 1.9 30.4 0.10 4.48 0.131
## 6 BrezJaro 18.4 84 2.5 83.8 0.30 3.63 0.237
Again, there is one locality which is missing in the dataset of chemical measurements. We identify this locality and do not consider it in the further analysis. Both datasets are aligned with respect to the locality names – the first column in each dataset.
ind <- match(chemData[,1], bioData[,1])   ### localities present in both datasets
data <- data.frame(bioData[ind, ], chemData[, 2:8])
X <- data[,2:9]    ### the first eight biological metrics
Y <- data[,19:25]  ### the seven chemical covariates
A nice graphical tool to view both datasets with respect to their overall (within and between) correlation structure is available within the R package ‘CCA’ (Canonical Correlation Analysis). The package needs to be installed first (install.packages("CCA")) and later one can use the functions matcor() and img.matcor() to graphically display the correlation matrix
\[ cor(\mathcal{X}, \mathcal{Y}) = \left(\begin{array}{cc}\Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY}\end{array}\right). \]
library("CCA")
correl <- matcor(X, Y )
img.matcor(correl, type = 2)
Next we can directly use the cancor() function to obtain canonical correlations for the datasets \(\mathcal{X}\) and \(\mathcal{Y}\). Another option is to use the function cc() (from the R package ‘CCA’). Note that the results are equivalent, although not identical.
cc1 <- cancor(X, Y) ### function from the standard R installation
cc2 <- cc(X, Y)     ### function from the R package 'CCA'
The canonical correlations are the same for both approaches (both R commands used):
cc1$cor ### function from the standard R installation
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277
cc2$cor ### function from the R package 'CCA'
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277
The canonical variables can be easily reconstructed as follows:
CanX <- as.matrix(X) %*% cc1$xcoef
CanY <- as.matrix(Y) %*% cc1$ycoef
cor(CanX, CanY)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8.749611e-01 -5.641494e-17 -2.837332e-16 2.876720e-16 -2.755136e-16
## [2,] 1.969775e-18 7.049659e-01 6.045359e-17 3.672712e-16 2.833397e-16
## [3,] 2.055352e-15 -6.475529e-16 6.457969e-01 -8.881386e-16 -4.425705e-17
## [4,] 5.598600e-16 -4.649601e-16 -7.193021e-16 4.875075e-01 -1.535789e-16
## [5,] -6.529418e-16 3.505207e-16 1.774706e-16 -7.011448e-18 2.420530e-01
## [6,] 1.389060e-16 -3.532212e-17 -3.754766e-16 -4.455948e-16 5.118790e-16
## [7,] -8.693158e-16 4.868860e-16 -2.229543e-16 2.685094e-19 3.028397e-17
## [8,] -1.861723e-16 3.088850e-16 -3.129291e-16 -4.233271e-16 -4.250917e-16
## [,6] [,7]
## [1,] 2.598782e-17 8.001412e-17
## [2,] -4.062772e-17 -1.190124e-16
## [3,] 5.531278e-16 1.034287e-16
## [4,] -2.964996e-16 -6.147935e-17
## [5,] -2.659521e-16 1.510269e-16
## [6,] 1.156383e-01 8.425013e-18
## [7,] -4.480597e-16 4.481277e-02
## [8,] 6.107525e-16 2.362837e-16
This corresponds to the output provided by the functions cancor() and cc() above (see also the figure below for a visual comparison).
par(mfrow = c(1,2))
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")
barplot(cc2$cor, main = "Canonical correlations for 'cc()'", col = "gray")
However, the actual estimated linear combinations (coefficients) of the original covariates in both datasets \(\mathcal{X}\) and \(\mathcal{Y}\) are different – they differ by a component-wise scaling only, as the two functions use different normalization conventions. Compare the following:
cc1$xcoef ### function from the standard R installation
## [,1] [,2] [,3] [,4] [,5]
## SaprInd 0.162796345 0.089714234 -0.161737589 -0.491927540 -0.186875530
## Lital -0.003800701 0.001797172 -0.030965801 -0.007214761 0.017517637
## RETI 0.324209881 -0.055178843 0.693084411 -1.283570658 -2.507634813
## EPTAbu -0.002149837 -0.005077822 -0.009027554 0.008345945 0.005867073
## Marg -0.011912887 -0.037901448 -0.009666364 -0.065884327 -0.025639954
## Metaritr -0.012221662 0.001551030 0.043003790 0.012178243 -0.003013795
## JepAbu 0.005168862 0.006308244 0.004048909 -0.009547646 -0.019794148
## Epiritral 0.010432924 0.023391927 -0.013976612 -0.021306313 0.018722542
## [,6] [,7] [,8]
## SaprInd -0.098864028 0.06897063 -0.355850803
## Lital 0.004465308 -0.01327305 0.008943729
## RETI 3.047758363 1.14371548 0.285767761
## EPTAbu -0.005401486 0.01277878 -0.026098676
## Marg -0.029969261 0.01102501 -0.002459819
## Metaritr -0.027620185 -0.04963561 -0.056954743
## JepAbu -0.005346595 -0.00253416 0.030139906
## Epiritral -0.022760565 0.04740075 0.064087475
cc2$xcoef ### function from the R package 'CCA'
## [,1] [,2] [,3] [,4] [,5]
## SaprInd -1.29215593 -0.71208465 -1.28375231 -3.90455380 -1.48327854
## Lital 0.03016713 -0.01426461 -0.24578343 -0.05726539 0.13904193
## RETI -2.57333615 0.43796849 5.50118696 -10.18802625 -19.90373428
## EPTAbu 0.01706380 0.04030396 -0.07165399 0.06624389 0.04656845
## Marg 0.09455561 0.30083342 -0.07672439 -0.52294063 -0.20351082
## Metaritr 0.09700644 -0.01231092 0.34133200 0.09666181 -0.02392126
## JepAbu -0.04102657 -0.05007014 0.03213722 -0.07578209 -0.15711118
## Epiritral -0.08280877 -0.18566767 -0.11093592 -0.16911362 0.14860557
## [,6] [,7]
## SaprInd -0.78470889 0.54743742
## Lital 0.03544229 -0.10535158
## RETI 24.19083206 9.07796019
## EPTAbu -0.04287297 0.10142843
## Marg -0.23787363 0.08750829
## Metaritr -0.21922842 -0.39397046
## JepAbu -0.04243728 -0.02011427
## Epiritral -0.18065638 0.37623175
cc1$ycoef
## [,1] [,2] [,3] [,4] [,5]
## Tepl_max 0.002728344 -0.031587883 -0.022171030 0.015570969 -0.026678176
## X.O2 0.001145406 -0.004145546 0.005890339 -0.012452124 -0.009542703
## BSK5 0.015042715 0.054146231 -0.027996459 0.124516411 -0.026809781
## Kond 0.002545019 0.001425917 -0.004281998 0.001343352 0.001807199
## N.NH4 0.307216652 0.432469799 1.131659010 0.087843642 -0.818023849
## N.NO3 -0.010468525 -0.030934549 0.053453456 0.031622860 0.022842863
## Pcelk 0.412696634 -0.979367658 0.196048060 -3.071986907 0.807813337
## [,6] [,7]
## Tepl_max 0.024294508 0.0072354037
## X.O2 -0.021080517 -0.0001162876
## BSK5 -0.122653042 -0.1252271666
## Kond -0.003536074 0.0038599055
## N.NH4 0.123847014 0.5657683896
## N.NO3 -0.011571285 0.0045994691
## Pcelk 1.447735216 -1.1452167044
cc2$ycoef
## [,1] [,2] [,3] [,4] [,5]
## Tepl_max -0.021655562 0.25072104 -0.17597710 0.12359074 -0.21175146
## X.O2 -0.009091381 0.03290426 0.04675311 -0.09883567 -0.07574285
## BSK5 -0.119397849 -0.42977239 -0.22221500 0.98831837 -0.21279604
## Kond -0.020200461 -0.01131786 -0.03398730 0.01066252 0.01434420
## N.NH4 -2.438456583 -3.43262261 8.98226493 0.69723729 -6.49286301
## N.NO3 0.083091342 0.24553537 0.42427365 0.25099867 0.18130960
## Pcelk -3.275677981 7.77348979 1.55608323 -24.38314016 6.41181959
## [,6] [,7]
## Tepl_max 0.19283168 0.0574292361
## X.O2 -0.16732142 -0.0009230045
## BSK5 -0.97352834 -0.9939598209
## Kond -0.02806672 0.0306370503
## N.NH4 0.98300520 4.4906473759
## N.NO3 -0.09184423 0.0365071542
## Pcelk 11.49104203 -9.0898757913
Using the ‘CCA’ package one can also take advantage of nice graphical tools which are available for the canonical correlation analysis. The idea is to display the maximized correlations between the transformed variables of the dataset \(\mathcal{X}\) and the dataset \(\mathcal{Y}\).
plt.cc(cc2, var.label = TRUE, ind.names = data[,1])
A useful extension of the classical canonical correlation approach is the regularized version – the so-called regularized canonical correlation analysis. It is meant for scenarios where the number of parameters \(p \in \mathbb{N}\) is greater than the sample size \(n \in \mathbb{N}\) (usually \(p \gg n\)). The R function rcc() is available in the R package ‘CCA’.
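A brief sketch of its usage on the data above (the regularization parameters below are chosen arbitrarily for illustration only – in practice they should be selected, e.g., by cross-validation using estim.regul() from the same package):
rcc1 <- rcc(as.matrix(X), as.matrix(Y), lambda1 = 0.1, lambda2 = 0.1)
rcc1$cor                         ### regularized canonical correlations
plt.cc(rcc1, var.label = TRUE)   ### the same type of graphical output as before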
Another option for performing canonical correlation analysis in R is to install some additional R packages – for instance, the package ‘vegan’ (use install.packages("vegan") for installation) and to apply the function cca(). Use the help session in R for more details about the function itself or the corresponding package respectively.
A related method, similar to canonical correlation analysis but not symmetric, uses linear combinations of one dataset to maximize the correlation with the original covariates in the second dataset. Thus, the result depends on which dataset is used as the original one.
Finally, we shortly return to the principal component analysis and briefly discuss some popular alternatives. The PCA approach is mostly meant as an exploratory tool. But what can it be used for after the data exploration is performed? One possible direction is to use the recovered principal components to explain a covariate of interest – a regression on the principal components. This is a classical regression technique; however, instead of using the underlying covariates, one includes the principal components in the model.
Sometimes it is also convenient to use both – the principal components for dimensionality reduction and the original covariates for better interpretability. Indeed, regressing the dependent covariate on the principal components poses a considerable challenge in terms of interpretation. Therefore, some alternative techniques for variable selection and dimensionality reduction have recently been considered in statistics.
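As a small illustration of a regression on principal components (using base R functions only; the response Pcelk and the choice of three components are arbitrary and serve for illustration only):
pcX <- prcomp(X, scale. = TRUE)           ### principal components of the bio metrics in X
pcReg <- lm(data$Pcelk ~ pcX$x[, 1:3])    ### regression on the first three component scores
summary(pcReg)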
A significant dimensionality reduction and, at the same time, a variable selection technique is provided by regularized regression based on the LASSO penalization. LASSO stands for the Least Absolute Shrinkage and Selection Operator and it was introduced by Tibshirani (1996). The main idea of the LASSO regularized (penalized) regression is the following: instead of minimizing
\[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2}, \]
with respect to the vector of regression coefficients \(\boldsymbol{\beta} \in \mathbb{R}^{p}\) for \(p < n\), one considers an alternative minimization defined as
\[ \|\boldsymbol{Y} - \mathbb{X}\boldsymbol{\beta}\|_{2}^{2} + \lambda \|\boldsymbol{\beta}\|_{1}, \]
where \(\lambda > 0\) is a regularization parameter and \(\|\cdot\|_{1}\) stands for the classical \(L_{1}\) norm. It is obvious that the larger the value of the regularization parameter \(\lambda > 0\), the more dominant the second term in the minimization problem above. Thus, the elements of \(\boldsymbol{\beta}\) are shrunk towards zero. Actually, for large values of \(\lambda\) (as \(\lambda \to \infty\)) most elements of the vector of regression parameters will be set to zero exactly. Such a solution is called a sparse solution (the vector of parameter estimates is sparse – it contains only a few non-zero elements).
On the other hand, for \(\lambda \to 0\) the first term in the minimization problem becomes dominant and, for \(\lambda = 0\), the whole problem reduces to classical least squares regression.
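The shrinkage effect can be seen explicitly in the special case of an orthonormal design matrix (\(\mathbb{X}^\top \mathbb{X} = \mathbb{I}_p\)): the LASSO solution of the penalized problem above is the soft-thresholded least squares estimate
\[ \widehat{\beta}_j^{LASSO} = \mathrm{sign}\big(\widehat{\beta}_j^{OLS}\big)\left(\big|\widehat{\beta}_j^{OLS}\big| - \frac{\lambda}{2}\right)_{+}, \qquad j = 1, \dots, p, \]
where \((\cdot)_{+}\) denotes the positive part – coefficients smaller (in absolute value) than \(\lambda/2\) are set exactly to zero, while the remaining ones are shifted towards zero by \(\lambda/2\).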
The main advantage of the LASSO regularized regression approach is that it also allows for heavily overparametrized models (situations where \(p \gg n\)). Such problems commonly occur in genetic data, sociology surveys, econometrics and basically all other areas…
Unlike the classical least squares problem, where the vector of parameter estimates is given explicitly as the solution of the normal equations, no explicit solution is generally available for the LASSO regression. But…
A very elegant solution to the LASSO fitting problem was obtained once it was realized that the solution paths are piecewise linear. Given this, one only needs to calculate the knot (breakpoint) positions, and the rest of the path is obtained by linear interpolation.
Using the LARS-LASSO algorithm proposed by Efron et al. (2004) we can do that in a very effective way. Actually in \(p\) linear steps only!!!
The whole algorithm is based on a geometric interpretation of the correlations between the current residual (the response minus its current estimate) and the covariates not yet included in the model.
source("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/larsLASSO.r")
In the source code loaded above there are two R functions: solutionPath() and solutionPathPlot().
solutionPath(X, Y, noJumps) – calculates the complete solution paths for the first ‘noJumps’ parameters entering the model. The value of ‘noJumps’ is an integer between one and the second dimension of \(X\) (the total number of predictors). The matrix \(X\) stands for a classical regression matrix of the type \(n \times p\); however, even situations with \(p \gg n\) are allowed. Finally, \(\boldsymbol{Y}\) is the response vector. As a result one obtains a list with two elements: ‘betaVector’ with the augmented parameter estimates in each row and ‘corrMatrix’, which is a matrix with the actual correlations between the current response estimate and the independent covariates.
solutionPathPlot(solutionVector, corrMatrix, xSeq) – a function which creates two plots: the complete solution path until ‘noJumps’ non-zero predictors enter the model and a plot of the actual correlation decrease at each step.
Y <- data$Pcelk                    ### the chemical covariate Pcelk used as the response
X <- as.matrix(data[,5:18])        ### fourteen biological covariates used as predictors
LARSsolution <- solutionPath(X, Y, 10)
betaVector <- LARSsolution[[1]]
corrMatrix <- LARSsolution[[2]]
And the graphical output can be obtained using this code:
par(mfrow = c(2,1))
solutionPathPlot(betaVector, corrMatrix, Y, data[,1])
There are a few useful libraries in R which implement the LASSO shrinkage method and the LARS algorithm. In addition, various modifications of the LASSO estimation approach have been proposed in the literature. Even though the LASSO is a very effective variable selection tool, it still suffers from some issues and disadvantages. Most of the proposed modifications aim to improve the LASSO performance, especially with respect to the selection performance and consistency (for instance, to achieve the oracle properties or sign consistency).
The most convenient package, which offers, besides the LASSO regression modelling, also ridge regression and the elastic net, is the library glmnet (use the command install.packages("glmnet") for installation and library("glmnet") for initialization in the R environment).
library("glmnet")
## Loaded glmnet 3.0-2
fit1 <- glmnet(X, Y)
plot(fit1)
coef(fit1, s = 0.01)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.037990e-01
## EPTAbu -5.509023e-05
## Marg .
## Metaritr -1.814878e-03
## JepAbu .
## Epiritral .
## Hyporitral -1.167139e-03
## Ntaxon .
## Nceled .
## Bindex .
## EPTTax -1.954037e-03
## PosAbu .
## Spasaci .
## MDS_1 -2.188195e-02
## MDS_2 .
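The value s = 0.01 used above was chosen only for illustration. The regularization parameter is usually selected by cross-validation, for which the same package offers the function cv.glmnet() – a short sketch:
cvfit <- cv.glmnet(X, Y)           ### ten-fold cross-validation by default
plot(cvfit)                        ### cross-validated error as a function of lambda
cvfit$lambda.min                   ### lambda minimizing the cross-validated error
coef(cvfit, s = "lambda.min")      ### corresponding (sparse) coefficient estimates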
Another library, which offers the LARS fitting approach for various models (for instance, LASSO, Least Angle Regression, Forward Stagewise, or stepwise regression), is the package lars (run the command install.packages("lars") to install the library and library("lars") to initialize it in the R environment).
See the next examples for a brief illustration:
library("lars")
object1 <- lars(X, Y, type = "lasso")
plot(object1)
object2 <- lars(X, Y, type = "lar")
plot(object2)
object3 <- lars(X, Y, type = "forward.stagewise")
plot(object3)
object4 <- lars(X, Y, type = "stepwise")
plot(object4)
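Also here the amount of shrinkage can be selected by cross-validation, using the function cv.lars() from the same package, for instance:
cvLars <- cv.lars(X, Y, K = 10, type = "lasso")   ### ten-fold cross-validation for the LASSO path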
Use the command below to install the SMSdata package with various multivariate datasets:
install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")
Choose one dataset of your choice from the list of all available datasets in the package:
data(package = "SMSdata")
There are 21 different datasets and you can load each of them by typing and running data(dataset_name)
, for instance:
data(plasma)
The final PDF document with all partial tasks assigned during the term should be finalized and submitted by email to maciak [AT] karlin.mff.cuni.cz by Friday, May 22nd, 2020. This only applies to students who intend to obtain the course credit.