NMST539 | Lab Session 13

Canonical correlations & correspondance analysis

Summer Term 2021/2022 | 10/05/22

source file (UTF8 encoding)

The R software is available for download (free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

Bína, V., Komárek, A. a Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
De Vries, A. a Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)

Theoretical references

Hardle, W. and Simar.L.: Applied Multivariate Statistical Analysis. Springer, 2015
Mardia, K., Kent, J., and Bibby, J.:Multivariate Analysis, Academic Press, 1979.

1. Canonical Correlation Analysis in R

The canonical correlation analysis (CCA) is a multidimensional exploratory statistical method which (roughly) operates on the same principle as the principal component analysis (PCA). However, unlike the PCA where one tries to explore the common variability within just one given data set, the canonical correlation analysis is used to explore the mutual variability (correlations) between two different datasets of quantitative variables observed on the same experimental units. Thus, the PCA method deals with one dataset only and it tries to reduce the overall dimensionality of the dataset using some linear combinations of the initial variables with the maximum variace possible while the CCA approach searches for some linear combinations of the initial variables in both datasets achieving maximum correlations amonng them instead.

Considering a theoretical model for CCA, there are two multivariate random vectors \(\boldsymbol{X} = (X_1, \dots, X_p)^\top\) and \(\boldsymbol{Y} = (Y_1, \dots, Y_q)^\top\) which are assumed to have a common variance-covariance matrix (variance-covariance of the overall random vector \((\boldsymbol{X}^\top, \boldsymbol{Y}^\top)^\top\)) is of the form \[ Var \left(\begin{array}{c}\boldsymbol{X}\\ \boldsymbol{Y}\end{array}\right) = \left(\begin{array}{cc} \Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY} \end{array}\right). \]

If \(\boldsymbol{X}\) is a \(p\) dimensional random vector and \(\boldsymbol{Y}\) is a \(q\) dimensional random vector, then the corresponding correlation of any linear combination of \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) in terms of \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\), for \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\) can be simply expressed as \[ Cor(\boldsymbol{a}^\top \boldsymbol{X}, \boldsymbol{b}^\top\boldsymbol{Y}) = \boldsymbol{a}^\top \Sigma_{XY} \boldsymbol{b}. \]

Thus, the canonical correlations analysis searches for such vectors \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\), that the previous correlation is as large as possible. Theoretically speaking, the maximum correlation is achieved for \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\) being defined as \[ \boldsymbol{a} = \Sigma_{XX}^{-1/2} \boldsymbol{\gamma} \qquad \textrm{and} \qquad \boldsymbol{b} = \Sigma_{YY}^{-1/2}\boldsymbol{\delta}, \] where \(\boldsymbol{\gamma} \in \mathbb{R}^p\) and \(\boldsymbol{\delta} \in \mathbb{R}^q\) are the first eigen vectors of the matrix \(\mathcal{K}\mathcal{K}^\top\) and \(\mathcal{K}^\top\mathcal{K}\) respectively, where \[ \mathcal{K} = \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}. \] However, the canonical correlation analysis does not search for just one pair of the canonical variables \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\) but, instead, it defines all possible pairs (up to \(min(p, q)\)) with a decreasing correlations (similarly as a decreasing variance for the PCA directions), such that all canonical covariates within each dataset are uncorrelated among each other (and each pair of canonical correlations explains the same amount of the mutual variability within its own dataset).

From the empirical point of view the canonical correlation analysis takes two datasets meassured on the same units (e.g., a dataset \(\mathcal{X}\) which is a data matrix of the type \(n \times p\) and a dataset \(\mathcal{Y}\), which is a data matrix of the type \(n \times q\)) and produces a pair of new datasets \(\widetilde{\mathcal{X}}\) and \(\widetilde{\mathcal{Y}}\) (both of the same type \(n \times k\), where \(k = min(p, q)\)) such that the new covariates in each dataset are defined as linear combinations of the original covariates. Instead of the theoretical quantities mentioned above one just needs to use the corresponding finite sample surrogates. Finally, the corresponding linear combinations are determined by the estimated parameters \(\boldsymbol{a}_1, \dots, \boldsymbol{a}_k \in \mathbb{R}^p\) and \(\boldsymbol{b}_1, \dots, \boldsymbol{b}_k \in \mathbb{R}^q\) The new covariates within each data set \(\widetilde{\mathcal{X}}\) and \(\widetilde{\mathcal{Y}}\) are, moreover, mutually uncorrelated.

In the statistical software R there is a function called cancor() available under the standard instalation. Some additional packages can be downloaded and installed as well (in particular, the R packages ‘CCA’ and ‘vegan’ can be usefull).

In the following we will consider the same dataset we have already worked with before. The data consists of two datasets where each dataset represents measurements at some specific (64-65) geographical river localities in the Czech republic. The first dataset contains measurements of the biologial quality (diversity in terms of 17 some well defined metrics, indexes, and taxons) for each locality and the secon dataset contains measurements on some related chemical concentrations at the same localities (7 different covariates recored).

The idea is to somehow relate both dataset (biological and chemical) in order to explain what bio metrics can correlate with some of the underlying chemical concentrations. One way of answering this is to apply the canonical correlation approach as explained above.

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)

head(bioData)

##   Kod_Canoco SaprInd    Lital    RETI   EPTAbu    Marg Metaritr   JepAbu
## 1   BecvChor 2.57620 31.96916 0.37453 26.90136 7.10911 16.25647 14.89330
## 2   BecvOsek 2.43686 28.12037 0.27790 23.32888 5.80443 14.14831 11.06343
## 3   BecvTrou 2.10235 21.30691 0.31422 36.49575 6.14622 11.21903 30.23011
## 4   BelaBosk 1.53297 37.12914 0.48181 38.49802 9.36742 25.32146 10.11182
## 5   BilyPoto 1.79499 27.21495 0.37100 23.26918 8.62000 17.88813  5.95577
## 6   BlatTova 2.90209 16.04854 0.30351  8.64187 5.51661  7.98030  4.71039
##   Epiritral Hyporitral Ntaxon Nceled  Bindex EPTTax  PosAbu  Spasaci
## 1   7.68869   22.52242     18     40 0.64865     23 0.33939 28.29352
## 2   6.96659   19.74992     12     25 0.37838     16 0.00000 20.84029
## 3   5.92213   18.63735      9     25 0.16216     12 1.47017 20.88415
## 4  20.37168   18.17237     16     40 0.72000     30 8.07504 28.62651
## 5  12.08359   19.09771     21     36 0.54167     21 0.54567 22.02324
## 6   5.23533   14.60371     15     27 0.09091      5 0.00000 16.78539
##         MDS_1      MDS_2
## 1 -0.69831500 -0.7040517
## 2 -0.61323430 -0.3796887
## 3 -0.08536265  1.6051440
## 4  0.55758620 -0.4424091
## 5  0.40888510 -0.2172988
## 6 -0.76699300  0.5244099

head(chemData)

##   Kod_Canoco Tepl_max X.O2 BSK5 Kond N.NH4 N.NO3 Pcelk
## 1   BecvChor     20.9  107  1.3 33.7  0.09  1.55 0.077
## 2   BecvOsek     21.8  105  1.2 37.0  0.06  2.32 0.063
## 3   BecvTrou     22.1  106  1.4 43.9  0.07  2.49 0.080
## 4   BelaBosk     16.4  104  1.1 22.1  0.02  2.05 0.029
## 5   BilyPoto     18.9  104  1.9 30.4  0.10  4.48 0.131
## 6   BrezJaro     18.4   84  2.5 83.8  0.30  3.63 0.237

Recall that there is one locality which is missing in the dataset of the chemical measurements. The locality can be indentified and excluded from any further analysis. Thus, both datasets are alligned with respect to the locality names – the first column in each dataset (observations in two datasets are all measured on the same experimental units).

ind <- match(chemData[,1], bioData[,1])
data <- data.frame(bioData[ind, ], chemData[, 2:8])

X <- data[,2:9]
Y <- data[,19:25]

The crutial step for CCA is to get the mutual variance-covariance matrix \(\Sigma = Var(\boldsymbol{X}, \boldsymbol{Y})\) which is, however, unknown. The corresponding estimate is used instead. A nice graphical tool to view both datasets with respect to their overall (within and between) correlation structure is avialable within the R package ‘CCA’ (Canonical Correlation Analysis). The packages needs to be firstly installed (install.packages("CCA")) and later one can use functions matcor() and img.matcor() to graphically display the sample version of the corration matrix \(cor(\mathcal{X}, \mathcal{Y})\).

library("CCA")
correl <- matcor(X, Y )
img.matcor(correl, type = 2)

Next, one can already use the cancor() function to obtain all canonical correlations for the pair of the datasets \(\mathcal{X}\) and \(\mathcal{Y}\). Another option is to use the function cc() (from the R package ‘CCA’). Note, the the results are equivalent although not identical.

cc1 <- cancor(X, Y)  ### function from standard R instalation
cc2 <- cc(X, Y)      ### function for the R package 'CCA'

The canonical correlations are the same for both R commands

cc1$cor  ### function from standard R instalation

## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830 0.04481277

cc2$cor  ### function for the R package 'CCA'

## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830 0.04481277

The canonical covariates can be easily reconstructred as follows:

CanX <- as.matrix(X) %*% cc1$xcoef
CanY <- as.matrix(Y) %*% cc1$ycoef

cor(CanX, CanY)

##               [,1]          [,2]          [,3]          [,4]          [,5]
## [1,]  8.749611e-01 -5.641494e-17 -2.837332e-16  2.876720e-16 -2.755136e-16
## [2,]  1.969775e-18  7.049659e-01  6.045359e-17  3.672712e-16  2.833397e-16
## [3,]  2.055352e-15 -6.475529e-16  6.457969e-01 -8.881386e-16 -4.425705e-17
## [4,]  5.598600e-16 -4.649601e-16 -7.193021e-16  4.875075e-01 -1.535789e-16
## [5,] -6.529418e-16  3.505207e-16  1.774706e-16 -7.011448e-18  2.420530e-01
## [6,]  1.389060e-16 -3.532212e-17 -3.754766e-16 -4.455948e-16  5.118790e-16
## [7,] -8.693158e-16  4.868860e-16 -2.229543e-16  2.685094e-19  3.028397e-17
## [8,] -1.861723e-16  3.088850e-16 -3.129291e-16 -4.233271e-16 -4.250917e-16
##               [,6]          [,7]
## [1,]  2.598782e-17  8.001412e-17
## [2,] -4.062772e-17 -1.190124e-16
## [3,]  5.531278e-16  1.034287e-16
## [4,] -2.964996e-16 -6.147935e-17
## [5,] -2.659521e-16  1.510269e-16
## [6,]  1.156383e-01  8.425013e-18
## [7,] -4.480597e-16  4.481277e-02
## [8,]  6.107525e-16  2.362837e-16

This corresponds with the output provided by the function cancor() and cc()above (see also the picture below for a visual comparison).

par(mfrow = c(1,2))
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")
barplot(cc2$cor, main = "Canonical correlations for 'cc()'", col = "gray")

Compare also the estimates of the variance-covariance matrices based on \(\widetilde{\mathcal{X}}\) and \(\widetilde{\mathcal{Y}}\) with the original estimates obtained from the original data \(\mathcal{X}\) and \(\mathcal{Y}\).

However, the actual estimated linear combinations of the original covariates in both datasets \(\mathcal{X}\) and \(\mathcal{Y}\) are different. Compare the following:

cc1$xcoef  ### function from standard R instalation

##                   [,1]         [,2]         [,3]         [,4]         [,5]
## SaprInd    0.162796345  0.089714234 -0.161737589 -0.491927540 -0.186875530
## Lital     -0.003800701  0.001797172 -0.030965801 -0.007214761  0.017517637
## RETI       0.324209881 -0.055178843  0.693084411 -1.283570658 -2.507634813
## EPTAbu    -0.002149837 -0.005077822 -0.009027554  0.008345945  0.005867073
## Marg      -0.011912887 -0.037901448 -0.009666364 -0.065884327 -0.025639954
## Metaritr  -0.012221662  0.001551030  0.043003790  0.012178243 -0.003013795
## JepAbu     0.005168862  0.006308244  0.004048909 -0.009547646 -0.019794148
## Epiritral  0.010432924  0.023391927 -0.013976612 -0.021306313  0.018722542
##                   [,6]        [,7]         [,8]
## SaprInd   -0.098864028  0.06897063 -0.355850803
## Lital      0.004465308 -0.01327305  0.008943729
## RETI       3.047758363  1.14371548  0.285767761
## EPTAbu    -0.005401486  0.01277878 -0.026098676
## Marg      -0.029969261  0.01102501 -0.002459819
## Metaritr  -0.027620185 -0.04963561 -0.056954743
## JepAbu    -0.005346595 -0.00253416  0.030139906
## Epiritral -0.022760565  0.04740075  0.064087475

cc2$xcoef  ### function for the R package 'CCA'

##                  [,1]        [,2]        [,3]         [,4]         [,5]
## SaprInd   -1.29215593 -0.71208465 -1.28375231  -3.90455380  -1.48327854
## Lital      0.03016713 -0.01426461 -0.24578343  -0.05726539   0.13904193
## RETI      -2.57333615  0.43796849  5.50118696 -10.18802625 -19.90373428
## EPTAbu     0.01706380  0.04030396 -0.07165399   0.06624389   0.04656845
## Marg       0.09455561  0.30083342 -0.07672439  -0.52294063  -0.20351082
## Metaritr   0.09700644 -0.01231092  0.34133200   0.09666181  -0.02392126
## JepAbu    -0.04102657 -0.05007014  0.03213722  -0.07578209  -0.15711118
## Epiritral -0.08280877 -0.18566767 -0.11093592  -0.16911362   0.14860557
##                  [,6]        [,7]
## SaprInd   -0.78470889  0.54743742
## Lital      0.03544229 -0.10535158
## RETI      24.19083206  9.07796019
## EPTAbu    -0.04287297  0.10142843
## Marg      -0.23787363  0.08750829
## Metaritr  -0.21922842 -0.39397046
## JepAbu    -0.04243728 -0.02011427
## Epiritral -0.18065638  0.37623175

cc1$ycoef

##                  [,1]         [,2]         [,3]         [,4]         [,5]
## Tepl_max  0.002728344 -0.031587883 -0.022171030  0.015570969 -0.026678176
## X.O2      0.001145406 -0.004145546  0.005890339 -0.012452124 -0.009542703
## BSK5      0.015042715  0.054146231 -0.027996459  0.124516411 -0.026809781
## Kond      0.002545019  0.001425917 -0.004281998  0.001343352  0.001807199
## N.NH4     0.307216652  0.432469799  1.131659010  0.087843642 -0.818023849
## N.NO3    -0.010468525 -0.030934549  0.053453456  0.031622860  0.022842863
## Pcelk     0.412696634 -0.979367658  0.196048060 -3.071986907  0.807813337
##                  [,6]          [,7]
## Tepl_max  0.024294508  0.0072354037
## X.O2     -0.021080517 -0.0001162876
## BSK5     -0.122653042 -0.1252271666
## Kond     -0.003536074  0.0038599055
## N.NH4     0.123847014  0.5657683896
## N.NO3    -0.011571285  0.0045994691
## Pcelk     1.447735216 -1.1452167044

cc2$ycoef

##                  [,1]        [,2]        [,3]         [,4]        [,5]
## Tepl_max -0.021655562  0.25072104 -0.17597710   0.12359074 -0.21175146
## X.O2     -0.009091381  0.03290426  0.04675311  -0.09883567 -0.07574285
## BSK5     -0.119397849 -0.42977239 -0.22221500   0.98831837 -0.21279604
## Kond     -0.020200461 -0.01131786 -0.03398730   0.01066252  0.01434420
## N.NH4    -2.438456583 -3.43262261  8.98226493   0.69723729 -6.49286301
## N.NO3     0.083091342  0.24553537  0.42427365   0.25099867  0.18130960
## Pcelk    -3.275677981  7.77348979  1.55608323 -24.38314016  6.41181959
##                 [,6]          [,7]
## Tepl_max  0.19283168  0.0574292361
## X.O2     -0.16732142 -0.0009230045
## BSK5     -0.97352834 -0.9939598209
## Kond     -0.02806672  0.0306370503
## N.NH4     0.98300520  4.4906473759
## N.NO3    -0.09184423  0.0365071542
## Pcelk    11.49104203 -9.0898757913

Using the ‘CCA’ package one can also take an advantage of nice graphical tools which are available for the canonical correlation analysis. The idea is to display maximized correlations between transformed variables of the dataset \(\mathcal{X}\) and the dataset \(\mathcal{Y}\).

plt.cc(cc2, var.label = TRUE, ind.names = data[,1])

On the left panel, there are two sets of the original covariates from the original datasets (biological metrics and chemical concentrations), On the right panel, there are original observations (geographical locations) representated on the plane defined by the first two canonical variates.

More details can be found, for instance, in this paper.

Questions

How would you interpret the left hand side of the figure?

What is the relationship between the subfigure on the left hand side and the subfigure on the right hand side? Can you reconstruct the figure on the right?

plot(CanX[,1], CanX[,2], pch = 22,bg = "gray", xlab = "Canonical dimension 1", ylab = "Canonical dimension 2")
text(CanX[,1], CanX[,2], labels = data[,1], pos = 4, cex = 0.8)

What would be your conclussion about the two available dataset and their mutual relationship?

Beside the classical canonical correlation approach there are also some usefull generalizations, for instance, a regularized version – so called regularized canonical correlation analysis. It is meant for scenarios where the number of parameters \(p \in \mathbb{N}\) is greater than the sample size \(n \in \mathbb{N}\) (usually \(p \gg n\)). To perform a regularized version of canonical correlations, there is an R function rcc() available in the R package ‘CCA’.

Another option how to perform canonical correlation in R is to install some additional R packages – for instance, the R package ‘vegan’ (use install.packages("vegan") for instalation) – and to apply the command cca(). Use the help session in R for more details about the function itself or the corresponding package respectively.

Individual tasks

Use the plots above and set different canonical directions for plots. Compare the outputs.
Consider a lower dimensional profile and try to refit an analogous canonical correlation analysis on a lower dimensional datasets (e.g., two covariates for the biological data and two covariates for the chemical data only).
Try to understand the details you can see on the figures.

Note

Redundancy Analysis

Another commonly known multivariate method very similar to the method of the canonical correlations is the redundancy analysis. The idea behind is very similar – to use quantitative (numerical) variables available in the first data set and to maximize mutual correlations between some appropriate linear combinations of these covariates with some other covariates in the second dataset. Thus, linear combinations are not used for both but only for one data set. Thus, the redundancy analysis can be seen as an assymetric version of the method of canonical correlations.

Analogously, the method can be also seen as some generalization of the linear regression framework where the underlying covariates taken from one dataset play a role of independent variables and their linear combination is used to linearly explain some covariate(s) from the other dataset (dependent covariate(s)).

Using the notation from the very begining, the redundancy analysis can be seen as a maximization problem where one maximizes \[ cor(\boldsymbol{a}^\top\boldsymbol{X}, \boldsymbol{b}^\top\boldsymbol{Y}) \]

under the restrction that either \(\boldsymbol{a} = \boldsymbol{e}_i \in \mathbb{R}^p\) or \(\boldsymbol{b} = \boldsymbol{e}_1 \in \mathbb{R}^q\) where \(\boldsymbol{e}_i\) is a unit vector with the value of one on the location \(i\) and zeros otherwise.

There are various packages for the redundancy analysis available in the R software – e.g., the R library vegan, or segRDA.

2. Correspondance analysis in R

The correspondance analysis (CA) can be seen as a discrete analogy of the canonical correlation analysis (CCA). The correspondance analysis is, therefore, meant for qualitative (discrete) type of the data. Similarly as CCA, it also looks for a mutual correlation structure between two discrete random vectors – it look for some dependence patterns among the rows and columns of a contingency table. It can be also seen as some generalization of the principal component analysis or the factor analysis.

From the theoretical point of view, in the correspondence analysis we intend to look at the residuals of each cell of a two-way contingency table. Residuals quantify the difference between the observed data and the data we would expect under the assumption that there is no relationship between the row and column categories (recall a standard \(\chi^2\) test of independence in a contingency table).

This is a typical question related to contingency tables in general – an independence between two random vectors – a multinomial random vector \(\boldsymbol{X}\) and another multinomial random vector \(\boldsymbol{Y}\). However, the main question of interest may be even more complex. Indeed, one can be interested in some specific rows (columns respectively) while looking for some dependence patterns between some linear combinations of the given categories. In other words, one can be interested in a particular effect of a specific row-column combination to the overall dependence (correlation) structure.

The correspondance analysis in R is implemented within the command corresp() (available under the library ‘MASS’). Additional packages are, for instance, ‘ca’, ‘ade4’,‘FactoMineR’, or ‘factoextra’.

Using the mtcars data set for illustration we can be interested in a specific contingency table for the number of cylinders (variable cyl) and the number of transmission gears (variable gear).

X <- mtcars[, "cyl"]
Y <- mtcars[, "gear"]

table(X, Y)

##    Y
## X    3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2

The output of the correspondance analysis performed by the standard command corresp() from the R library ‘MASS’ is as follows:

library(MASS)
(corAn <- corresp(~ Y + X))

## First canonical correlation(s): 0.750054 
## 
##  Y scores:
##          3          4          5 
## -0.9949425  1.1929373  0.1217779 
## 
##  X scores:
##          4          6          8 
##  1.0656337  0.5530348 -1.1138011

plot(corAn, xlab = "Motor cylinders", ylab = "Transmission gears")

The pictures above illustrate the underlying contingency table however, beside the table itself, there is also some additional information regarding the dependence structure included in the plot. The cells are not equidistantly spanned and they are also not ordered. For more details see the help session ?corresp for the R command corresp().

Individual task

Standard \(\chi^2\) test of independence:

chisq.test(table(X, Y))

## 
##  Pearson's Chi-squared test
## 
## data:  table(X, Y)
## X-squared = 18.036, df = 4, p-value = 0.001214

Additional information can be found here: FactoMineR / factoextra

Homework Assignment

(Deadline: Before requesting the credit)

Complete all partial homework assignments if you are interested in obtaining the final credit.