NMST539 | Lab Session 13
Canonical correlations & correspondence analysis
Summer Term 2021/2022 | 10/05/22

The R software is available for download (free of charge) from the website: https://www.r-project.org
A user-friendly interface (one of many): RStudio.
1. Canonical Correlation Analysis in R

The canonical correlation analysis (CCA) is a multidimensional exploratory statistical method which (roughly) operates on the same principle as the principal component analysis (PCA). However, unlike PCA, which explores the common variability within a single dataset, CCA explores the mutual variability (correlations) between two different datasets of quantitative variables observed on the same experimental units. Thus, PCA deals with one dataset only and tries to reduce its overall dimensionality using linear combinations of the initial variables with the maximum possible variance, while CCA searches for linear combinations of the initial variables in both datasets that achieve the maximum correlation between them.

Considering a theoretical model for CCA, there are two multivariate random vectors \(\boldsymbol{X} = (X_1, \dots, X_p)^\top\) and \(\boldsymbol{Y} = (Y_1, \dots, Y_q)^\top\) whose joint variance-covariance matrix (the variance-covariance matrix of the overall random vector \((\boldsymbol{X}^\top, \boldsymbol{Y}^\top)^\top\)) is of the form \[ Var \left(\begin{array}{c}\boldsymbol{X}\\ \boldsymbol{Y}\end{array}\right) = \left(\begin{array}{cc} \Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY} \end{array}\right). \] If \(\boldsymbol{X}\) is a \(p\) dimensional random vector and \(\boldsymbol{Y}\) is a \(q\) dimensional random vector, then the correlation of the linear combinations \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\), for \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\), can be expressed as \[ Cor(\boldsymbol{a}^\top \boldsymbol{X}, \boldsymbol{b}^\top\boldsymbol{Y}) = \frac{\boldsymbol{a}^\top \Sigma_{XY} \boldsymbol{b}}{\sqrt{\boldsymbol{a}^\top \Sigma_{XX} \boldsymbol{a}}\,\sqrt{\boldsymbol{b}^\top \Sigma_{YY} \boldsymbol{b}}}, \] which reduces to \(\boldsymbol{a}^\top \Sigma_{XY} \boldsymbol{b}\) under the normalization \(\boldsymbol{a}^\top \Sigma_{XX} \boldsymbol{a} = \boldsymbol{b}^\top \Sigma_{YY} \boldsymbol{b} = 1\).

Thus, the canonical correlation analysis searches for vectors \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\) such that the previous correlation is as large as possible. Theoretically speaking, the maximum correlation is achieved for \(\boldsymbol{a} \in \mathbb{R}^p\) and \(\boldsymbol{b} \in \mathbb{R}^q\) defined as \[
\boldsymbol{a} = \Sigma_{XX}^{-1/2} \boldsymbol{\gamma} \qquad \textrm{and} \qquad \boldsymbol{b} = \Sigma_{YY}^{-1/2}\boldsymbol{\delta},
\] where \(\boldsymbol{\gamma} \in \mathbb{R}^p\) and \(\boldsymbol{\delta} \in \mathbb{R}^q\) are the first eigenvectors (those corresponding to the largest eigenvalue) of the matrices \(\mathcal{K}\mathcal{K}^\top\) and \(\mathcal{K}^\top\mathcal{K}\) respectively, where \[
\mathcal{K} = \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}.
\]

However, the canonical correlation analysis does not search for just one pair of canonical variables \(\boldsymbol{a}^\top \boldsymbol{X}\) and \(\boldsymbol{b}^\top\boldsymbol{Y}\) but, instead, it defines all possible pairs (up to \(min(p, q)\)) with decreasing correlations (similarly to the decreasing variances of the PCA directions), such that all canonical covariates within each dataset are mutually uncorrelated (and, for unit-norm eigenvectors \(\boldsymbol{\gamma}\) and \(\boldsymbol{\delta}\), each canonical variate has unit variance).

From the empirical point of view, the canonical correlation analysis takes two datasets measured on the same units (e.g., a dataset \(\mathcal{X}\), which is a data matrix of the type \(n \times p\), and a dataset \(\mathcal{Y}\), which is a data matrix of the type \(n \times q\)) and produces a pair of new datasets \(\widetilde{\mathcal{X}}\) and \(\widetilde{\mathcal{Y}}\) (both of the same type \(n \times k\), where \(k = min(p, q)\)) such that the new covariates in each dataset are defined as linear combinations of the original covariates. Instead of the theoretical quantities mentioned above, one just uses the corresponding finite sample surrogates. Finally, the corresponding linear combinations are determined by the estimated parameters \(\boldsymbol{a}_1, \dots, \boldsymbol{a}_k \in \mathbb{R}^p\) and \(\boldsymbol{b}_1, \dots, \boldsymbol{b}_k \in \mathbb{R}^q\). The new covariates within each dataset \(\widetilde{\mathcal{X}}\) and \(\widetilde{\mathcal{Y}}\) are, moreover, mutually uncorrelated.

The statistical software R provides a function for the canonical correlation analysis. In the following, we will consider the same dataset we have already worked with before. The data consist of two datasets, each representing measurements at specific (64-65) geographical river localities in the Czech Republic.
The first dataset contains measurements of the biological quality (diversity in terms of some 17 well-defined metrics, indices, and taxa) for each locality, and the second dataset contains measurements of some related chemical concentrations at the same localities (7 different covariates recorded). The idea is to relate both datasets (biological and chemical) in order to explain which biological metrics correlate with some of the underlying chemical concentrations. One way of answering this is to apply the canonical correlation approach as explained above.
Recall that there is one locality which is missing in the dataset of the chemical measurements. The locality can be identified and excluded from any further analysis. Thus, both datasets are aligned with respect to the locality names, given in the first column of each dataset (observations in the two datasets are all measured on the same experimental units).
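The alignment step described above can be sketched in base R as follows. The two small data frames (`bio`, `chem`), their column names, and all values are hypothetical stand-ins invented for illustration; they are not the river data used in the lab.

```r
# Hypothetical toy data: 'bio' has five localities, 'chem' lacks one of them
bio <- data.frame(locality = c("A", "B", "C", "D", "E"),
                  metric1  = c(1.2, 0.7, 3.1, 2.4, 1.9),
                  metric2  = c(10, 12, 9, 14, 11),
                  stringsAsFactors = FALSE)
chem <- data.frame(locality = c("E", "A", "D", "B"),
                   conc1    = c(0.31, 0.12, 0.25, 0.40),
                   stringsAsFactors = FALSE)

# Localities present in 'bio' but missing in 'chem'
missing_loc <- setdiff(bio$locality, chem$locality)

# Keep only the common localities, in the same order in both datasets
common  <- intersect(bio$locality, chem$locality)
bio_al  <- bio[match(common, bio$locality), ]
chem_al <- chem[match(common, chem$locality), ]

# Rows of both datasets now refer to the same experimental units
stopifnot(identical(bio_al$locality, chem_al$locality))
```

Matching by name (rather than relying on the row order) also guards against the two files listing the localities in a different order.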
The crucial step for CCA is to get the joint variance-covariance matrix \(\Sigma = Var(\boldsymbol{X}, \boldsymbol{Y})\) which is, however, unknown. The corresponding estimate is used instead. A nice graphical tool to view both datasets with respect to their overall (within and between) correlation structure is available within the R package ‘CCA’ (Canonical Correlation Analysis). The package needs to be installed first.
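The estimation step can be sketched in base R by computing the sample covariance matrix of the stacked data and splitting it into the four blocks from the theoretical model. The data below are simulated (an assumption made only to keep the example self-contained), and the object names `Sxx`, `Sxy`, `Syx`, `Syy` are chosen to mirror \(\Sigma_{XX}, \Sigma_{XY}, \Sigma_{YX}, \Sigma_{YY}\).

```r
set.seed(539)
n <- 60
Z <- rnorm(n)  # shared latent component inducing correlation between X and Y

# Simulated stand-ins for the two datasets (NOT the river data)
X <- cbind(Z + rnorm(n), rnorm(n), Z + rnorm(n))
Y <- cbind(Z + rnorm(n), rnorm(n))
p <- ncol(X); q <- ncol(Y)
colnames(X) <- paste0("x", 1:p)
colnames(Y) <- paste0("y", 1:q)

# Joint (p + q) x (p + q) sample variance-covariance matrix
S <- cov(cbind(X, Y))

# Blocks corresponding to the theoretical partition
Sxx <- S[1:p, 1:p]
Syy <- S[(p + 1):(p + q), (p + 1):(p + q)]
Sxy <- S[1:p, (p + 1):(p + q)]
Syx <- t(Sxy)

round(S, 2)
```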
Next, one can already use the
The canonical correlations are the same for both R commands
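As an independent cross-check, the canonical correlations can also be obtained directly from the theoretical construction: they are the singular values of the sample version of \(\mathcal{K}\). The sketch below uses simulated data and a hypothetical helper `inv_sqrt()` (not part of any package), and compares the result with the base R function `cancor()`.

```r
set.seed(539)
n <- 60
Z <- rnorm(n)
X <- cbind(Z + rnorm(n), rnorm(n), Z + rnorm(n))  # simulated, p = 3
Y <- cbind(Z + rnorm(n), rnorm(n))                # simulated, q = 2

# Hypothetical helper: symmetric inverse square root of a positive definite matrix
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

Sxx_is <- inv_sqrt(cov(X))
Syy_is <- inv_sqrt(cov(Y))
K <- Sxx_is %*% cov(X, Y) %*% Syy_is   # sample version of the matrix K

sv <- svd(K)
rho_svd    <- sv$d                # singular values of K
rho_cancor <- cancor(X, Y)$cor    # canonical correlations from base R

# First pair of coefficient vectors, a = Sxx^{-1/2} gamma and b = Syy^{-1/2} delta
a1 <- Sxx_is %*% sv$u[, 1]
b1 <- Syy_is %*% sv$v[, 1]

rbind(rho_svd, rho_cancor)
```

The left and right singular vectors of \(\mathcal{K}\) are the eigenvectors of \(\mathcal{K}\mathcal{K}^\top\) and \(\mathcal{K}^\top\mathcal{K}\), so a single call to `svd()` delivers both \(\boldsymbol{\gamma}\) and \(\boldsymbol{\delta}\).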
The canonical covariates can be easily reconstructed as follows:
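A minimal sketch of such a reconstruction, assuming the base R function `cancor()` and simulated data in place of the river datasets: the centered data matrices are multiplied by the estimated coefficient matrices.

```r
set.seed(539)
n <- 60
Z <- rnorm(n)
X <- cbind(Z + rnorm(n), rnorm(n), Z + rnorm(n))  # simulated, p = 3
Y <- cbind(Z + rnorm(n), rnorm(n))                # simulated, q = 2

res <- cancor(X, Y)
k <- length(res$cor)   # number of canonical pairs, min(p, q)

# Center the data with the means used by cancor(), then apply the coefficients
U <- sweep(X, 2, res$xcenter) %*% res$xcoef[, 1:k]   # canonical covariates for X
V <- sweep(Y, 2, res$ycenter) %*% res$ycoef[, 1:k]   # canonical covariates for Y

# Correlations of the matching pairs reproduce the canonical correlations
rbind(reconstructed = diag(cor(U, V)), cancor = res$cor)
```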
This corresponds with the output provided by the function
Compare also the estimates of the variance-covariance matrices based on \(\widetilde{\mathcal{X}}\) and \(\widetilde{\mathcal{Y}}\) with the original estimates obtained from the original data \(\mathcal{X}\) and \(\mathcal{Y}\). However, the actual estimated linear combinations of the original covariates in both datasets \(\mathcal{X}\) and \(\mathcal{Y}\) are different. Compare the following:
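One way to carry out such a comparison is sketched below on simulated data (an assumption; the lab itself uses the river datasets). The correlation matrix of the canonical covariates has identity blocks on the diagonal and the canonical correlations in the off-diagonal blocks, and the coefficients returned by `cancor()` differ from those of the theoretical formula \(\boldsymbol{a} = \Sigma_{XX}^{-1/2}\boldsymbol{\gamma}\) only by scale (and possibly sign).

```r
set.seed(539)
n <- 60
Z <- rnorm(n)
X <- cbind(Z + rnorm(n), rnorm(n), Z + rnorm(n))  # simulated, p = 3
Y <- cbind(Z + rnorm(n), rnorm(n))                # simulated, q = 2

res <- cancor(X, Y)
k <- length(res$cor)
U <- sweep(X, 2, res$xcenter) %*% res$xcoef[, 1:k]
V <- sweep(Y, 2, res$ycenter) %*% res$ycoef[, 1:k]

# Correlation structure: original data versus canonical covariates
round(cor(cbind(X, Y)), 3)
round(cor(cbind(U, V)), 3)

# First coefficient vector from the theoretical formula a = Sxx^{-1/2} gamma
ex <- eigen(cov(X), symmetric = TRUE)
ey <- eigen(cov(Y), symmetric = TRUE)
Sxx_is <- ex$vectors %*% diag(1 / sqrt(ex$values)) %*% t(ex$vectors)
Syy_is <- ey$vectors %*% diag(1 / sqrt(ey$values)) %*% t(ey$vectors)
K  <- Sxx_is %*% cov(X, Y) %*% Syy_is
a1 <- as.vector(Sxx_is %*% svd(K)$u[, 1])

# Same direction, different scale: a1 yields unit sample variance,
# whereas cancor() scales the covariates to unit sum of squares
cbind(formula = a1, cancor = res$xcoef[, 1], ratio = a1 / res$xcoef[, 1])
```

In this sketch the ratio is constant across the coefficients and equals \(\pm\sqrt{n - 1}\), which is exactly the difference between the two normalizations.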
Using the ‘CCA’ package one can also take advantage of the graphical tools which are available for the canonical correlation analysis. The idea is to display the maximized correlations between the transformed variables of the dataset \(\mathcal{X}\) and the dataset \(\mathcal{Y}\).
On the left panel, there are two sets of the original covariates from the original datasets (biological metrics and chemical concentrations). On the right panel, there are the original observations (geographical locations) represented on the plane defined by the first two canonical variates. More details can be found, for instance, in this paper.

Questions
Besides the classical canonical correlation approach there are also some useful generalizations, for instance, a regularized version, the so-called regularized canonical correlation analysis. It is meant for scenarios where the number of parameters \(p \in \mathbb{N}\) is greater than the sample size \(n \in \mathbb{N}\) (usually \(p \gg n\)). There is an R function for performing the regularized version of canonical correlations. Another option for performing canonical correlation analysis in R is to install some additional R packages, for instance, the R package ‘vegan’.

Individual tasks
Note
Another commonly known multivariate method very similar to the method of canonical correlations is the redundancy analysis. The idea behind it is very similar: to use the quantitative (numerical) variables available in the first dataset and to maximize the mutual correlations between some appropriate linear combinations of these covariates and some other covariates in the second dataset. Thus, linear combinations are not used for both datasets but only for one of them, and the redundancy analysis can be seen as an asymmetric version of the method of canonical correlations. Analogously, the method can also be seen as a generalization of the linear regression framework where the underlying covariates taken from one dataset play the role of independent variables and their linear combination is used to linearly explain some covariate(s) from the other dataset (dependent covariate(s)). Using the notation from the very beginning, the redundancy analysis can be seen as a maximization problem where one maximizes \[ cor(\boldsymbol{a}^\top\boldsymbol{X}, \boldsymbol{b}^\top\boldsymbol{Y}) \] under the restriction that either \(\boldsymbol{a} = \boldsymbol{e}_i \in \mathbb{R}^p\) or \(\boldsymbol{b} = \boldsymbol{e}_j \in \mathbb{R}^q\), where \(\boldsymbol{e}_i\) is a unit vector with the value of one at position \(i\) and zeros otherwise. There are various packages for the redundancy analysis available in the R software.

2. Correspondence analysis in R

The correspondence analysis (CA) can be seen as a discrete analogy of the canonical correlation analysis (CCA). The correspondence analysis is, therefore, meant for qualitative (discrete) types of data. Similarly to CCA, it also looks for a mutual correlation structure between two discrete random vectors: it looks for dependence patterns among the rows and columns of a contingency table. It can also be seen as a generalization of the principal component analysis or the factor analysis.
From the theoretical point of view, in the correspondence analysis we intend to look at the residuals of each cell of a two-way contingency table. Residuals quantify the difference between the observed data and the data we would expect under the assumption that there is no relationship between the row and column categories (recall the standard \(\chi^2\) test of independence in a contingency table). This is a typical question related to contingency tables in general: the independence between two random vectors, a multinomial random vector \(\boldsymbol{X}\) and another multinomial random vector \(\boldsymbol{Y}\). However, the main question of interest may be even more complex. Indeed, one can be interested in some specific rows (columns respectively) while looking for dependence patterns between some linear combinations of the given categories. In other words, one can be interested in the particular effect of a specific row-column combination on the overall dependence (correlation) structure. The correspondence analysis is also implemented in R.
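The role of the residuals can be illustrated by a hand-made computation in base R. This is only a sketch of the standard textbook construction (singular value decomposition of the standardized residuals), not necessarily what the R command used in the lab does internally, and the contingency table `tab` is invented for illustration.

```r
# Hypothetical 3 x 3 contingency table (counts invented for illustration)
tab <- matrix(c(20, 10,  5,
                10, 15, 10,
                 5, 10, 25), nrow = 3, byrow = TRUE,
              dimnames = list(row = c("r1", "r2", "r3"),
                              col = c("c1", "c2", "c3")))

n  <- sum(tab)
P  <- tab / n          # observed proportions
r  <- rowSums(P)       # row masses
cm <- colSums(P)       # column masses
E  <- outer(r, cm)     # expected proportions under independence

# Standardized (Pearson) residuals of the cells
R <- (P - E) / sqrt(E)

# Total inertia equals the chi-square statistic divided by n
chi2    <- unname(chisq.test(tab, correct = FALSE)$statistic)
inertia <- sum(R^2)

# SVD of the residual matrix: squared singular values split the inertia
sv <- svd(R)
row_coord <- sweep(sv$u, 1, sqrt(r),  "/") %*% diag(sv$d)  # principal row coordinates
col_coord <- sweep(sv$v, 1, sqrt(cm), "/") %*% diag(sv$d)  # principal column coordinates

round(sv$d^2 / inertia, 3)   # share of inertia per dimension
```

Plotting the first two columns of `row_coord` and `col_coord` in one figure gives the usual correspondence analysis map of the row and column categories.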
The output of the correspondence analysis performed by the standard command
The pictures above illustrate the underlying contingency table; however, besides the table itself, the plot also includes some additional information regarding the dependence structure. The cells are not equidistantly spaced and they are also not ordered. For more details see the corresponding R help page.

Individual task
Homework Assignment (Deadline: Before requesting the credit)