NMST539 | Lab Session 6

Principal Component Analysis

(applications and examples in R)

Summer Term 2021/2022 | 29/03/22

source file (UTF8 encoding)

The R software is available for download (free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

  • Bína, V., Komárek, A., and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)



Theoretical references
  • Härdle, W. and Simar, L.: Applied Multivariate Statistical Analysis. Springer, 2015.
  • Mardia, K., Kent, J., and Bibby, J.: Multivariate Analysis. Academic Press, 1979.

1. Principal component analysis / theoretical aspects

Principal component analysis (PCA) is a multivariate exploratory method based on the variance-covariance matrix of some random vector \(\boldsymbol{X} \in \mathbb{R}^{p}\). Of course, instead of working with the (theoretical) vector itself and the (usually unknown) variance-covariance matrix, we use a random sample drawn from the same distribution and the sample version of the variance-covariance matrix instead.

The main idea is to consider \(\ell < p\) principal components instead of the whole \(p \in \mathbb{N}\) dimensional random vector (dimensionality reduction). In other words, we look for a new \(\ell\)-dimensional representation of the original \(p\)-dimensional random vector. The principal components are orthogonal linear combinations of the original covariates and, in addition, they are ordered by decreasing variance. In most cases \(\ell \ll p\) (a significant reduction of the dimensionality).

In general, the principal components can be seen from (at least) four different perspectives:

  • orthogonal directions within the data with the highest proportion of the overall variance;
  • the best linear subspace approximation to the data (projection into a linear subspace);
  • the eigenvalue decomposition of the (sample) variance-covariance matrix;
  • the singular value decomposition of the data matrix.
Statistically speaking, the principal components are linear combinations of the original covariates such that these new covariates are mutually uncorrelated, have zero mean, and their variance-covariance matrix is diagonal, with the eigenvalues of the (theoretical/empirical) variance-covariance matrix on its diagonal.
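This characterization can be verified numerically. The following short sketch (not part of the original lab code; the simulated data and variable names are only illustrative) computes the PC scores by hand and checks that their sample covariance matrix is diagonal with the eigenvalues on the diagonal.

set.seed(1)
X <- matrix(rnorm(600), ncol = 3) %*% matrix(c(2, 1, 0, 0, 1, 1, 0, 0, 0.5), 3, 3)

S      <- cov(X)                                ## sample variance-covariance matrix
Gamma  <- eigen(S)$vectors                      ## principal component directions (eigenvectors)
scores <- scale(X, scale = FALSE) %*% Gamma     ## centered data projected onto the PCs

round(colMeans(scores), 10)                     ## zero means
round(cov(scores), 10)                          ## diagonal variance-covariance matrix ...
eigen(S)$values                                 ## ... with the eigenvalues on the diagonal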



Formally speaking, PCA can be seen as a maximization task, where the first principal component \(\boldsymbol{\delta}_1^\top \boldsymbol{X}\) (i.e., a linear combination of the elements of the vector \(\boldsymbol{X} \in \mathbb{R}^{p}\)), defined by the vector of unknown parameters \(\boldsymbol{\delta}_1 \in \mathbb{R}^p\), is obtained as a solution of the maximization problem \[ \boldsymbol{\delta}_1 = \mathop{\mathrm{Argmax}}_{\{\boldsymbol{\delta};~\|\boldsymbol{\delta}\| = 1 \}} ~~Var (\boldsymbol{\delta}^\top \boldsymbol{X}). \] The second principal component \(\boldsymbol{\delta}_2^\top \boldsymbol{X}\) (again a linear combination of the vector elements) is defined by an analogous maximization task, however, with an additional orthogonality constraint – namely \(\boldsymbol{\delta}_1^\top\boldsymbol{\delta}_2 = 0\), or, alternatively, that \(\boldsymbol{\delta}_1^\top \boldsymbol{X}\) and \(\boldsymbol{\delta}_2^\top \boldsymbol{X}\) are uncorrelated random variables. Thus, it holds that \[ Cov(\boldsymbol{\delta}_1^\top \boldsymbol{X}, \boldsymbol{\delta}_2^\top \boldsymbol{X}) = 0. \]
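The maximization formulation can be illustrated numerically. The sketch below (not part of the original lab code; the function negVar and the simulated data are only illustrative assumptions) maximizes the sample variance of \(\boldsymbol{\delta}^\top \boldsymbol{X}\) over unit-norm vectors \(\boldsymbol{\delta}\) and compares the result with the first eigenvector of the sample covariance matrix (the two should agree up to the sign).

set.seed(1234)
X <- matrix(rnorm(500), ncol = 5)          ## n = 100 observations, p = 5 covariates

## negative sample variance of the linear combination delta'X (delta rescaled to unit norm)
negVar <- function(delta) {
  delta <- delta / sqrt(sum(delta^2))
  -as.numeric(var(X %*% delta))
}

delta1 <- optim(rep(1, 5), negVar, method = "BFGS")$par
delta1 <- delta1 / sqrt(sum(delta1^2))

## comparison with the first eigenvector (equal up to the sign)
cbind(optim = delta1, eigen = eigen(cov(X))$vectors[, 1])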

The principal components can, however, also be seen from a geometrical perspective as specific orthogonal projections into well-defined linear subspaces – such as those used in the linear regression framework or in the total least squares (TLS) approach.



2. Principal components, orthogonal projections, and regression

Let \(\boldsymbol{Y}\) be some random vector of length \(n \in \mathbb{N}\) and, in addition, let us also have some data matrix \(\mathbb{X}\) of type \(n \times p\), where, in general, \(p \in \mathbb{N}\) is considered to be less than \(n\). A common task in regression is to use the information contained in \(\mathbb{X}\) to approximate the \(n\)-dimensional random vector \(\boldsymbol{Y}\) in a lower-dimensional space \(\mathcal{L}(\mathbb{X})\) – the \(p\)-dimensional space generated by the columns of the matrix \(\mathbb{X}\).

If we denote the orthogonal projection of \(\boldsymbol{Y}\) into \(\mathcal{L}(\mathbb{X})\) as \(P_{\mathbb{X}}(\boldsymbol{Y})\), then the linear regression model for regressing \(\boldsymbol{Y}\) on \(\mathbb{X}\), in the form \(P_{\mathbb{X}}(\boldsymbol{Y}) = a_1 \boldsymbol{X}_1 + \dots + a_p \boldsymbol{X}_p\), gives the minimum error \[ \| \boldsymbol{Y} - P_{\mathbb{X}}(\boldsymbol{Y}) \|_2^2. \]

The corresponding projection matrix (which needs to be idempotent) is \[ \mathbb{H} = \mathbb{X}(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top = \mathbb{Q}\mathbb{Q}^\top, \] where \(\mathbb{X}\) contains the basis vectors of \(\mathcal{L}(\mathbb{X})\) in its columns, or, respectively, the columns of \(\mathbb{Q}\) form an orthonormal basis of the same linear space. Obviously, it also holds that \[ P_{\mathbb{X}}(\boldsymbol{Y}) = \mathbb{H}\boldsymbol{Y} = \mathbb{X}(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \boldsymbol{Y}. \]
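These identities are easy to verify directly in R. The following is a minimal sketch (simulated data; the variable names are chosen here only for illustration) checking the idempotence of \(\mathbb{H}\), the factorization \(\mathbb{H} = \mathbb{Q}\mathbb{Q}^\top\), and the fact that \(\mathbb{H}\boldsymbol{Y}\) coincides with the fitted values from lm().

set.seed(42)
n <- 50
X <- cbind(1, rnorm(n))                      ## design matrix with an intercept column
Y <- 1 + 2 * X[, 2] + rnorm(n)

H <- X %*% solve(t(X) %*% X) %*% t(X)        ## hat (projection) matrix
Q <- qr.Q(qr(X))                             ## orthonormal basis of L(X)

max(abs(H %*% H - H))                        ## idempotence: HH = H (up to rounding errors)
max(abs(H - Q %*% t(Q)))                     ## H = QQ'
max(abs(H %*% Y - fitted(lm(Y ~ X[, 2]))))   ## HY equals the fitted values of the regression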

This concept can be easily generalized to any orthogonal projection of \(\boldsymbol{Y}\) into the linear span of the columns of some general \(n \times p\) matrix \(\mathbb{B}\), such that \(\mathbb{B}^\top\mathbb{B} = \mathbb{I}\). Let this projection be denoted as \(P_{\mathbb{B}}(\boldsymbol{Y})\). Then it holds that \(P_{\mathbb{B}}(\boldsymbol{Y}) = \mathbb{B}\mathbb{B}^\top\boldsymbol{Y}\) and, moreover, it can be shown that \[ \|\boldsymbol{Y} - P_{\mathbb{\Gamma}}(\boldsymbol{Y})\|_2^2 \leq \|\boldsymbol{Y} - P_\mathbb{B}(\boldsymbol{Y})\|_2^2, \] where \(\Gamma\) is the \(n\times p\) matrix with the first eigenvectors of the variance-covariance matrix \(\Sigma\) in its columns (see the Eckart and Young (1936) and Mirsky (1960) approximation results).
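A small numerical illustration of this optimality follows (a sketch only, not taken from the original lab; here the rows \(\boldsymbol{Y}_i \in \mathbb{R}^p\) of a simulated data matrix are projected onto a \(k\)-dimensional subspace, in the spirit of the alternating regression formulation below, and the subspace spanned by the leading eigenvectors is compared with a randomly chosen orthonormal basis \(\mathbb{B}\)).

set.seed(2022)
n <- 200; p <- 5; k <- 2
Y  <- matrix(rnorm(n * p), n, p) %*% matrix(runif(p * p), p, p)   ## correlated data (rows Y_i)
Yc <- scale(Y, scale = FALSE)                                     ## centered data

Gamma <- eigen(cov(Yc))$vectors[, 1:k]           ## first k eigenvectors of the covariance matrix
B     <- qr.Q(qr(matrix(rnorm(p * k), p, k)))    ## some other (random) p x k orthonormal matrix

## total squared distance of the observations from their projections
sse <- function(A) sum((Yc - Yc %*% A %*% t(A))^2)
c(eigen = sse(Gamma), random = sse(B))           ## the eigenvector subspace gives the smaller error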



The following example shows three specific orthogonal projections of the random vector \(\boldsymbol{Y}\) into a linear subspace generated by the columns of some (well-defined) matrix of type \(n \times p\). The same data – the random vector \(\boldsymbol{Y}\) – are used for all three projections, but the linear subspace into which \(\boldsymbol{Y}\) is projected differs in each case.

set.seed(1234)
x <- seq(0,1, length = 100)
y <- 0 + 2 * x + rnorm(100)

par(mfrow = c(1,3))
### regression of Y on X
plot(y ~ x, pch = 21, bg = "lightblue", xlab = "X variable", ylab = "Y variable", main = "Projection of Y onto L(X)", ylim = c(-1.5, 4))
m <- lm(y ~ x)
abline(m, col = "red", lwd = 2)

index <- c(10, 60, 80)

for (i in 1:100){
  lines(c(x[i], x[i]), c(y[i], m$fitted[i]), col = "blue", lty =2)
}


### regression of X on Y 
plot(y ~ x, pch = 21, bg = "lightblue", xlab = "X variable", ylab = "Y variable", main = "Projection of X onto L(Y)", ylim = c(-1.5, 4))
m <- lm(x ~ y)
lines((x - m$coeff[1])/m$coeff[2] ~ x, col = "red", lwd = 2)

index <- c(10, 60, 80)

for (i in 1:100){
  lines(c(x[i], m$coeff[1] + m$coeff[2] * y[i]), c(y[i], y[i]), col = "blue", lty =2)
}


### regression onto the first PC

### data standardization
means <- apply(cbind(x,y), 2, mean)
sdErs <- apply(cbind(x,y), 2, sd)
cd <- cbind((x - means[1])/sdErs[1], (y - means[2])/sdErs[2])

### first eigen vector
firstEigenVector <- eigen(cov(cd))$vectors[,1]

### projection onto the linear space generated by the first eigen vector
project2pc1 <- t(firstEigenVector) %*% t(cd)

### rotation back to the original coordinates
rotation <- as.numeric(project2pc1) %*% t(firstEigenVector)

### rescaling to the original mean/variance
rescaled <- cbind(rotation[,1] * sdErs[1] + means[1], rotation[,2] * sdErs[2] + means[2])

plot(y ~ x, pch = 21, bg = "lightblue", xlab = "X variable", ylab = "Y variable", main = "Projection onto 1.PC", ylim = c(-1.5,4), xlim = c(0, 1))

### first PC
lines(rescaled[,2] ~ rescaled[,1], col = "red", lwd = 2)

### projections onto the first PC (orthogonal in the standardized coordinates, not after rescaling back)
for (i in 1:100){
  lines(c(x[i], rescaled[i,1]), c(y[i], rescaled[i,2]), col = "blue", lty =2)
}

Alternatively, instead of finding the spectral decomposition of the (theoretical/empirical) variance-covariance matrix \(\Sigma\), we can use the idea of a standard linear regression approach and alternate between two different regression problems. For some data \(\boldsymbol{Y}_1, \dots, \boldsymbol{Y_n} \in \mathbb{R}^p\) we have \[ \sum_{i = 1}^n \|\boldsymbol{Y}_i - P_\mathbb{B}(\boldsymbol{Y}_i)\|_2^2 = \sum_{i = 1}^n \|\boldsymbol{Y}_i - \mathbb{B}\mathbb{B}^\top\boldsymbol{Y}_i\|_2^2 = \sum_{i = 1}^n \|\boldsymbol{Y}_i - \mathbb{B} \boldsymbol{v}_i\|_2^2 = \sum_{i = 1}^{n}\sum_{j = 1}^{p} (Y_{i j} - \boldsymbol{b}_j^\top\boldsymbol{v}_{i})^2, \] where \(\boldsymbol{v}_i = \mathbb{B}^\top\boldsymbol{Y}_i\), for \(i = 1, \dots, n\), and \(\boldsymbol{b}_j\) is the corresponding row of \(\mathbb{B}\). However, neither \(\mathbb{B}\) nor the vectors \(\boldsymbol{v}_i\) are known. Therefore, we need to minimize the term on the right-hand side with respect to some orthogonal matrix \(\mathbb{B}\), such that \(\mathbb{B}^\top\mathbb{B} = \mathbb{I}\), and some vectors \(\boldsymbol{v}_1, \dots, \boldsymbol{v}_n\). This can be easily done by an alternating regression approach.

The expression \[ \sum_{i = 1}^{n}\sum_{j = 1}^{p} [Y_{i j} - \boldsymbol{b}_j^\top\boldsymbol{v}_i]^2 \] is minimized with respect to \(\mathbb{B}\) (in the first step), subject to the constraint \(\mathbb{B}^\top\mathbb{B} = \mathbb{I}\), and it is minimized with respect to the vectors \(\boldsymbol{v}_i\) (in the second step). The two steps are alternated until a sufficient precision (convergence) is reached.

The main idea is to find the best orthogonal projection into the linear subspace generated by the columns of the orthogonal matrix \(\mathbb{B}\), such that the sum of squared errors is as small as possible. From linear regression theory it is known that, under the additional assumption of linearity, the best approximation is given by the least squares fit within the linear regression framework.

However, from a general point of view (without using the linearity restriction) it is possible to obtain a better (orthogonal) projection. If the matrix \(\mathbb{B}\) contains the first \(p \in \mathbb{N}\) principal component directions in its columns, it defines the best orthogonal projection of the vector \(\boldsymbol{Y}\) into a \(p\)-dimensional linear subspace (considering, of course, the least squares criterion).

For a brief illustration see the next example (by Matías Salibián-Barrera).

alter.pca.k1 <- function(x, max.it = 500, eps=1e-10) {
  n2 <- function(a) sum(a^2)
  p <- dim(x)[2]
  x <- scale(x, scale=FALSE)
  it <- 0
  old.a <- c(1, rep(0, p-1))
  err <- 10*eps
  while( ((it <- it + 1) < max.it) & (abs(err) > eps) ) {
    b <- as.vector( x %*% old.a ) / n2(old.a)
    a <- as.vector( t(x) %*% b ) / n2(b)
    a <- a / sqrt(n2(a))
    err <- sqrt(n2(a - old.a))
    old.a <- a
  }
  conv <- (it < max.it)
  return(list(a=a, b=b, conv=conv))
}


set.seed(1234)
n <- 20
p <- 5
x <- matrix(rt(n*p, df=2), n, p)

And we can obtain the first principal component by adopting the alternating regression approach:

alter.pca.k1(x)$a
## [1]  0.08363314 -0.95027213 -0.01427383  0.11629502 -0.27615231

which can be directly compared with the eigenvector itself (obtained by the singular value decomposition or, equivalently, by the eigenvalue decomposition):

svd(cov(x))$u[,1]
## [1] -0.08363314  0.95027213  0.01427383 -0.11629502  0.27615231
eigen(cov(x))$vectors[,1]
## [1] -0.08363314  0.95027213  0.01427383 -0.11629502  0.27615231



For some practical applications it can be useful to investigate the empirical performance of both approaches. This can be effectively done, for instance, by comparing the computational times needed for both approaches (for this purpose we consider a higher dimensional example).

n <- 2000
p <- 500
x <- matrix(rt(n*p, df=2), n, p)

system.time( tmp <- alter.pca.k1(x) )
##    user  system elapsed 
##   0.454   0.032   0.488
system.time( e1 <- svd(cov(x))$u[,1] )
##    user  system elapsed 
##   1.126   0.008   1.140



3. Principal component analysis in R

The principal component analysis performed by the R command princomp() uses the eigenvalue decomposition of the sample variance-covariance matrix \(\mathcal{S}\) of some data \(\mathcal{X} = (\boldsymbol{X}_1, \dots, \boldsymbol{X}_p)\) (denoted as \(EIV(\mathcal{S})\)), while the second command, prcomp(), uses the singular value decomposition of the data matrix \(\mathcal{X}\) instead (denoted as \(SVD(\mathcal{X})\)).

Let the eigenvalue decomposition (the singular value decomposition, respectively) be of the form \[ EIV(\mathcal{S}) = \Gamma \Lambda \Gamma^\top \quad \quad \textrm{respectively} \quad \quad SVD(\mathcal{X}) = UDV^\top. \]

Then we easily obtain that both methods are equivalent, since (for the column-centered data matrix \(\mathcal{X}\)) \[ n \mathcal{S} = \mathcal{X}^\top\mathcal{X} = VDU^\top UD V^\top = V D^2 V^\top = \Gamma \widetilde{\Lambda} \Gamma^\top = EIV(n\mathcal{S}), \] for \(\Gamma \equiv V\) and \(\widetilde{\Lambda} \equiv D^2\). Moreover, it holds that \[EIV(n\mathcal{S}) = \Gamma \widetilde{\Lambda} \Gamma^\top = \Gamma (n \Lambda) \Gamma^\top = n \Gamma \Lambda \Gamma^\top = n EIV(\mathcal{S}).\] Thus, if we multiply a square matrix by some constant, the eigenvectors remain the same and only the eigenvalues are multiplied by that constant.
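This equivalence is easy to check numerically. The sketch below (simulated data; not part of the original lab) compares the right singular vectors of the centered data matrix with the eigenvectors of the sample covariance matrix (they agree up to the sign of each column) and relates the squared singular values to the eigenvalues; note that cov() uses the divisor \(n - 1\), whereas the formula above is written with the divisor \(n\).

set.seed(539)
X  <- matrix(rnorm(300), ncol = 3)
Xc <- scale(X, scale = FALSE)            ## column-centered data matrix
n  <- nrow(X)

V     <- svd(Xc)$v                       ## right singular vectors of the centered data
Gamma <- eigen(cov(X))$vectors           ## eigenvectors of the sample covariance matrix
round(abs(V) - abs(Gamma), 10)           ## identical up to the sign of each column

d2     <- svd(Xc)$d^2                    ## squared singular values
lambda <- eigen(cov(X))$values           ## eigenvalues of S (computed with divisor n - 1)
round(d2 / (n - 1) - lambda, 10)         ## identical up to the divisor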

Individual tasks


Principal component analysis relies on the eigendecomposition of the sample variance-covariance matrix. Alternatively, one can use the singular value decomposition of the data matrix itself. However, in both situations it is crucial to understand the role of the eigenvectors and the eigenvalues. See the following examples to get some useful insight.




For illustration purposes we use a data sample which represents 65 river localities in the Czech Republic. For each locality there are 17 biological metrics recorded. Each biological metric describes the qualitative/quantitative status of some biological activity (for instance, phytobenthos, macrobenthos, or fish) at the given locality, expressed in terms of well-defined and internationally recognized indexes. The idea was to compare the quality of freshwater environments across all European Union member states.

The data file can be obtained from the following address:

data <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
attach(data)

One can use a fancy way to visualize the correlation structure within the data (using the library ‘corrplot’, which needs to be installed by install.packages("corrplot") first).

library(corrplot)
PCdata <- data[,-c(1,11,12,14)]
corrplot(cor(PCdata), method="ellipse")

Another nice alternative can be obtained using the qgraph() command from the library ‘qgraph’ (again, the library needs to be installed first, by the command install.packages("qgraph")).

library("qgraph")
## Warning: package 'qgraph' was built under R version 4.0.5
qgraph(cor(PCdata))



princomp() vs. prcomp()

There are various implementations of PCA available in the R environment: the standard commands are prcomp() and princomp(), which are both available in the classical R installation (library ‘stats’). While the first one uses the singular value decomposition of the data matrix (SVD), the second one uses the spectral decomposition of the sample variance-covariance (correlation) matrix instead. For additional details see the help sessions of both functions (type ?prcomp and ?princomp).

We can compare two results obtained by the two commands mentioned above:

PCdata <- data[,-c(1,11,12,14)]
summary(prcomp(PCdata))
## Importance of components:
##                            PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     23.0661 8.1033 4.50909 3.42042 2.17783 1.82627 1.65944
## Proportion of Variance  0.8273 0.1021 0.03162 0.01819 0.00738 0.00519 0.00428
## Cumulative Proportion   0.8273 0.9294 0.96104 0.97923 0.98661 0.99180 0.99608
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.37629 0.60761 0.42677 0.22385 0.13114 0.09806 0.02037
## Proportion of Variance 0.00295 0.00057 0.00028 0.00008 0.00003 0.00001 0.00000
## Cumulative Proportion  0.99902 0.99960 0.99988 0.99996 0.99998 1.00000 1.00000
summary(princomp(PCdata))
## Importance of components:
##                            Comp.1    Comp.2    Comp.3     Comp.4      Comp.5
## Standard deviation     22.8879807 8.0407225 4.4742688 3.39400594 2.161016463
## Proportion of Variance  0.8273203 0.1021054 0.0316157 0.01819215 0.007375218
## Cumulative Proportion   0.8273203 0.9294257 0.9610414 0.97923357 0.986608787
##                             Comp.6      Comp.7     Comp.8      Comp.9
## Standard deviation     1.812162414 1.646620645 1.36566514 0.602917613
## Proportion of Variance 0.005186244 0.004281992 0.00294542 0.000574083
## Cumulative Proportion  0.991795031 0.996077023 0.99902244 0.999596525
##                             Comp.10      Comp.11      Comp.12      Comp.13
## Standard deviation     0.4234770324 2.221225e-01 1.301262e-01 9.730192e-02
## Proportion of Variance 0.0002832164 7.791911e-05 2.674164e-05 1.495208e-05
## Cumulative Proportion  0.9998797419 9.999577e-01 9.999844e-01 9.999994e-01
##                             Comp.14
## Standard deviation     2.021332e-02
## Proportion of Variance 6.452592e-07
## Cumulative Proportion  1.000000e+00

The results are, however, not invariant with respect to scale transformations of the data (e.g., standardization of the columns).

PCdata <- scale(PCdata, center = TRUE, scale = TRUE)
summary(prcomp(PCdata))
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.9922 1.3774 1.2203 0.83936 0.57503 0.38570 0.37509
## Proportion of Variance 0.6395 0.1355 0.1064 0.05032 0.02362 0.01063 0.01005
## Cumulative Proportion  0.6395 0.7750 0.8814 0.93170 0.95532 0.96594 0.97599
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.31946 0.25953 0.24952 0.21451 0.19375 0.11730 0.08427
## Proportion of Variance 0.00729 0.00481 0.00445 0.00329 0.00268 0.00098 0.00051
## Cumulative Proportion  0.98328 0.98809 0.99254 0.99583 0.99851 0.99949 1.00000
summary(princomp(PCdata))
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.9690582 1.3667152 1.2108763 0.83287359 0.57059435
## Proportion of Variance 0.6395033 0.1355069 0.1063665 0.05032265 0.02361893
## Cumulative Proportion  0.6395033 0.7750102 0.8813767 0.93169932 0.95531825
##                            Comp.6     Comp.7      Comp.8      Comp.9    Comp.10
## Standard deviation     0.38272054 0.37219820 0.316989404 0.257526895 0.24759625
## Proportion of Variance 0.01062598 0.01004972 0.007289451 0.004811168 0.00444727
## Cumulative Proportion  0.96594423 0.97599395 0.983283402 0.988094570 0.99254184
##                            Comp.11     Comp.12      Comp.13      Comp.14
## Standard deviation     0.212856261 0.192254476 0.1163912359 0.0836145071
## Proportion of Variance 0.003286837 0.002681379 0.0009827565 0.0005071876
## Cumulative Proportion  0.995828677 0.998510056 0.9994928124 1.0000000000

Compare the standard deviation values with the square roots of the eigenvalues of the sample variance-covariance matrix.

sqrt(eigen(var(PCdata))$values)
##  [1] 2.99216408 1.37735123 1.22029957 0.83935520 0.57503484 0.38569896
##  [7] 0.37509473 0.31945628 0.25953103 0.24952310 0.21451275 0.19375064
## [13] 0.11729702 0.08426521

Can you explain why the standard deviation values provided by the two procedures (prcomp() and princomp()) are different? Why do the proportions of variance and the cumulative proportions nevertheless coincide? Check the help page help(princomp) and focus on the standardization used when calculating the sample variance-covariance matrix: what divisor is used in each of the two functions?
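As a hint (not part of the original text): princomp() computes the sample covariance matrix with the divisor \(n\), while prcomp() uses \(n - 1\); the proportions of variance are ratios of variances, so the divisor cancels out. A quick numerical check:

n <- nrow(PCdata)
sd.princomp <- princomp(PCdata)$sdev
sd.prcomp   <- prcomp(PCdata)$sdev

## rescaling by sqrt((n - 1) / n) maps one set of standard deviations onto the other
max(abs(sd.princomp - sd.prcomp * sqrt((n - 1) / n)))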

4. Dimension (information) reduction

A simple (visual) example of how the PC analysis actually works (and how it reduces dimensions/information) can be obtained using the following illustration.
  • Firstly, install the package ‘jpeg’ (install.packages("jpeg")) and load it by

    library("jpeg")
  • Download a sample image from the following address: http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/foto.jpg and save it to your working environment;
  • Load the picture into the R software by using the following command (with the right working directory):

    foto <- readJPEG("foto.jpg")
  • Decompose the picture into its three RGB channels and apply the PC analysis to each of them:

    r <- foto[,,1]
    g <- foto[,,2]
    b <- foto[,,3]
    
    foto.r.pca <- prcomp(r, center = FALSE)
    foto.g.pca <- prcomp(g, center = FALSE)
    foto.b.pca <- prcomp(b, center = FALSE)
    
    rgb.pca <- list(foto.r.pca, foto.g.pca, foto.b.pca)
  • Now, we will reconstruct the original picture using 3, 10, 20, 50, 100 and 500 components (out of 682 columns altogether).

    for (i in c(3,10,20,50,100,500)) {
      pca.img <- sapply(rgb.pca, function(j) {
        compressed.img <- j$x[,1:i] %*% t(j$rotation[,1:i])
      }, simplify = 'array')
      
      writeJPEG(pca.img, paste("foto_", i, "_components.jpg", sep = ""))
    }
  • Compare the compressed images with the original photo and see how much dimensionality reduction (loss of information in each compressed picture, respectively) is present in each figure;




  • The image compression (a lower-dimensional representation with respect to various numbers of principal components) can now be used to compare the original image with some other one in order to assess some mutual (important) similarities (image processing).



  • The PCs obtained above were based on the assumption that all three color channels are equally important. This, however, may not be true; therefore, it is also possible to combine all three color channels into just one data matrix and to run the PC analysis on the overall image (one possible sketch is given below).
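A minimal sketch of this idea (not part of the original lab code; it assumes the channel matrices r, g, b and the ‘jpeg’ library from the steps above, and the column-binding of the three channels is just one of several reasonable ways to combine them):

## stack the three color channels column-wise into a single data matrix
foto.all <- cbind(r, g, b)                                ## dimension: height x (3 * width)
foto.all.pca <- prcomp(foto.all, center = FALSE)

## reconstruction from the first k components, split back into the three channels
k <- 50
approx.all <- foto.all.pca$x[, 1:k] %*% t(foto.all.pca$rotation[, 1:k])
nc <- ncol(r)
pca.img <- array(c(approx.all[, 1:nc],
                   approx.all[, (nc + 1):(2 * nc)],
                   approx.all[, (2 * nc + 1):(3 * nc)]),
                 dim = c(nrow(r), nc, 3))

## clip to [0, 1] before writing the JPEG file
writeJPEG(pmin(pmax(pca.img, 0), 1), "foto_joint_50_components.jpg")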


5. Graphical representations of PCs

First of all, it is useful to know the amount of variability (the proportion of the total variability) which is explained by the first few principal components.

pc <- princomp(PCdata)
par(mfrow = c(1,2))
plot(pc, type = "l", col = "red", main = "")
plot(pc, col = "red", main = "")

summary(pc)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.9690582 1.3667152 1.2108763 0.83287359 0.57059435
## Proportion of Variance 0.6395033 0.1355069 0.1063665 0.05032265 0.02361893
## Cumulative Proportion  0.6395033 0.7750102 0.8813767 0.93169932 0.95531825
##                            Comp.6     Comp.7      Comp.8      Comp.9    Comp.10
## Standard deviation     0.38272054 0.37219820 0.316989404 0.257526895 0.24759625
## Proportion of Variance 0.01062598 0.01004972 0.007289451 0.004811168 0.00444727
## Cumulative Proportion  0.96594423 0.97599395 0.983283402 0.988094570 0.99254184
##                            Comp.11     Comp.12      Comp.13      Comp.14
## Standard deviation     0.212856261 0.192254476 0.1163912359 0.0836145071
## Proportion of Variance 0.003286837 0.002681379 0.0009827565 0.0005071876
## Cumulative Proportion  0.995828677 0.998510056 0.9994928124 1.0000000000



Individual tasks


  • Use the output above and plot a figure where one can clearly see the proportions of explained variability for each component.
  • Similarly, from the output above, create a figure with the cumulative proportions of the explained variability.
  • Apply a similar approach (PCA) to the iris dataset available in the R environment (use data(iris)).


There are also other options to visualize the results. A standard graphical tool is available through the command biplot(), which is available in the standard R installation. This function automatically projects the data onto the first two PCs. Other principal components can also be used in the figure (parameter ‘choices = 1:2’).

par(mfrow = c(1,2))
biplot(pc, scale = 0.5, choices = 1:2, cex = 0.8)
biplot(pc, scale = 0.5, choices = 1:2, xlim = c(-1,1), ylim = c(-1, 1), cex = 0.8)

A fancier version can be obtained by installing the ggbiplot package from the GitHub repository (use the command install_github("ggbiplot", "vqv") from the ‘devtools’ package). After loading the library one can use the ggbiplot() command (and its additional parameters).

library("ggbiplot")
plot1 <- ggbiplot(princomp(PCdata), ellipse = T, groups = Lital, circle = T)
plot2 <- ggbiplot(princomp(PCdata), ellipse = T, groups = Bindex, circle = T)

library("grid")   ## viewport() and grid.layout() come from the 'grid' package
grid.newpage()
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plot1, vp = vplayout(1, 1))
print(plot2, vp = vplayout(1, 2))



Another type of plot can be obtained using the library ‘ggfortify’ (use install.packages('ggfortify', dep = T) to install the library) and the command autoplot().

library("ggfortify")
autoplot(prcomp(PCdata), data = data,  colour = 'SaprInd', loadings = T)



Last but not least, there is also a complex command fviz_pca() available in the library ‘factoextra’ (use install.packages("factoextra") for installation).

library("factoextra")
### note: the fviz_pca_* functions return ggplot2 objects, so par(mfrow = ...) has no effect here
fviz_pca_ind(pc)

fviz_pca_var(pc)

fviz_pca_biplot(pc)







Homework Assignment

(Deadline: 12.04.2022)

Use the command below to install the SMSdata package with various multivariate datasets:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

Choose one dataset of your choice from the list of all available datasets in the package:

data(package = "SMSdata") 

There are 21 different datasets and you can load each of them by typing and running data(dataset_name), for instance:

data(plasma) 
  • Choose one dataset and apply the principal component analysis;
  • Discuss the right choice of the number of principal components. Provide some graphical output and interpret the results;