NMST539 | Lab Session 9

Multidimensional Scaling with Applications in R

Summer Term 2021/2022 | 19/04/22

source file (UTF8 encoding)

The R software is available for download (free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

  • Bína, V., Komárek, A., and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)



Theoretical references
  • Härdle, W. and Simar, L.: Applied Multivariate Statistical Analysis. Springer, 2015.
  • Mardia, K., Kent, J., and Bibby, J.: Multivariate Analysis. Academic Press, 1979.

1. Multidimensional Scaling (MDS)

Multidimensional scaling is another common statistical technique for high-dimensional data visualization and dimensionality reduction. However, unlike principal component analysis (and, to some extent, factor analysis as well), multidimensional scaling focuses on visualizing similarities/dissimilarities between the observations rather than on the major directions of variability in the data (the overall covariance structure, respectively).

The main idea of multidimensional scaling is to identify meaningful underlying dimensions that allow us to detect existing similarities and dissimilarities between the observed data points. There are, of course, many different ways to define these similarities and dissimilarities. If we choose to measure the similarities and dissimilarities among the given variables in the sense of a classical correlation matrix, we obtain the classical factor analysis approach. On the other hand, if we choose to measure similarities/dissimilarities using the standard Euclidean distance, we end up with principal component analysis. Many other options are, however, possible too.

The starting point for the MDS algorithm is the so-called matrix of distances (similarity/dissimilarity matrix) between all pairs of observations, while the underlying dimensionality of the data under investigation is, in general, unknown. Hence, the main task is to find a set of points in some low dimension (usually as low as possible) which reasonably represents the spatial configuration of the original data.

In some situations the multidimensional scaling approach can also be performed for a similarity/dissimilarity matrix which is not based on a typical distance (see the nonmetric MDS approach below) but, for instance, on some discrete ranks, or a monotone function of the ranks.

In the statistical software R one can use the standard function ‘dist()’ to calculate similarities/dissimilarities (see the help session for further details), which is available under the standard R installation. Such a matrix can subsequently be used for the MDS algorithm, which represents the observations in a (usually) lower-dimensional space in such a way that the original distances are preserved as well as possible.

We will again start with the dataset which represents different river localities in the Czech Republic. The biological diversity is measured by a set of 17 various biological metrics and we are interested in identifying similar localities in the given dataset. The covariate Ntaxon is categorized into three groups for better visualization purposes.

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)

bioData$Ntaxonfac <- cut(bioData$Ntaxon, breaks = c(0, 12, 18, 40), labels = c("low", "medium", "high"))

Dmatrix <- dist(bioData[,2:18]) 
DmatrixScaled <- dist(scale(bioData[,2:18]))

dim(as.matrix(Dmatrix))
## [1] 65 65



Individual tasks


  • What is the right interpretation of the values in the \(\mathbb{D}\) matrix?
  • By default, the R function dist() uses the Euclidean distance to calculate the dissimilarities among the given observations. How exactly are these values calculated?
  • Try to reconstruct the values (at least some of them) in the \(\mathbb{D}\) matrix calculated by the R function dist().

    sqrt(sum((bioData[1,2:18] - bioData[2,2:18])^2))
    ## [1] 20.55428

    And compare it directly with the values in the \(\mathbb{D}\) matrix:

    as.matrix(Dmatrix)[1:2,1:2]
    ##          1        2
    ## 1  0.00000 20.55428
    ## 2 20.55428  0.00000

The matrix of distances of course depends on the scale of the original data. Therefore, multidimensional scaling is not invariant with respect to changes of scale.
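This can be illustrated with a minimal synthetic sketch (not based on the bioData file): rescaling a single column changes the Euclidean distances, while the distances of the standardized data remain unchanged.

```r
# Minimal synthetic illustration: distances depend on the measurement scale.
set.seed(1)
X <- cbind(rnorm(5), rnorm(5))   # two columns on comparable scales
Y <- X
Y[, 2] <- 100 * Y[, 2]           # same data, second column in different units

# The raw distance matrices differ substantially ...
all.equal(as.matrix(dist(X)), as.matrix(dist(Y)))

# ... while after standardization via scale() they coincide again.
all.equal(as.matrix(dist(scale(X))), as.matrix(dist(scale(Y))))
```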



After applying the multidimensional scaling algorithm one can use the output to visualize the original data (not the original data themselves, but rather some lower-dimensional approximation of them). Formally speaking, a set of new artificial data points is usually visualized in the plot (e.g., a scatterplot) such that the similarity/dissimilarity matrices of the original data and the artificial data are as close to each other as possible.

mds <- data.frame(cmdscale(DmatrixScaled))
mds$Ntaxonfac <- bioData$Ntaxonfac   # attach the grouping and the labels
mds$locality  <- bioData[, 1]        # to the plotting data frame

library(ggplot2)
ggplot(mds, aes(x = X1, y = X2)) + 
  geom_label(aes(fill = Ntaxonfac, label = locality))

Above, there is a simple scatterplot of a two-dimensional artificial dataset. The differences between the similarity/dissimilarity matrix of the artificial dataset and that of the original dataset were minimized by the multidimensional scaling algorithm.

In some sense, we can say that no other two-dimensional scatterplot (a scatterplot of two arbitrary covariates from the original dataset) offers a more complete insight into the true data structure.

Note that, formally speaking, no information about the original data dimensionality is provided in the input of the multidimensional scaling algorithm. Indeed, the only input is the given similarity/dissimilarity matrix (of the type \(n \times n\)).
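Even though the dimensionality is not part of the input, it can be recovered from the eigenvalues returned by cmdscale() when called with eig = TRUE. A small synthetic sketch (the same inspection can be applied to Dmatrix):

```r
# A genuinely two-dimensional point cloud: only the first two
# eigenvalues of the classical MDS solution are (numerically) non-zero.
set.seed(42)
Z <- cbind(rnorm(20), rnorm(20))
fit <- cmdscale(dist(Z), k = 2, eig = TRUE)

round(fit$eig, 6)   # eigenvalues 3, 4, ... vanish up to rounding error
fit$GOF             # goodness-of-fit of the 2D representation (here ~ 1)
```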




Multidimensional scaling with the Euclidean metric

Let the distance matrix \(\mathbb{D} = (d_{ij})_{i, j = 1}^{n}\) be defined via the Euclidean metric \[ d_{ij}^2 = \sum_{k = 1}^{p} (x_{i k} - x_{j k})^2, \] for two \(p\)-dimensional points \(\boldsymbol{x}_i = (x_{i1}, \dots, x_{ip})^\top \in \mathbb{R}^p\) and \(\boldsymbol{x}_j = (x_{j1}, \dots, x_{jp})^\top \in \mathbb{R}^p\). Let us define the following quantities: \[ a_{ij} = - \frac{d_{ij}^2}{2}, \quad\quad a_{i \bullet} = \frac{1}{n}\sum_{j = 1}^{n} a_{ij}, \quad\quad a_{\bullet j} = \frac{1}{n}\sum_{i = 1}^{n} a_{ij}, \quad\quad a_{\bullet \bullet} = \frac{1}{n^2} \sum_{i = 1}^n \sum_{j = 1}^n a_{ij}. \] Using the notation above, prove that the following expressions hold:

  • For the matrix \(\mathbb{B} = (b_{ij})_{i,j = 1}^n\), where \(b_{ij} = a_{ij} - a_{i \bullet} -a_{\bullet j} + a_{\bullet \bullet}\), it holds that \(\mathbb{B} = \mathcal{X}\mathcal{X}^\top\), where \(\mathcal{X}\) is the (column-centered) data matrix of the type \(n \times p\), \(\mathcal{X} = (\boldsymbol{x}_1, \dots, \boldsymbol{x}_n)^\top\), with \(\boldsymbol{x}_i = (x_{i1}, \dots, x_{ip})^\top \in \mathbb{R}^p\).

  • Show that \(b_{ii}= a_{\bullet \bullet} - 2a_{i \bullet}\), for \(i = 1, \dots, n\).

  • Show that \(\mathbb{B} = -\frac{1}{2} \mathcal{H} \mathbb{D}^2 \mathcal{H}\), where \(\mathcal{H}\) is the centering matrix defined as \(\mathcal{H} = (\mathbb{I}_n - \frac{1}{n} \boldsymbol{1}_n\boldsymbol{1}_n^\top)\).
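The identities above can also be checked numerically (a sketch, not a proof; recall that \(\mathbb{B} = \mathcal{X}\mathcal{X}^\top\) requires the columns of \(\mathcal{X}\) to be centered):

```r
# Numerical check of B = X X' and B = -(1/2) H D^2 H on random data.
set.seed(7)
n <- 6; p <- 3
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)

D2 <- as.matrix(dist(X))^2            # matrix of squared Euclidean distances
H  <- diag(n) - matrix(1 / n, n, n)   # centering matrix H = I - (1/n) 1 1'
B  <- -0.5 * H %*% D2 %*% H

max(abs(B - X %*% t(X)))              # zero up to rounding error
```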


Various measures of similarities and dissimilarities in R

There are various distance definitions which can be used to calculate mutual similarities/dissimilarities between pairs of observations. Considering the R function dist(), one can take advantage of the following distances:

  • Euclidean distance - used by default;
  • Maximum distance - equivalent to the supremum (maximum) norm;
  • Manhattan distance - absolute distance between the two vectors (the \(L_1\) norm);
  • Canberra distance - a weighted version of the Manhattan distance;
  • Binary distance - the proportion of non-zero elements in \(\boldsymbol{x}−\boldsymbol{y}\);
  • Minkowski distance - the classical \(L_p\) norm (the default choice \(p=2\) reduces to the Euclidean distance).
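The methods can be compared directly on a small example: two vectors whose componentwise differences are (3, 4, 0).

```r
# One pair of vectors under the different dist() methods.
x <- rbind(c(1, 2, 3),
           c(4, 6, 3))        # componentwise differences: 3, 4, 0

dist(x)                       # euclidean: sqrt(3^2 + 4^2) = 5
dist(x, method = "maximum")   # max(|3|, |4|, |0|) = 4
dist(x, method = "manhattan") # |3| + |4| + |0| = 7
dist(x, method = "canberra")  # 3/(1+4) + 4/(2+6) + 0/(3+3) = 1.1
dist(x, method = "minkowski") # p = 2 by default, i.e. 5 again
```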



Let us get back to the dataset with the biological metrics measured in different localities in the Czech Republic. If we stick with the Euclidean metric (the default option) for calculating the distances (similarities, respectively dissimilarities) between different localities (with respect to the 17 available biological metrics), we already have the corresponding matrix stored in the R object Dmatrix.

A standard function in the R environment which performs the multidimensional scaling of the dataset is cmdscale(). It is available under the standard R installation.

MDS1 <- cmdscale(Dmatrix, k = 2)
dim(MDS1)
## [1] 65  2
MDS2 <- cmdscale(DmatrixScaled, k = 2)
dim(MDS2)
## [1] 65  2

What do the results provided above actually represent? What is the corresponding interpretation?
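One useful interpretation: for the Euclidean distance, the classical MDS coordinates coincide with the principal component scores up to the signs of the coordinate axes. A synthetic sketch (the same comparison can be made between cmdscale(Dmatrix) and prcomp(bioData[, 2:18])):

```r
# Classical MDS on Euclidean distances reproduces the PCA scores,
# up to a possible sign flip in each coordinate (synthetic data).
set.seed(11)
Z <- matrix(rnorm(30 * 4), 30, 4)

mdsZ <- cmdscale(dist(Z), k = 2)
pcaZ <- prcomp(Z)$x[, 1:2]       # prcomp centers the data by default

max(abs(abs(mdsZ) - abs(pcaZ)))  # zero up to rounding error
```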

We can use some scatterplot options to get some insight into the data; however, only very limited information can be conveyed by means of 2D scatterplot figures. For instance, see the following plots:

par(mfrow = c(2,2))
cols <- rep("blue", dim(bioData)[1])
cols[which(bioData[,1] == "Olse")] <- "red"
cols[which(bioData[,1] == "DesnSudk")] <- "green"
cols[which(bioData[,1] == "DyjeVran")] <- "green"


plot(bioData[,2], bioData[,3], xlab="SaprInd", ylab="Lital",  type="n", main = "Czech River Localities")
text(bioData[,2], bioData[,3], labels = bioData[,1], cex=.5, col = cols)

plot(bioData[,2], bioData[,11], xlab="SaprInd", ylab="Ntaxon",  type="n", main = "Czech River Localities")
text(bioData[,2], bioData[,11], labels = bioData[,1], cex=.5, col = cols)

plot(bioData[,2], bioData[,17], xlab="SaprInd", ylab="MDS_1",  type="n", main = "Czech River Localities")
text(bioData[,2], bioData[,17], labels = bioData[,1], cex=.5, col = cols)

plot(bioData[,17], bioData[,18], xlab="MDS_1", ylab="MDS_2",  type="n", main = "Czech River Localities")
text(bioData[,17], bioData[,18], labels = bioData[,1], cex=.5, col = cols)

Plotting the original data using two dimensions only (but still preserving the original distances as well as possible) can be done by the following commands:

plot(MDS1[,1], MDS1[,2], xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities")
text(MDS1[,1], MDS1[,2], labels = bioData[,1], cex=.7, col = cols)



The same plot, however, now based on the scaled dataset:

plot(MDS2[,1], MDS2[,2], xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities")
text(MDS2[,1], MDS2[,2], labels = bioData[,1], cex=.7, col = cols)

The results are clearly different for the multidimensional scaling algorithm based on the original data and on the scaled dataset. On the other hand, the overall interpretation of the similarities and dissimilarities among the given geographical localities is roughly equivalent (this, however, may not be the case in general). Note that the plot itself does not take into account any geographical distances between the given localities. The only information used relates to the biological diversity of the localities.

Considering, for instance, the locality denoted as Olše, the most similar (other) localities seem to be (based on the multidimensional scaling algorithm) the localities denoted as Moravka, Tyra, and VidnVidn – for both the scaled and the unscaled version of the data.




Individual tasks

  • Use the centered data and reconstruct the results again. Compare the results also with those obtained after applying principal component analysis (PCA).
  • Use the same dataset on different localities in the Czech Republic and calculate the similarity/dissimilarity matrix using some other distance (metric) available in dist(). Plot the outputs and compare the results.
  • Also load the dataset with the chemical characteristics of the same river localities and try to perform multidimensional scaling on the chemical dataset (note that one locality is missing in the chemical dataset). Use the following command to load the corresponding dataset with chemical measurements:

    chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)
  • Can you identify some localities which are similar with respect to both the chemical and the biological characteristics?



chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)
ind <- match(chemData[,1], bioData[,1])
bioData0 <- bioData[ind, ]

MDS3 <- cmdscale(dist(bioData0[,2:18]), k = 2)
MDS4 <- cmdscale(dist(chemData[,2:8]), k = 2)

par(mfrow = c(2,1))
plot(c(MDS3[,1]), c(MDS3[,2]), xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "Czech River Localities", xlim = c(-80, 100), ylim = c(-40, 20))
text(MDS3[,1], MDS3[,2], labels = bioData0[,1], cex=.7)   # labels must come from the matched data (bioData0)

legend(50, -30, legend = c("Bio Metric Measurements"), col = c("black"), pch = 15)

plot(c(MDS4[,1]), c(MDS4[,2]), xlab="Coordinate 1", ylab="Coordinate 2",  type="n", main = "", xlim = c(-80, 100), ylim = c(-40, 20))
text(MDS4[,1], MDS4[,2], labels = chemData[,1], cex=.7, col = "red")   # labels from the chemical dataset itself

legend(50, -30, legend = c( "Chemical Concentrations"), col = c("red"), pch = 15)




Questions to think about

  • Is the representation given in the figures above unique?
  • How representative is the figure above?
  • What kind of information can be effectively extracted from this figure?



2. Classical MDS vs. Nonmetric MDS

The key difference between these two approaches is that the first one uses a matrix of similarities/dissimilarities defined in the sense of a classical distance, while the second one only uses a matrix of ordered ranks of similarities/dissimilarities (an arbitrary monotone function of the distances).

For nonmetric MDS one can use the R function isoMDS(), which is available in the library ‘MASS’. The function performs Kruskal’s non-metric multidimensional scaling.

library(MASS)
MDS2 <- isoMDS(Dmatrix, k=2)
## initial  value 7.466727 
## iter   5 value 6.387632
## iter  10 value 6.118217
## iter  15 value 5.959384
## final  value 5.817192 
## converged

The results can be again plotted using an analogous set of commands:

plot(MDS2$points[,1], MDS2$points[,2], xlab="Coordinate 1", ylab="Coordinate 2", type="n", main = "Czech River Localities")
text(MDS2$points[,1], MDS2$points[,2], labels = bioData[,1], cex=.7)
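The quality of a nonmetric fit is usually assessed via the stress value and a Shepard diagram (MASS::Shepard), which plots the configuration distances against the original dissimilarities together with the monotone regression fit. A self-contained sketch with synthetic data (for the river localities, replace d by Dmatrix):

```r
library(MASS)

# Nonmetric MDS on synthetic data plus a Shepard diagram of the fit.
set.seed(3)
Z <- matrix(rnorm(20 * 5), 20, 5)
d <- dist(Z)
fit <- isoMDS(d, k = 2, trace = FALSE)

sh <- Shepard(d, fit$points)
plot(sh, pch = 20,
     xlab = "Original dissimilarity", ylab = "Configuration distance",
     main = paste("Shepard diagram, stress =", round(fit$stress, 2)))
lines(sh$x, sh$yf, type = "S", col = "red")   # monotone regression fit
```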


Questions

  • Which are the suitable situations for using the nonmetric MDS?
  • In which situations would you prefer the classical MDS (based on a regular distance/metric) instead?



Beside the two functions already mentioned above (cmdscale() and isoMDS()) there are many more options available in different R packages (see, for instance, smacofSym() in the R package ‘smacof’, wcmdscale() in package ‘vegan’ or pco() in package ‘ecodist’).

Homework Assignment

(voluntarily)

Use some appropriate dataset – for instance, from the R library SMSdata, which can be installed by the following command:

install.packages(pkgs="http://www.karlin.mff.cuni.cz/~hlavka/sms2/SMSdata_1.0.tar.gz", repos=NULL, type="source")

The list of all datasets in the R library SMSdata can be obtained by the command:

data(package = "SMSdata")

  • Apply the multidimensional scaling algorithm for at least two different similarity/dissimilarity measures which you find reasonable;
  • Visualize the outputs and briefly discuss the results.