NMST539 | Lab Session 10

Cluster Analysis & Applications in R

Summer Term 2021/2022 | 26/04/22

source file (UTF8 encoding)

The R software is available for download (free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)



Theoretical references
  • Härdle, W. and Simar, L.: Applied Multivariate Statistical Analysis. Springer, 2015.
  • Mardia, K., Kent, J., and Bibby, J.: Multivariate Analysis. Academic Press, 1979.

1. The main principles of cluster analysis

Cluster analysis is a statistical method designed to distinguish various objects, that is, to group them into disjoint sets according to some similarity/dissimilarity measure (similar/homogeneous observations are expected to end up in the same cluster, while different/heterogeneous observations are expected to fall into different clusters). Mutual similarity/dissimilarity is usually measured with respect to some proximity measure or some distance (metric) – see the R function dist() and the corresponding help session. However, cluster analysis is not just one specific method/approach; it is rather a whole set of various tools and algorithms which can be used to solve a general grouping problem by assigning observations into disjoint groups.
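For illustration, the function dist() can be applied directly to a small data matrix; different choices of its ‘method’ argument yield different metrics (a minimal sketch with three artificial points in \(\mathbb{R}^2\)):

X <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)   ## three artificial points in the plane
dist(X)                                        ## Euclidean distance (the default)
dist(X, method = "manhattan")                  ## Manhattan (city-block) distance
dist(X, method = "maximum")                    ## maximum (Chebyshev) distance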

It is also important to understand the principal difference between clustering algorithms and classification approaches (the two are sometimes not distinguished properly).
  • A clustering algorithm runs without any a priori information about the true assignment of objects into the disjoint groups (clusters). The clusters are assumed to be unknown and the aim of the clustering algorithm is to use the mutual similarities/dissimilarities to define the clusters in the output (while the given observations are assigned into the clusters).
  • Classification is usually run on a dataset where the true assignment of observations into disjoint groups is known a priori (so-called labels). The idea of a classification algorithm is to learn the assignment principle from the given data and, later, to apply this knowledge to new observations for which the label is unknown.

From the theoretical point of view we distinguish two types of clustering algorithms: partitioning algorithms, where the assignment of an object to a group can change along the algorithmic path, and hierarchical algorithms, where the assignment of an object is kept fixed once it has been performed.

The result of a clustering algorithm is mainly affected by the choice of the distance/proximity matrix and by the clustering method used for the object assignment (the choice of a specific clustering algorithm). There are many different clustering approaches and algorithms available in various statistical software packages, and many modifications of the existing algorithms are still being proposed. The most commonly known are the following:

  • Partitioning clustering
  • Hierarchical clustering
  • Centroid-based clustering
  • Distribution-based clustering
  • Density-based clustering
  • and many others…



Note


Cluster analysis is mostly considered to be an exploratory tool. In some sense it can also be seen as a dimension reduction technique. From the practical point of view, it is common to determine the number of clusters in such a way that a convenient interpretation of the clusters is available afterwards. Consequently, instead of using the set of original covariates, one can use the information about the cluster assignment (dimension reduction).
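As a small sketch of this idea (using the mtcars data and the functions dist(), hclust() and cutree(), all of which are introduced later in this session), the 11 covariates could be summarized by a single, here arbitrarily chosen, three-level cluster label:

cl <- factor(cutree(hclust(dist(scale(mtcars))), k = 3))   ## one factor instead of 11 covariates
table(cl)                                                   ## cluster sizes
aggregate(mtcars[, c("mpg", "hp", "wt")], by = list(cluster = cl), FUN = mean)   ## cluster-wise means help with the interpretation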



2. Hierarchical clustering

Hierarchical (agglomerative) clustering is based on the ‘bottom-up’ principle: at first, small clusters are formed (one observation per cluster) and, in the subsequent steps of the algorithm, small clusters are merged into bigger ones until just one big cluster (containing all available observations) is created. When merging small clusters into bigger ones, the same principle of mutual similarities/dissimilarities between observations and clusters is used again (e.g., a proximity measure).

Agglomerative hierarchical algorithms are relatively simple. This is mainly due to the fact that the similarity/dissimilarity measure between the new clusters can be calculated from the similarity/dissimilarity measure between the clusters in the previous step of the algorithm. Thus, the input of the algorithm is the distance matrix (distances between all pairs of the original observations).

Hierarchical divisive clustering methods generate a classification in a top-down manner by progressively sub-dividing a single cluster which represents the entire dataset. Hierarchical methods tend to be computationally demanding since a complete hierarchy of partitions has to be built up rather than just a single partition.
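A divisive counterpart is implemented, for instance, in the function diana() from the R package ‘cluster’ (a small sketch only, assuming the package is installed and using the mtcars data introduced below):

library(cluster)
DV <- diana(scale(mtcars))                         ## divisive hierarchical clustering
pltree(DV, main = "Divisive clustering (DIANA)")   ## the corresponding dendrogram
DV$dc                                              ## divisive coefficient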

The output of a clustering algorithm is usually visualized graphically. Still, the main question relates to the number of clusters that should be used.



Let us consider the data on fuel consumption and design characteristics of automobiles on the US market in the 1970s (the dataset ‘mtcars’ available in R):

rm(list = ls())
data <- mtcars
attach(data)
head(data)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1



The data consist of 32 observations (32 different car models available on the US market at that time) and 11 covariates (some of them are continuous, some of them are categorical). For additional information on this dataset use the help session in R (type ?mtcars). Let us consider all covariates and start by calculating the distance matrix. In the following we are using the default option – the Euclidean distance – for the matrix of distances (the similarity/dissimilarity (proximity) measure).

D <- dist(mtcars) ### the Euclidean distance is used by default

Now we can apply the function hclust() (available in the standard R installation) to run a hierarchical (agglomerative) clustering approach based on the proximity (distance) matrix \(\mathbb{D}\). See the help session in R for more details (type ?hclust).

HC1 <- hclust(D)
plot(HC1, xlab = "Observations", ylab = "Proximity measure")

The default clustering method in hclust() is the complete linkage approach, which uses the furthest neighbor to create groups where all points are relatively close to each other. Different agglomerative approaches are obtained with different settings of the ‘method’ parameter; a quick side-by-side comparison of a few of them is sketched after the list below. The following options are available:

  • method = 'single' (nearest neighbor approach)
  • method = 'complete' (furthest neighbor approach)
  • method = 'average' (compromise between the single and complete linkage)
  • method = 'median' (robust version of the average method)
  • method = 'centroid' (approach which facilitates the geometrical distance)
  • method = 'mcquitty' (weighted version of the average method)
  • method = 'ward.D' (joins groups that do not increase too much a given measure of heterogeneity)
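A quick side-by-side comparison of three of these linkage methods applied to the same distance matrix \(\mathbb{D}\) (a sketch; the choice of the three methods is arbitrary):

op <- par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
    plot(hclust(D, method = m), main = paste("Linkage:", m),
         xlab = "Observations", sub = "", cex = 0.6)
}
par(op)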



A slightly fancier version is available in the R package ‘sparcl’, which needs to be installed first (use install.packages('sparcl') for the installation). After loading the library, the corresponding command is ColorDendrogram() (see the help session for more details).
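A minimal sketch of its usage (assuming the package is installed; the value of ‘branchlength’ is just a tuning constant which may need to be adapted to the scale of the distances):

library(sparcl)
y <- cutree(HC1, k = 3)          ## three clusters taken from the hierarchical solution
ColorDendrogram(HC1, y = y, labels = names(y), branchlength = 80,
                main = "Dendrogram with colored leaves")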

In order to mark a prespecified number of groups with respect to the proximity measure we can use the following R code:

plot(HC1, xlab = "Observations", ylab = "Proximity measure")
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

One can also improve the overall impression of the final dendrogram by considering an aligned version with some additional parameter settings.

plot(HC1, xlab = "Observations", ylab = "Proximity measure", hang = -1)
groups <- cutree(HC1, k=3)
rect.hclust(HC1, k=3, border="red")

and even more sophisticated alternatives, e.g.:

cutHC1 = cutree(HC1, 3)
labelColors = c("red", "green", "blue")

# function to get color labels
colLab <- function(n) {
    if (is.leaf(n)) {
        a <- attributes(n)
        labCol <- labelColors[cutHC1[which(names(cutHC1) == a$label)]]
        attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
    }
    n
}
# using dendrapply
clusDendro = dendrapply(as.dendrogram(HC1), colLab)
plot(clusDendro, main = "Cool Dendrogram", type = "triangle")

The same idea can also be used to define the colors of the branches with respect to some additional covariate (for instance, the covariate gear).

local({
  colLab <<- function(n) {
    if(is.leaf(n)) {
      a <- attributes(n)
      ## look up the gear value of this leaf by its label (the car name),
      ## so that the branch color matches the correct row of mtcars
      attr(n, "edgePar") <- c(a$edgePar, list(col = mycols[a$label]))
    }
    n
  }
  mycols <- mtcars$gear
  names(mycols) <- rownames(mtcars)
})

x.clust.dend1 <- dendrapply(as.dendrogram(HC1), colLab)
plot(x.clust.dend1)



Individual tasks


  • Consider the standard clustering function in R (function hclust()) and use various options for calculating the distance matrix \(\mathbb{D}\).
  • Consider different types of distances for the function dist() and apply the agglomerative clustering approach while selecting different methods for the clustering algorithm (various settings of the ‘method =’ parameter when calling the hclust() function).
  • Use the dendrogram pictures and compare the results from different distances and different clustering approaches.
  • How many clusters seem to be a reasonable choice and what would be a natural interpretation of them?



3. Non-hierarchical clustering in R

The most common algorithm used for non-hierarchical clustering is the K-means algorithm. The idea behind the algorithm is to partition the observations into a prespecified number of groups (\(K\) groups – thus the name of the method) such that the sum of squares from the points to the assigned cluster centres is minimized.

Suppose we have \(K\) possible means for the observed values \(\boldsymbol{x}_i \in \mathbb{R}^p\), for \(i = 1,\dots, n\). Theoretically, we can use the formulation \[ E[\boldsymbol{x}_i | k_i] = \boldsymbol{\mu}_{k_{i}}, \] where \(k_{i} \in \{1, \dots, K\}\) and \(K \in \mathbb{N}\) is the total number of clusters. In other words, if \(k_i = 1\), then the corresponding observation \(\boldsymbol{x}_i\) belongs to cluster 1. The unknown mean parameters \(\boldsymbol{\mu}_{k}\), for \(k = 1, \dots, K\), are to be estimated by the algorithm. However, the estimates may change a lot during the run of the algorithm (they clearly depend on the observations which are (or are not) currently assigned to the given cluster). The output of the clustering algorithm is a specific configuration of observations (and clusters) such that, for the given number of clusters, the within-cluster sum of squares \[ \sum_{k =1}^{K} \sum_{i \in \pi(k)} \|\boldsymbol{x}_i - \boldsymbol{\mu}_k\|^2 \] is minimized, where \(\pi(k) = \{i \in \{1, \dots, n\};~E[\boldsymbol{x}_i | k_i = k] = \boldsymbol{\mu}_k\}\) is the set of all observations belonging to cluster \(k \in \{1, \dots, K\}\).

Each mean parameter \(\boldsymbol{\mu}_{k}\), for \(k = 1, \dots, K\), has an associated probability or ‘weight’ in the overall ‘mixture’. For a new observation with unknown cluster assignment we can write \[ E[\boldsymbol{x}] = P(k = 1) \boldsymbol{\mu}_1 + \dots + P(k = K) \boldsymbol{\mu}_{K}. \]

In the R software, the function which performs the non-hierarchical K-means clustering is kmeans(), available in the standard R installation (see the help session for more details). The goal of the algorithm is to find \(K\) clusters (the number of clusters is specified in advance) so that the within-cluster variability is small. However, we do not know the cluster centroids (the K means) \(\boldsymbol{\mu}_k\).

This resembles the classical chicken-and-egg problem: if we knew the right cluster assignment \(k_i\) for each observation \(\boldsymbol{x}_i\), we could easily estimate the corresponding group mean \(\boldsymbol{\mu}_{k}\). On the other hand, if we knew the corresponding cluster means (centroids) \(\boldsymbol{\mu}_k\), for \(k = 1, \dots, K\), we could easily determine the corresponding assignments \(k_i\). The solution is to iterate between choosing the assignment \(k_i\) and estimating the group mean \(\boldsymbol{\mu}_k\).
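A minimal sketch of this iteration (the so-called Lloyd algorithm) written by hand – this is only an illustration of the principle, not the exact implementation used by the R function kmeans():

X <- as.matrix(mtcars)                ## data matrix
K <- 3                                ## prespecified number of clusters
set.seed(1)
centers <- X[sample(nrow(X), K), ]    ## random initial centroids
for (iter in 1:10) {                  ## a fixed number of iterations (ignoring possible empty clusters)
    ## step 1: assign each observation to the nearest centroid
    d2 <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]
    assignment <- apply(d2, 1, which.min)
    ## step 2: recompute each centroid as the mean of its assigned observations
    centers <- apply(X, 2, function(col) tapply(col, assignment, mean))
}
table(assignment)                     ## resulting cluster sizes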



Kmeans <- kmeans(mtcars, centers = 3)
colVector <- as.numeric(Kmeans$cluster)

plot(mtcars$mpg ~ mtcars$hp, bg = colVector, xlab = "Horse Power", ylab = "Miles per Gallon", pch = 21, col = "black")
points(Kmeans$centers[,1] ~ Kmeans$centers[,4], col = 1:3, pch = 8, cex = 2)
text((mtcars$mpg + 1) ~ mtcars$hp,labels=rownames(mtcars), col=colVector, cex = 0.5)



Recall that we are still visualizing a multivariate dataset using just a two-dimensional scatterplot. Such plots do not offer a complete insight and the interpretation may not be straightforward (a scatterplot-matrix sketch below shows several projections at once). Moreover, the impression changes when another pair of (original) covariates is used to obtain the scatterplot, as in the next chunk of code.
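One simple way to inspect several two-dimensional projections at once is a scatterplot matrix with the points colored by the estimated clusters (a sketch using a few selected covariates):

pairs(mtcars[, c("mpg", "hp", "disp", "wt")], col = colVector, pch = 19,
      main = "K-means clusters in several two-dimensional projections")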



plot(mtcars$mpg ~ mtcars$disp, bg = colVector, xlab = "Displacement", ylab = "Miles per Gallon", pch = 19 + cyl/2, col = "black")
points(Kmeans$centers[,1] ~ Kmeans$centers[,3], col = "black", pch = 21, cex = 2, bg = 1:3)
text((mtcars$mpg + 1) ~ mtcars$disp,labels=rownames(mtcars), col=colVector, cex = 0.5)
legend(420, 34, legend = c("4 Cylinders", "6 Cylinders", "8 Cylinders"), pch = c(21, 22, 23), bg = "grey")



Alternative approaches can be found in additional R packages and libraries. For instance, the R library cluster contains the function pam() (see the help session for more details about the implementation), which performs non-hierarchical partitioning clustering (similarly to the K-means algorithm) but also accepts a dissimilarity matrix as the input and minimizes the sum of absolute differences (median-type estimation) rather than the sum of squared differences (mean estimation).

library("cluster")
clusplot(pam(D, 3), dist = D, shade = FALSE, labels = 2, col.clus = "blue", col.p = "red", span = FALSE, main = "", cex = 0.5)
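The estimated medoids (the representative cars) and the corresponding silhouette plot can be inspected as well (a small sketch):

pamFit <- pam(D, 3)                   ## PAM applied to the distance matrix
pamFit$medoids                        ## labels of the three medoids
plot(silhouette(pamFit), main = "Silhouette plot (PAM, K = 3)")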



Some more packages and ideas related to the K-means algorithm can be found, for instance, here:
https://uc-r.github.io/kmeans_clustering



Note


  • The scale of the data matters: if you replace some covariate \(X_{j}\) with \(2 X_{j}\), then the corresponding dimension counts twice as much in determining the distance from the center (and will have more influence on the cluster membership).
  • The recommended approach is to consider standardized data. Then the coordinates of the centers \(\mu_{k j}\), for clusters \(k = 1, \dots, K\) and covariates \(j = 1, \dots, p\), can be interpreted as numbers of standard deviations away from the marginal averages (a small illustration follows this list).
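A small illustration of the scale effect (the raw distances in the mtcars data are dominated by the covariates with the largest scale, e.g., disp and hp; after standardization all covariates contribute comparably):

round(as.matrix(dist(mtcars))[1:3, 1:3], 1)          ## distances on the raw scale
round(as.matrix(dist(scale(mtcars)))[1:3, 1:3], 2)   ## distances after standardization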



Individual task


  • Consider the example above, however, standardize the data first. Run the non-hierarchical K-means algorithm to obtain three clusters and compare the results with the output presented above.



Another example relates to the protein consumption in Europe. The corresponding data can be loaded (and properly scaled) by the following commands:

protein <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/protein.csv", header = T, row.names=1)
sprotein <- scale(protein) 
head(protein)
##                RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts Fr.Veg
## Albania           10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5    1.7
## Austria            8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3    4.3
## Belgium           13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1    4.0
## Bulgaria           7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7    4.2
## Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1    4.0
## Denmark           10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7    2.4

In the following, we will only use the information about the consumption of red and white meat – thus, two covariates only. The covariates are standardized (see above) and the K-means algorithm is applied afterwards.

subset<-sprotein[,(1:2)]
plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein))

We are only interested in the overall consumption of red and white meat in European countries and we want to find countries which are similar with respect to the meat consumption. Let us consider three clusters.

grpMeat <- kmeans(subset, centers = 3, nstart = 10)

plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein), col=grpMeat$cluster+1)
means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)

In all the exercises above we chose the number of clusters – the value \(K\) – a priori. However, if no prior knowledge about \(K\) is available in advance, how should the number of clusters be chosen? There are some recommendations:

  • Most often the clustering methods serve as exploratory tools; therefore, it is recommended to choose \(K\) such that it makes the most sense with respect to the final interpretation.
  • Another approach is to apply a data-based model building technique: build models for a sequence \(K_1 < K_2 < \dots < K_M\) and use some selection tool to choose the best model.
  • We can use some information criterion (BIC, AIC) or a cross-validation technique to evaluate the within sum of squares in the given clusters (an elbow-type sketch of this idea is given after the list). The corresponding number of degrees of freedom is \(K \times p\), where \(K\) is the number of clusters and \(p \in \mathbb{N}\) is the number of dimensions (thus, one mean parameter for each cluster and each dimension).
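A simple sketch of the last idea – an ‘elbow’ plot of the total within-cluster sum of squares for a sequence of values of \(K\) (a common heuristic rather than a formal criterion; the range of \(K\) below is arbitrary):

Kmax <- 8
wss <- sapply(1:Kmax, function(K) kmeans(subset, centers = K, nstart = 10)$tot.withinss)
plot(1:Kmax, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")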



The between-cluster sum of squares for the meat consumption data (using three clusters and the K-means algorithm) can be computed by hand as follows:

mean_overall <- apply(subset, 2, mean)          ## overall (marginal) means
means_between <- (t(means) - mean_overall)^2    ## squared distances of the cluster centers from the overall mean
sum(t(means_between) * grpMeat$size)            ## weighted by the cluster sizes
## [1] 35.96535

The within sum of squares can be visualized by connecting each observation to its assigned cluster center:

grpMeat <- kmeans(subset, centers = 3, nstart = 10)

plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein), col=grpMeat$cluster+1)
means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)

for(i in (1:nrow(protein))){  
    if (grpMeat$cluster[i]==1){
        lines(c(subset[i,1],means[1,1]), c(subset[i,2],means[1,2]),col=2)
    }
    if (grpMeat$cluster[i]==2){
        lines(c(subset[i,1],means[2,1]), c(subset[i,2],means[2,2]),col=3)
    }
    if (grpMeat$cluster[i]==3){
        lines(c(subset[i,1],means[3,1]), c(subset[i,2],means[3,2]),col=4)
    }}

And the between sum of squares corresponds to the (size-weighted) distances of the cluster centers from the overall mean:

plot(subset,pch=19,col=grpMeat$cluster+1,cex=0.2)

means<-grpMeat$centers
points(means,col=c(2,3,4),cex=2,pch=4)
points(mean_overall[1],mean_overall[2],pch=4,lwd=3)

lines(c(mean_overall[1],means[3,1]), c(mean_overall[2],means[3,2]), col=4)
lines(c(mean_overall[1],means[2,1]), c(mean_overall[2],means[2,2]), col=3)
lines(c(mean_overall[1],means[1,1]), c(mean_overall[2],means[1,2]), col=2)

text(means[, 1], means[, 2], labels = grpMeat$size, pos = 3)   ## cluster sizes next to the centers

The total sum of squares is (similarly as in the linear regression approach) decomposed into two parts: the within sum of squares, which can be obtained as

grpMeat$tot.withinss
## [1] 12.03465

the between sum of squares, given as

grpMeat$betweenss
## [1] 35.96535

and, finally, the total sum of squares, which is defined as the sum of the previous two sums of squares:

grpMeat$tot.withinss+grpMeat$betweenss
## [1] 48
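The total sum of squares is also stored directly in the kmeans() output, so the decomposition can be verified as follows; the plot below then visualizes the total sum of squares as the distances of the observations from the overall mean.

all.equal(grpMeat$totss, grpMeat$tot.withinss + grpMeat$betweenss)   ## should be TRUE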
plot(subset,pch=19,col=4,cex=0.2)
text(subset,labels=rownames(protein), col=grpMeat$cluster+1)
means<-apply(subset, 2, mean)
points(means[1], means[2],bg="black",cex=2,pch=21)

for(i in (1:nrow(subset))){  
    lines(c(subset[i,1],means[1]), c(subset[i,2],means[2]),col="black", lty = 3)
}



Individual Task


  • Use different starting values for the K-means algorithm and compare the results.
  • Explore some other methodologies and approaches which are available in different R packages and R functions.









Homework Assignment

(Deadline: Lab Session 11 / 03.05.2022)


  • Use some appropriate data set (ideally the same one you used for your previous homework assignments) and apply the K-means algorithm with some reasonable choice for the number of clusters.
  • Use some hierarchical agglomerative clustering algorithm with the same number of clusters as for the K-means algorithm and compare the results. Provide at least some graphical output and a brief interpretation.
  • Add your solution to your personal web page by Monday, 02.05.2022, 23:59 (CET) at the latest.