NMST539 | Lab Session 12

Discriminant Analysis & Classification

Summer Term 2021/2022 | 03/05/22

source file (UTF8 encoding)

The R software is available for download (free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

  • Bína, V., Komárek, A. and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)



Theoretical references
  • Härdle, W. and Simar, L.: Applied Multivariate Statistical Analysis. Springer, 2015.
  • Mardia, K., Kent, J., and Bibby, J.: Multivariate Analysis. Academic Press, 1979.

1. Classification via discriminant analysis

Discriminant analysis is a (not necessarily multivariate) statistical method designed for allocating various objects into similar (homogeneous) groups. The idea is to use the available information (given covariates) to create a classification rule which can later be used to assign new objects into one of the existing groups – so-called subject classification. Geometrically speaking, discriminant analysis can be seen as a specific partition of the whole sample space – usually \(\mathbb{R}^p\). Recall the main difference between clustering algorithms and classification approaches:

  • Classification is applied to a dataset in which the groups of subjects (observations) – the so-called labels – are already known. The classification algorithm uses the data to learn some principle which allows it to allocate new observations into the given groups.
  • Clustering algorithms use no prior information about the underlying groups (no labels). The groups are unknown and the algorithm tries to identify them by considering mutual similarities/differences among the given observations.

As mentioned above, the classification algorithm tries to learn some underlying principle – a discrimination rule – to distinguish heterogeneous observations (which should ideally be classified into different groups) and to classify rather similar observations together. There are, of course, many different ways to define a reasonable discrimination rule. Such rules can be based, for instance, on prior probabilities (the Bayes rule), on the likelihood approach, or on more complex algorithms (e.g., neural networks).

Using linear regression jargon, discriminant analysis can also be seen as a counterpart of the linear regression model (continuous dependent variable and continuous or discrete predictor variables): it is used mainly to analyze data where the criterion of interest (i.e., the dependent variable) is categorical and the predictors are continuous (or discrete) covariates. The term categorical variable means that the dependent variable is divided into a specific number of distinct and disjoint categories.



In practical applications, the main focus is on linear discriminant analysis (LDA) – a discrimination rule based on the maximum likelihood approach for two multivariate normal populations with the same variance-covariance matrix. It is a statistical method meant to find a linear combination of features (usually a set of continuous random variables) that characterizes or separates two (or more) classes of objects. The resulting linear combination is used as a linear classifier. Linear discriminant analysis is also closely related to the analysis of variance (ANOVA) and to regression analysis, which both attempt to express one dependent variable as a linear combination of other features or measurements. However, unlike ANOVA, which uses categorical independent variables and a continuous dependent variable, discriminant analysis works the other way around – it uses continuous independent variables and a categorical dependent variable.

Alternatively, the logistic regression approach is more similar to LDA than ANOVA is, as it also explains a categorical variable by a set of continuous (or discrete) independent variables. However, GLM methods are preferred in applications where it is not reasonable to assume that the covariates are normally distributed, which is a fundamental assumption of the LDA method. A straightforward extension of LDA is the quadratic discrimination rule – a discrimination rule again based on the maximum likelihood approach for two multivariate normal populations, however, with different variance-covariance matrices.

If the normality assumption is too restrictive, Fisher’s approach is usually adopted instead. It finds the linear discrimination rule in such a way that the ratio of the between-class variability to the within-class variability is maximized. This method does not even require homoscedastic errors.

Suppose that two groups of observations have their means denoted by \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu}_2\) and the corresponding variance-covariance matrices are \(\Sigma_1\) and \(\Sigma_2\). Thus, there are observations \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_n \sim (\boldsymbol{\mu}_1, \Sigma_1)\) from the first group (only the mean and the variance-covariance matrix are specified) and observations \(\boldsymbol{Y}_1, \dots, \boldsymbol{Y}_m \sim (\boldsymbol{\mu}_2, \Sigma_2)\) from the second group (the dimension of the observations is the same in both groups, thus \(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2 \in \mathbb{R}^p\)).

Fisher’s linear separation measure between these two groups (the corresponding distributions, respectively) is defined as the ratio of the variance between the groups to the variance within the groups, that is \[ S = \frac{\sigma_{between}^2}{\sigma_{within}^2} = \frac{(\boldsymbol{w}^\top \boldsymbol{\mu}_2 - \boldsymbol{w}^\top \boldsymbol{\mu}_1)^2}{\boldsymbol{w}^\top (\Sigma_{1} + \Sigma_2) \boldsymbol{w} }. \]

The idea is to find an appropriate vector of linear combination coefficients \(\boldsymbol{w} \in \mathbb{R}^p\) such that the expression above is maximized. The quantity measures, in some sense, the signal-to-noise ratio for the class labeling, and it can be shown that the maximum separation occurs for \[ \boldsymbol{w} = (\Sigma_1 + \Sigma_2)^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1). \]
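In R, the optimal direction can be estimated by plugging the sample means and sample variance-covariance matrices into the formula above. A minimal sketch (the two small simulated samples below are illustrative only):

set.seed(1)
X <- matrix(rnorm(50 * 2, mean = 0), ncol = 2)   ## group 1: n = 50 observations in dimension p = 2
Y <- matrix(rnorm(40 * 2, mean = 2), ncol = 2)   ## group 2: m = 40 observations

mu1 <- colMeans(X); mu2 <- colMeans(Y)           ## sample means
S1 <- cov(X); S2 <- cov(Y)                       ## sample variance-covariance matrices

w <- solve(S1 + S2) %*% (mu2 - mu1)              ## direction maximizing the Fisher separation measure
w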



2. Discriminant analysis in R

Linear discriminant analysis (LDA) is implemented in the R software in the function lda(), which is available in the package ‘MASS’. Similarly, quadratic discriminant analysis (QDA) is implemented by the command qda() (also in the package ‘MASS’).

In the following, let us again consider the mtcars data set. We use the subset of the continuous covariates only and we search for a linear classification rule (a specific partition of the sample space \(\mathbb{R}^5\)) to classify the given observations (cars) by the number of cylinders (thus, we have three groups – cars with 4 cylinders, 6 cylinders, and 8 cylinders).

Considering a probabilistic model, there is a random vector \(\boldsymbol{X} = (mpg, hp, qsec, drat, disp)^\top \in \mathbb{R}^5\) and the random sample \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_{32}\) drawn from the distribution of the generic random vector \(\boldsymbol{X}\).

library(MASS)
lda1 <- lda(cyl ~ mpg + hp + qsec + drat + disp, data = mtcars)

pred <- predict(lda1)                       ## predicted classes, posterior probabilities, and LDA scores
ldahist(data = pred$x[,1], g = mtcars$cyl)  ## histograms of the first LDA scores by group
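The in-sample classification performance of the rule can be summarized by cross-tabulating the predicted and the true labels (a simple sketch only; note that this resubstitution error is typically optimistic):

table(predicted = pred$class, true = mtcars$cyl)
mean(pred$class == mtcars$cyl)   ## proportion of correctly classified cars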

From the geometrical point of view, linear discriminant analysis provides a linear partition of the \(5\)-dimensional sample space. A graphical illustration of the partition can be obtained by using the R function partimat() from the library klaR.

library("klaR")
partimat(as.factor(cyl) ~ mpg + hp + qsec + drat + disp, data = mtcars, method = "lda")

In the figure above, the partition of the sample space is shown for all possible pairs of covariates (two-dimensional scatterplots). Alternatively, we can visualize the same data (without the sample space partition) by using the standard command pairs().

pairs(mtcars[c("mpg", "hp", "qsec", "drat", "disp")], pch = 22, bg = c("red", "yellow", "green")[unclass(as.factor(mtcars$cyl))])

There are three groups; thus, there are two directions maximizing the between-to-within group variability ratio. This can all be visualized as follows:

tData <- cbind(as.matrix(mtcars[, c("mpg", "hp", "qsec", "drat", "disp")]) %*% lda1$scaling, mtcars$cyl)
m4 <- apply(tData[tData[,3] == 4, 1:2], 2, mean)
m6 <- apply(tData[tData[,3] == 6, 1:2], 2, mean)
m8 <- apply(tData[tData[,3] == 8, 1:2], 2, mean)
plot(tData[,2] ~ tData[,1], pch = 22, bg = c("red", "yellow", "green")[unclass(as.factor(mtcars$cyl))], xlab = "First LDA direction", ylab = "Second LDA direction")
points(m4[1], m4[2], cex = 2, pch = 21, bg = "red")
points(m6[1], m6[2], cex = 2, pch = 21, bg = "yellow")
points(m8[1], m8[2], cex = 2, pch = 21, bg = "green")
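Note that essentially the same two-dimensional projections are also returned directly by the predict() call above: pred$x contains the LDA scores centered at the (prior-weighted) overall mean. A quick check (a sketch only; with the default priors the two sets of scores should agree up to rounding):

head(pred$x)                                ## scores returned by predict()
head(scale(tData[, 1:2], scale = FALSE))    ## raw projections from above, centered at their column means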




Note


  • For a better graphical insight into the LDA results we can consider a simplification where we only use one continuous covariate as a predictor.

    lda2 <- lda(cyl ~ mpg, data = mtcars)
    lda2cv <- lda(cyl ~ mpg, data = mtcars, CV = T)

    Thus, the linear discriminant rule is given in terms of one scaling factor only. The corresponding transformations can be obtained as follows:

    lda_scores <- as.numeric(lda2$scaling) * mtcars$mpg
    par(mfrow = c(1,2))
    plot(lda_scores ~ mtcars$mpg, pch = 21, bg = mtcars$cyl, ylab ="LDA Scores", xlab = "Miles per Gallon")
    
    lda2cv <- cbind(lda2cv$posterior, mtcars$mpg, mtcars$cyl)
    lda2cv <- lda2cv[order(mtcars$mpg), ]
    
    plot(lda2cv[,1] ~ lda2cv[,4], pch = 21, bg = lda2cv[,5], ylab = "Posterior Probability", xlab = "Miles Per Gallon")
    lines(lda2cv[,1] ~ lda2cv[,4], col = "blue")
    points(lda2cv[,2] ~ lda2cv[,4], pch = 22, bg = lda2cv[,5])
    lines(lda2cv[,2] ~ lda2cv[,4], col = "violet")
    points(lda2cv[,3] ~ lda2cv[,4], pch = 23, bg = lda2cv[,5])
    lines(lda2cv[,3] ~ lda2cv[,4], col = "gray")
    
    for (i in 1:nrow(lda2cv)){
      lines(rep(lda2cv[i,4], 2), c(0,1), col = "gray", lty = 3)
    }
  • Consider an analogous situation for more than just one continuous predictor. Use some graphical tools to visualize the results.
  • Use some additional datapoints (feel free to make up some) and try to use the discrimination rules to assign the new points to the existing groups.
  • Consider the heteroscedastic version of the model and try to apply the quadratic discriminant function (command ‘qda()’ in the package ‘MASS’); a starting-point sketch is given below this list.
  • Consider, for instance, an alternative classification performed within the framework of a logistic regression model.
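As a starting point for the quadratic discriminant analysis item above, a minimal sketch (a reduced set of covariates is used here so that the group-wise variance-covariance matrices can be estimated reliably for the small groups):

qda1 <- qda(cyl ~ mpg + hp, data = mtcars)
table(predicted = predict(qda1)$class, true = mtcars$cyl)   ## in-sample classification table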



Mowers data: Comparison of various methods

Let us consider an artificial and balanced dataset where only three covariates are given for each subject (householder): the income of the householder, the size of the lot, and the type of the grass mower. Using the information about the income and the lot size, we want to find a classification rule for the mower type. Four different methods are used: a generalized linear model, linear discriminant analysis, quadratic discriminant analysis, and a linear regression model.

The data can be loaded by the command

Mowers <- read.table("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/mowers.txt", sep = "", header = T)

### 2D grid expansion
mowers.gr = expand.grid(seq(20,120,len=1000),seq(14,24,len=1000))
names(mowers.gr) = c("income","lot")

And the results of the four methods are the following:

par(mfrow=c(2,2))
cols <- c("red", "blue")
attach(Mowers)
mowe.ii = unique(mowers.gr$income)
mowe.ll = unique(mowers.gr$lot)

plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Generalized Linear Model")
mow.nn = glm(factor(riding)~income+lot,family=binomial,data=Mowers)

contour(mowe.ii,mowe.ll,
        matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="response")>0.5),1000,1000),
        nlev=1,drawlabels=FALSE,add=T)

plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Linear Discriminant Analysis")
mow.nn = lda(factor(riding)~income+lot,data=Mowers)
contour(mowe.ii,mowe.ll,
        matrix(as.numeric(predict(mow.nn,new=mowers.gr)$class),1000,1000),
        nlev=1,drawlabels=FALSE,add=T)

plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Quadratic Discriminant Analysis")
mow.nn = qda(factor(riding)~income+lot,data=Mowers)
contour(mowe.ii,mowe.ll,
        matrix(as.numeric(predict(mow.nn,new=mowers.gr)$class),1000,1000),
        nlev=1,drawlabels=FALSE,add=T)

plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Linear Regression Model")
mow.nn = lm(I(2*(riding-1)-1)~income+lot,data=Mowers)
contour(mowe.ii,mowe.ll,
        matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="response")>0),1000,1000),
        nlev=1,drawlabels=FALSE,add=T)

Note that the LDA approach and the standard linear regression perform equivalently (under the assumption that there are only two groups, the groups are labeled with plus/minus one and, finally, the data are balanced – the same number of observations in each group, 12 in this particular example).
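This equivalence can be checked numerically: the least squares coefficient vector of (income, lot) should be proportional to the LDA scaling vector (a minimal sketch; only the direction of the decision boundary is compared, not the scale):

mow.lda <- lda(factor(riding) ~ income + lot, data = Mowers)
mow.lm  <- lm(I(2*(riding-1)-1) ~ income + lot, data = Mowers)
coef(mow.lm)[c("income", "lot")] / mow.lda$scaling[, 1]   ## the two ratios should be (roughly) identical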

Another discrimination approach gaining more and more popularity is based on (deep) neural networks, which were designed to perform such tasks efficiently. Fitting a neural network amounts to a highly non-linear, non-convex optimization problem. Unfortunately, the results heavily depend on the initial configuration: compare, for instance, the following four outputs (the same neural network is applied to the same dataset in all four cases):

To run the command, one needs to install the corresponding package first: install.packages("nnet").

require(nnet)
## Loading required package: nnet
par(mfrow=c(2,2))

for (k in 1:4) {
  mow.nn = nnet(factor(riding)~income+lot,size=50,data=Mowers)
  plot(income,lot,pch=21 + riding, bg = cols[riding], main = "Neural Network")
  contour(mowe.ii,mowe.ll,
          matrix(as.numeric(predict(mow.nn,new=mowers.gr,type="class")),1000,1000),
          nlev=1,drawlabels=FALSE,add=T)
}
## # weights:  201
## initial  value 16.219449 
## iter  10 value 14.895377
## iter  20 value 12.209095
## iter  30 value 5.903848
## iter  40 value 3.668724
## iter  50 value 1.990298
## iter  60 value 0.436978
## iter  70 value 0.041113
## iter  80 value 0.006322
## iter  90 value 0.000724
## iter 100 value 0.000230
## final  value 0.000230 
## stopped after 100 iterations
## # weights:  201
## initial  value 17.509641 
## iter  10 value 14.928385
## iter  20 value 6.226953
## iter  30 value 4.109904
## iter  40 value 3.936640
## iter  50 value 3.184072
## iter  60 value 2.999208
## iter  70 value 2.863769
## iter  80 value 2.832370
## iter  90 value 2.829609
## iter 100 value 2.816815
## final  value 2.816815 
## stopped after 100 iterations
## # weights:  201
## initial  value 17.474691 
## iter  10 value 15.661115
## iter  20 value 13.518344
## iter  30 value 11.516171
## iter  40 value 6.163555
## iter  50 value 0.393009
## iter  60 value 0.001044
## final  value 0.000036 
## converged
## # weights:  201
## initial  value 65.970186 
## iter  10 value 16.047940
## iter  20 value 15.334066
## iter  30 value 11.388040
## iter  40 value 7.432112
## iter  50 value 4.868239
## iter  60 value 2.988978
## iter  70 value 2.481623
## iter  80 value 2.280658
## iter  90 value 2.243860
## iter 100 value 2.162267
## final  value 2.162267 
## stopped after 100 iterations

In order to get some meaningful results, one usually runs the neural network several times and averages the final result over all runs.
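A minimal sketch of such averaging (assuming the objects defined above, with Mowers still attached): the fitted class-two probabilities of four independently initialized runs are averaged, and the averaged probability then defines the classification boundary.

par(mfrow = c(1,1))
plot(income, lot, pch = 21 + riding, bg = cols[riding], main = "Averaged Neural Network")
probs <- matrix(0, nrow = nrow(mowers.gr), ncol = 4)
for (k in 1:4) {
  fit <- nnet(factor(riding) ~ income + lot, size = 50, data = Mowers, trace = FALSE)
  probs[, k] <- predict(fit, newdata = mowers.gr, type = "raw")   ## fitted probabilities on the grid
}
contour(mowe.ii, mowe.ll,
        matrix(as.numeric(rowMeans(probs) > 0.5), 1000, 1000),
        nlev = 1, drawlabels = FALSE, add = TRUE)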



Other very popular classification methods (however, not that transparent from the mathematical point of view) are, for instance, decision trees (command ctree() in the R package ‘party’, install.packages("party")), support vector machines (command ksvm() in the R package ‘kernlab’, install.packages("kernlab")), or the so-called K-nearest neighbor approach (command knn() in the R package ‘class’, install.packages("class")).



Decision trees

library("party")

plot(ctree(factor(riding)~income+lot,data=Mowers))



K-nearest neighbors

library("class")

train <- mtcars[1:25,c("mpg", "hp", "qsec", "drat", "disp")]
test <- mtcars[26:32,c("mpg", "hp", "qsec", "drat", "disp")]

c1 <- as.factor(mtcars$cyl[1:25])

knn(train, test, c1, k = 3, prob=TRUE)
## [1] 4 4 4 8 6 8 4
## attr(,"prob")
## [1] 1 1 1 1 1 1 1
## Levels: 4 6 8
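Support vector machines

A brief illustration of the support vector machine classifier mentioned above (a sketch only; the package ‘kernlab’ must be installed first and the default Gaussian kernel of ksvm() is used):

library("kernlab")

mow.svm <- ksvm(factor(riding) ~ income + lot, data = Mowers)
table(predicted = predict(mow.svm, Mowers), true = Mowers$riding)   ## in-sample classification table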



Homework Assignment

(Deadline: Lab Session 13 / 10.05.2022)


  • Use some appropriate data set (ideally the same one you used for your previous homework assignments) and apply linear discriminant analysis for classification. Consider two groups – either given by some covariate which is already in the data or defined by some arbitrary binary covariate based on a preliminary exploratory analysis (performed in the previous homework assignments).
  • Use at least one additional alternative classification approach (e.g., linear regression, glm, neural networks, or qda) and compare the results. Provide some graphical outputs and some brief interpretation/explanation for both results.
  • Add your solution to your personal web page by Monday, 08.05.2022, 23:59 (CET) at the latest.