NMST539 | Lab (lecture) Session 11

Mixture models & model based clustering

Summer Term 2021/2022 | 02/05/22

source file (UTF8 encoding)

The R software is available for download (free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introduction into R (in Czech or English):

  • Bína, V., Komárek, A., and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)



Theoretical references
  • Härdle, W. and Simar, L.: Applied Multivariate Statistical Analysis. Springer, 2015.
  • Mardia, K., Kent, J., and Bibby, J.: Multivariate Analysis. Academic Press, 1979.

1. Mixture models

Mixture models are probabilistic models for representing different sub-populations within one overall (mixture) population, without requiring that the observed data provide any information about which sub-population an individual observation belongs to. Mixture models are therefore also known as model-based clustering methods (in contrast to so-called distribution-free clustering approaches). In general, in cluster analysis there is no explicit information about the underlying clusters; instead, the information provided in the data – i.e., the different covariates – is used to assign the available observations into a set of disjoint clusters. This is performed by means of some similarity/dissimilarity measure – a quantitative measure of the difference between a pair of observations. Recall that, unlike in discriminant analysis, where the true assignment into clusters is known at least for the training data, in model-based clustering (and in clustering in general) the true assignment into clusters is unknown.

The clustering algorithms discussed so far (last lab session) were all based on a specific distance/proximity measure (e.g., a matrix of distances or some dissimilarity matrix). Such a measure, recorded for all pairs of the given observations, was the only input needed for the clustering algorithms to work. In particular, no stochastic character of the underlying data-generating process was assumed, nor was any probabilistic model postulated as the theoretical background.

Clustering approaches which take into account the stochastic nature of the data (the probability distribution of the data) are based on mixture models (which should not be confused with mixed effects models).

From the sampling point of view, the mixture model assumes, in general, that each observation from the given sample belongs to one of \(K \in \mathbb{N}\) different sub-populations (distributions) that the overall population (mixture) is composed of. The number of sub-populations \(K \in \mathbb{N}\) is usually an a priori given number. The standard task is to assign the given observations into the \(K\) unknown clusters (sub-populations). However, in addition to the assignment itself (which is also what the distribution-free clustering algorithms do, such as the K-means algorithm or the various hierarchical methods), model-based clustering also provides an estimate of the underlying theoretical distribution of the mixture population. Thus, the output of a typical model-based clustering algorithm is usually rather complex and involves, for instance, the estimated density of the underlying mixture distribution (for instance in terms of some estimated parameters) and also the estimated probabilities (weights) of the sub-populations within the overall population.

The given observations are then assigned into the clusters by calculating subject-specific posterior probabilities. The assignment into clusters is therefore not as clear-cut as for the distribution-free algorithms. On the other hand, one usually needs to impose stricter assumptions for the mixture models to work properly.



In statistics, we usually consider some specific family of parametric distributions, for instance \[ \mathcal{F} = \{f_{\boldsymbol{\theta}};~\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^p\}, \]

where \(f_{\boldsymbol{\theta}}(\boldsymbol{x}) \equiv f(\boldsymbol{x}, \boldsymbol{\theta})\) for some \(\boldsymbol{x} \in \mathbb{R}^k\). Usually a random sample is assumed to be drawn from some unknown distribution (density function) \(f_\boldsymbol{\theta} \in \mathcal{F}\) and the unknown parameter is estimated from the given sample. A specific parametric form of the underlying probabilistic model is assumed within the given family \(\mathcal{F}\) but the underlying distribution is left unknown due to the unknown value of the (vector) parameter \(\boldsymbol{\theta} \in \Theta\).
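
As a toy illustration of this classical setting, one can simulate a sample from a normal distribution and estimate the unknown parameters by maximum likelihood, for instance via MASS::fitdistr() (a minimal sketch; the simulated data and the chosen parameter values are arbitrary):

### a toy illustration: estimating the unknown parameter within a parametric family
### (here F is the family of normal densities; the data are simulated for illustration only)
set.seed(1)
x <- rnorm(100, mean = 2, sd = 1.5)
MASS::fitdistr(x, "normal")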



For mixture models, somewhat more complex families of distributions are defined. For instance, one can define a set of parametric mixtures consisting of sub-population distributions of the same type:

\[ \mathcal{M} = \Big\{p;~p(\boldsymbol{x}) = \sum_{j = 1}^J \pi_j f(\boldsymbol{x}, \boldsymbol{\theta}_j), \pi_j > 0, \sum_{j = 1}^J \pi_j = 1, J \in \mathbb{N}, f_{\boldsymbol{\theta}_j} \in \mathcal{F}, \forall j = 1, \dots, J\Big\}. \]

Model-based clustering uses a dataset drawn from some (univariate/multivariate) distribution \(p \in \mathcal{M}\), and the output of such clustering is usually provided in terms of the estimated weights – probabilities \(\widehat{\pi}_j\), for \(j = 1, \dots, J\) – and, also, the set of estimated parameters \(\widehat{\boldsymbol{\theta}}_1, \dots, \widehat{\boldsymbol{\theta}}_J\).
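
The assignment of an individual observation \(\boldsymbol{x}\) into the clusters is then driven by the posterior membership probabilities \[ \tau_j(\boldsymbol{x}) = \frac{\widehat{\pi}_j f(\boldsymbol{x}, \widehat{\boldsymbol{\theta}}_j)}{\sum_{l = 1}^{J} \widehat{\pi}_l f(\boldsymbol{x}, \widehat{\boldsymbol{\theta}}_l)}, \quad j = 1, \dots, J, \] and an observation is typically assigned into the cluster with the largest posterior probability.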

The number of clusters is usually given a priori; nevertheless, it can also be estimated by the model-based clustering algorithm, but additional regularization constraints are then usually needed to guarantee a proper solution.



For illustration purposes we use the dataset mtcars available under the standard R installation. The dataset contains 32 independent observations (32 automobiles) and 11 different covariates (continuous as well as discrete).

rm(list = ls())
attach(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1



Using the known nature of the data (and the fact that there are three different types of automobiles when referring to the number of cylinders), we will assume that the observations are randomly drawn from a mixture model consisting of three different sub-populations (thus, \(K = 3\)) identified by the number of cylinders of the given car (in the given dataset there are only cars with either four, six, or eight cylinders – the information is provided within the covariate denoted as cyl).

For simplicity, we will further only be interested in a univariate sample representing the car consumption (miles per gallon) in terms of the random sample \(X_1, \dots, X_{32} \sim p(x)\), where \(p \in \mathcal{M}\) for \(\mathcal{M}\) being a set of Gaussian mixtures with three sub-populations only.

boxplot(mpg ~ cyl, data = mtcars, col = "gray", xlab = "Sub-population by the number of cylinders", ylab = "Car consumption", main = "")

From the theoretical point of view, for each sub-population we can define specific qualitative and quantitative characteristics – such as, for instance, the sub-population (conditional) mean, variance, or median, or more complex and complete characteristics such as the density, the distribution function, or the characteristic function (and many others, of course). Analogous characteristics can also be defined jointly – for the whole mixture population.

From the empirical point of view, the available data can be used to calculate the corresponding empirical surrogate of each theoretical characteristic – the sample mean, the sample variance, etc. (jointly or conditionally). Analogously, one can also estimate the proportions of cars with four, six, and eight cylinders. Once such information is obtained, one can easily estimate the underlying model using the restrictions implied by a specific parametric mixture model (in terms of the set \(\mathcal{M}\)).
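
For instance, the conditional sample means, the conditional sample standard deviations, and the estimated sub-population proportions can be obtained directly (a brief illustration using the attached mtcars data introduced above):

### conditional (sub-population) sample characteristics and estimated proportions
tapply(mpg, cyl, mean)
tapply(mpg, cyl, sd)
table(cyl) / nrow(mtcars)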

Various estimation techniques can be used to estimate the overall mixture model and the corresponding distributions of the mixture sub-populations (e.g., the method of moments, the generalized method of moments, or the maximum likelihood theory). Iterative optimization algorithms can be used as well.



Let us assume that the car consumption – the variable mpg – is (conditionally) normally distributed and the specific normal distribution depends on the number of cylinders. Thus, the overall distribution is a mixture of three normal distributions, where each may have a different mean parameter but the variance parameter is assumed to be equal to one. The estimation of the overall mixture is now relatively straightforward – the mixture model with three normal sub-populations may be estimated in steps as follows:

### population mean estimates
mean1 <- mean(mpg[cyl == 4])
mean2 <- mean(mpg[cyl == 6])
mean3 <- mean(mpg[cyl == 8])

### estimated proportions for each population
prop <- table(cyl)/dim(mtcars)[1]

### the overall mixture distribution
mixture <- function(x, prop, means){
  mixValue <- 0
  for (i in 1:length(prop)){
    mixValue <- mixValue + prop[i] * dnorm(x, means[i])
  }
  return(mixValue)
}

xSeq <- seq(10, 34, length = 100)
density <- mixture(xSeq, prop, c(mean1, mean2, mean3))
plot(density ~ xSeq, col = "red", type = "l", xlab = "Miles per Gallon", ylab = "Density", ylim = c(0, 0.2))

A very restricted model is used in the picture above (Gaussian mixture, homoscedasticity with unit variances, sample proportions used for the weights); nevertheless, the estimated mixture model can still be compared with the underlying data:

hist(mpg, freq = F, xlab = "Miles per Gallon", breaks = 15, ylim = c(0,0.2), col = "lightgray")
lines(density ~ xSeq, col = "red", type = "l", xlab = "Miles per Gallon", ylab = "Density")

This does not look particularly good (the theoretical model does not seem to correspond to the empirical data), as the model in the figure is heavily restricted by the underlying assumptions. However, assuming that the true model holds and simulating data from the underlying mixture, much better agreement is obtained:

N <- 1000 
sub <- rmultinom(N, 1, prob = prop)

sample <- numeric(N)   ### empty vector for the simulated sample
sample[sub[1,] == 1] <- rnorm(sum(sub[1,]), mean1, 1)
sample[sub[2,] == 1] <- rnorm(sum(sub[2,]), mean2, 1)
sample[sub[3,] == 1] <- rnorm(sum(sub[3,]), mean3, 1)

hist(sample, freq = F, xlab = "Miles per Gallon", breaks = 50, ylim = c(0,0.2), col = "lightgray")
lines(density ~ xSeq, col = "red", type = "l", xlab = "Miles per Gallon", ylab = "Density")

For simplicity, we assumed a unit variance for all three mixture components – the three underlying clusters. However, for some observations it is not clear to which cluster they are supposed to belong. Note that the three clusters defined by the three underlying normal distributions are not disjoint – unlike the clusters produced by the distribution-free clustering algorithms.

plot(density ~ xSeq, col = "red", type = "l", lwd = 3, xlab = "Miles per Gallon", ylab = "Density")
d1 <- density(rnorm(10000, mean1, 1))
d2 <- density(rnorm(10000, mean2, 1))
d3 <- density(rnorm(10000, mean3, 1))
lines(d1$x, prop[1] * d1$y, col = "black", lwd = 2)
lines(d2$x, prop[2] * d2$y, col = "violet", lwd = 2)
lines(d3$x, prop[3] * d3$y, col = "gray", lwd = 2)
lines(c(17.5, 17.5), c(0, 0.15), lwd = 2, col = "black", lty = 2)



The estimated mixture model – the estimated overall density – has a few different maxima – one for each mixture component (sub-population). The local maxima are located (approximately) at the estimated conditional means \(\widehat{\mu}_j \in \mathbb{R}\), for \(j = 1, 2, 3\). In other words, the estimated means are the empirical surrogates for \[ \mu_j = E[X | j], \quad j = 1, 2, 3, \] which stands for the expected consumption (miles per gallon) of a car from cluster \(j \in \{1, 2, 3\}\). On the other hand, for sub-populations which are “close to each other”, it is likely that the final mixture density will be unimodal (just one maximum).

A non-unique assignment of some observations is also clear from the figure above: for instance, an automobile with a mileage (consumption) of 15 miles per gallon would intuitively be assigned to the sub-population of cars with 8 cylinders; however, for another automobile with a mileage of 17.5 miles per gallon there would probably be some doubts whether the car should be assigned among the eight- or six-cylinder cars (the values of the mixture components may seem roughly the same there, however, the posterior probabilities of inclusion into the two sub-populations are not the same).

denom <- prop[1] * dnorm(17.5, mean3, 1) + prop[2] * dnorm(17.5, mean2, 1) + prop[3] * dnorm(17.5, mean1, 1)
(prop[1] * dnorm(17.5, mean3, 1) / denom)
##         4 
## 0.5217835
(prop[2] * dnorm(17.5, mean2, 1) / denom)
##         6 
## 0.4782165
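
The same type of calculation can be written more compactly for all three clusters at once; the small helper below is just a sketch (the function name posteriorProb is ours, not a package function), and it pairs each estimated weight with its own cluster mean (and, optionally, its own standard deviation):

### a small helper (sketch): posterior membership probabilities of all clusters at a given value x0
posteriorProb <- function(x0, prop, means, sds = rep(1, length(means))){
  numer <- prop * dnorm(x0, means, sds)   # pi_j * f_j(x0), cluster by cluster
  numer / sum(numer)
}
round(posteriorProb(17.5, prop, c(mean1, mean2, mean3)), 4)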



In practical applications the overall complexity of the mixture models in use is further increased. For instance, the number of sub-populations can also be treated as an unknown parameter which must be estimated. Different parametric distributions (families) can be used to form the mixture sub-populations, or even non-parametric approaches can be applied. For parametric mixtures the maximum likelihood theory is usually used to estimate the mixture (thus, the underlying family of distributions must be imposed). The Expectation-Maximization (EM) algorithm is usually applied for more complex mixture models. Additional regularity assumptions are, however, always needed to obtain reasonable estimates.
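
To make the E-step and the M-step concrete, a minimal EM sketch for a univariate Gaussian mixture with fixed unit variances may look as follows (the function emGaussian, the number of iterations, and the starting values are purely illustrative choices, not a package function):

### a minimal EM sketch for a univariate Gaussian mixture with fixed unit variances
emGaussian <- function(x, means, weights, maxIter = 200){
  for (iter in 1:maxIter){
    ### E-step: posterior membership probabilities of each observation
    dens <- sapply(means, function(m) dnorm(x, m, 1))   # n x K matrix of component densities
    post <- sweep(dens, 2, weights, "*")                 # multiply each column by its weight
    post <- post / rowSums(post)                         # normalize row-wise
    ### M-step: update the mixing weights and the component means
    weights <- colMeans(post)
    means   <- colSums(post * x) / colSums(post)
  }
  list(weights = weights, means = means)
}
emGaussian(mtcars$mpg, means = c(15, 20, 27), weights = rep(1/3, 3))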

Individual task


  • Unrestricted maximum likelihood: \[ \textrm{maximize} \prod_{k = 1}^{K} \prod_{i = 1}^{n_k} f_k(X_{k i}) \equiv \textrm{maximize} \prod_{k = 1}^{K} \prod_{i = 1}^{n_k} \frac{1}{\sqrt{2 \pi}} \cdot \exp \left\{-\frac{1}{2} (X_{k i} - \mu_k)^2\right\}, \] where the maximization takes place with respect to the number of sub-populations/clusters \(K \in \{1, \dots, n\}\) and the cluster-specific mean parameters \(\mu_1, \dots, \mu_K \in \mathbb{R}\) (for \(n \in \mathbb{N}\) being the overall number of observations and \(n_k \in \mathbb{N}\) the number of subjects in the \(k\)-th sub-population), leads to useless estimates (a “Dirac explosion”). The resulting “mixture” model would look something like this:

    plot(density ~ xSeq, col = "gray", type = "l", lwd = 2, xlab = "Miles per Gallon", ylab = "Density", lty = 2, ylim = c(0,0.5))
    for (i in 1:length(mpg)){
      lines(rep(mpg[i], 2), c(dnorm(0), 0), col = "red", lwd = 1, lty = 3)
      points(mpg[i], dnorm(0), pch = 21, cex = 0.5, bg = "red")
    }

  • Do you know where the problem is in the maximization formulation above?
  • How would you interpret the “mixture” estimated in the figure above?
  • Can you think of some regularization conditions which would ensure some more reasonable estimate of the underlying mixture?



An alternative mixture model takes into account the possible heteroscedasticity of the sub-populations (while still assuming three underlying mixture components).

### conditional (sub-population) standard deviation estimates
sd1 <- sd(mpg[cyl == 4])
sd2 <- sd(mpg[cyl == 6])
sd3 <- sd(mpg[cyl == 8])

### the overall mixture distribution
mixtureHetero <- function(x, prop, means, sd){
  mixValue <- 0
  for (i in 1:length(prop)){
    mixValue <- mixValue + prop[i] * dnorm(x, means[i], sd[i])
  }
  return(mixValue)
}

xSeq <- seq(10, 34, length = 100)
density <- mixtureHetero(xSeq, prop, c(mean1, mean2, mean3), c(sd1, sd2, sd3))
hist(mpg, freq = F, xlab = "Miles per Gallon", breaks = 15, ylim = c(0,0.2), col = "lightgray")
lines(density ~ xSeq, col = "red", type = "l", xlab = "Miles per Gallon", ylab = "Density", lwd = 2)
lines(c(17.5, 17.5), c(0, 0.20), col = "black", lty = 2)

How would you assign the car with a consumption of 17.5 miles per gallon now? Is it possible to take into account the different variability of the cars with eight and six cylinders? Can you use the estimated mixture to estimate the posterior probabilities?

denom <- prop[1] * dnorm(17.5, mean3, sd3) + prop[2] * dnorm(17.5, mean2, sd2) + prop[3] * dnorm(17.5, mean1, sd1)
(prop[1] * dnorm(17.5, mean3, sd3) / denom)
##         4 
## 0.5983844
(prop[2] * dnorm(17.5, mean2, sd2) / denom)
##         6 
## 0.3164807
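
Using the small helper posteriorProb sketched earlier, the heteroscedastic posterior probabilities of all three clusters (each weight and standard deviation paired with its own cluster) are obtained as:

### heteroscedastic posterior membership probabilities at x0 = 17.5 (sketch)
round(posteriorProb(17.5, prop, c(mean1, mean2, mean3), c(sd1, sd2, sd3)), 4)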

Recall that the proportions of the cars with four, six, and eight cylinders are

table(mtcars$cyl)/nrow(mtcars)
## 
##       4       6       8 
## 0.34375 0.21875 0.43750



2. General mixture models

A mixture model can actually contain sub-populations from different parametric families (which is also more common in practical applications). It is also possible to define a mixture model on a semi-parametric or even nonparametric basis.

For parametric mixtures it is crucial to properly define the overall likelihood, which is later used to estimate the mixture model itself and the corresponding sub-populations. Semi-parametric and nonparametric mixtures are alternatively estimated by adapting the EM algorithm, for instance.

Very common (regularization) assumptions require a given number of clusters (sub-populations) or, if the number of clusters is left unknown, some overall shape restrictions are formulated instead (e.g., unimodality, log-concavity, etc.).

A relatively general class of mixture models can be fitted in R using the library mixtools. The most commonly used function, which fits normal mixture models (with a predefined number of sub-populations) by means of the EM algorithm, is normalmixEM().

library("mixtools")
## mixtools package, version 1.2.0, Released 2020-02-05
## This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772.
mixture1 <- normalmixEM(mtcars$mpg, lambda = .5, mu = c(mean1, mean2, mean3), sigma = 1)
## number of iterations= 60
plot(mixture1, density=TRUE, main2="Consumption", xlab2="Miles per Gallon")

Can you explain the differences that you observe between the figure above (the EM algorithm) and one of the plots at the beginning (the normal mixture with unit variances)? Compare the following outputs:

mixture1$lambda
## [1] 0.1617218 0.4220795 0.4161986
prop
## cyl
##       4       6       8 
## 0.34375 0.21875 0.43750

And also with the following:

mixture2 <- normalmixEM(mtcars$mpg, lambda = .5, mu = c(mean1, mean2, mean3), sigma = c(1,1,1))
## number of iterations= 707
plot(mixture2, density=TRUE, main2="Consumption", xlab2="Miles per Gallon")
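
The fitted mixEM objects also carry the subject-specific posterior membership probabilities in the component posterior; these give the soft assignment of each car into the three clusters, and a hard assignment can be derived, for instance, as follows:

### soft and hard cluster assignments from the fitted normal mixture
head(round(mixture2$posterior, 3))
apply(mixture2$posterior, 1, which.max)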



Nonparametric mixture models can be obtained by using the function spEMsymloc() (also from the R package mixtools).

mixture3 <- spEMsymloc(mtcars$mpg, mu0 = c(mean1, mean2, mean3), bw=1)
plot(mixture3, xlab = "Miles per Gallon")



Some other R libraries which allow fitting mixture models are, for instance, mixR, mixture, rebmix, or flexmix.
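
For illustration, the same three-component Gaussian mixture could presumably also be fitted with the flexmix package roughly along the following lines (an untested sketch of its basic interface):

### a sketch of the univariate Gaussian mixture fitted via the flexmix package
library("flexmix")
set.seed(1)
fitFlex <- flexmix(mpg ~ 1, data = mtcars, k = 3)
parameters(fitFlex)       # component-specific parameter estimates
prior(fitFlex)            # estimated mixing proportions
table(clusters(fitFlex))  # hard cluster assignment of the 32 cars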

3. Mixture models in regression

Mixture models are also commonly applied in various regression settings. For a brief illustration we can refer to a standard GLM (for Poisson counts, say, with the logarithmic link function). In real-life situations it often happens that the collected random sample contains some observations which do not correspond well enough with the underlying theoretical model (e.g., too many zeros – no cases – occurring for the Poisson counts). For instance, we may be interested in modelling the number of positive Covid-19 cases (during a day, for instance, or within a specific geographical location). As long as the prevalence is relatively low, it is very likely that a random subject interviewed on the street is not infected (thus, no case). This results in a data sample which does not correspond with the theoretical Poisson model (too many zeros).

This can, however, be improved by considering a relatively simple mixture model known as the “zero-inflated” Poisson model, which is actually a mixture of two components:

\[ Y_i \,|\, \boldsymbol{x}_i \sim \left\{ \begin{array}{ll} 0 & \textrm{with probability } p \in (0,1);\\ Poiss(\lambda_i) & \textrm{with probability } 1 - p. \end{array} \right. \]

The first component – the Dirac measure at zero – is used to model the negative results (no cases) and it takes care of the disproportionately many zeros in the data. The second component of the mixture – a standard Poisson model – is used to model the positive cases and their overall number in some given period of time or within some geographical location. Additional information may be recorded for each observation (so-called “subject-specific” covariates; note, however, that in this particular situation the “subject-specific” information does not refer to individual persons).

Remember that each observation of the sample provides information summarized over some period of time or some geographical location (thus, over many subjects/patients/individuals). The subject-specific covariates may therefore reflect some important information about the follow-up window, location parameters, etc.

Considering the zero-inflated Poisson model, the probability of observing “no case” can be expressed as

\[P[Y_i = 0 \,|\, \boldsymbol{x}_i] = p + (1 - p) \cdot e^{- \lambda_i},\]

while the probability of observing \(k > 0\) cases is given (by the theory of the Poisson distribution) as

\[P[Y_i = k \,|\, \boldsymbol{x}_i] = (1 - p) \cdot \frac{\lambda_i^k}{k!}\, e^{- \lambda_i},\]

where for the mean of the Poisson component the standard GLM regression structure with the logarithmic link is assumed,

\[\log(\lambda_i) = \boldsymbol{x}_i^\top \boldsymbol{\beta}, \quad \textrm{i.e.,} \quad \lambda_i = e^{\boldsymbol{x}_i^\top \boldsymbol{\beta}},\]

with an unknown parameter vector \(\boldsymbol{\beta} \in \mathbb{R}^p\). The final mixture is again estimated by maximum likelihood. In the R software, the function zeroinfl() from the library ‘pscl’ can be used.
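
A small sketch of how such a model might be fitted with zeroinfl() is given below; the data are simulated and the covariate x is made up purely for illustration (the formula y ~ x | 1 specifies the Poisson part on the left of the bar and an intercept-only zero-inflation part on the right):

### zero-inflated Poisson regression on simulated data (illustration only)
library("pscl")
set.seed(1)
n    <- 500
x    <- runif(n)
lam  <- exp(0.5 + 1.2 * x)             # Poisson mean under the logarithmic link
zero <- rbinom(n, 1, prob = 0.3)       # structural zeros with probability p = 0.3
y    <- ifelse(zero == 1, 0, rpois(n, lam))
fit  <- zeroinfl(y ~ x | 1, data = data.frame(y = y, x = x))
summary(fit)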

Note

Zero-inflated mixture models are very common in mathematical statistics (see, for instance, the zero-inflated binomial model, the zero-inflated negative binomial model, or zero-inflated GLM models in general). From the overall point of view, it is a very rich and flexible class of regression models. The maximum likelihood theory allows for relatively simple estimation. For more theoretical details on zero-inflated models in statistics we refer to

Zuur and Ieno (2016). Beginner’s Guide to Zero-Inflated Models with R.