NMST539 | Lab (lecture) Session 11
Mixture models & model-based clustering
Summer Term 2021/2022 | 02/05/22
The R software is available for download (free of charge) from the website: https://www.r-project.org. A user-friendly interface (one of many): RStudio. Manuals and an introduction to R are available in Czech or English.
1. Mixture models

Mixture models are probabilistic models for representing different sub-populations within one overall (mixture) population, without requiring that the observed data provide any information about which sub-population an individual observation belongs to. Mixture models are therefore also known as model-based clustering methods (in contrast to so-called distribution-free clustering approaches). In cluster analysis in general there is no explicit information about the underlying clusters; instead, the information provided in the data – i.e., the different covariates – is used to assign the available observations to a set of disjoint clusters. This is done by means of some similarity/dissimilarity measure – a quantitative measure of the difference between a pair of observations. Recall that, unlike in discriminant analysis, where the true cluster assignment is known at least for the training data, in model-based clustering (and clustering in general) the true assignment is unknown. The clustering algorithms discussed so far (last lab session) were all based on a specific distance/proximity measure (e.g., a matrix of distances, or some dissimilarity matrix). This measure, recorded for all pairs of the given observations, was the only input the clustering algorithms needed. In particular, no stochastic character of the underlying data-generating process was assumed, and no probabilistic model was postulated as the theoretical background. Clustering approaches which take into account the stochastic nature of the data (the probabilistic distribution of the data) are based on mixture models (which should not be confused with mixed-effects models).
Taking the sample point of view, a mixture model assumes, in general, that each observation in the given sample belongs to one of \(K \in \mathbb{N}\) different sub-populations (distributions) of which the overall population (mixture) is composed. The number of sub-populations \(K \in \mathbb{N}\) is usually an a-priori assumption (a given number). The standard task is to assign the given observations to the \(K\) unknown clusters (sub-populations). However, in addition to the assignment itself (which is also what the distribution-free clustering algorithms do, such as the K-means algorithm or the various hierarchical methods), model-based clustering also provides an overall estimate of the underlying theoretical distribution of the mixture population. Thus, the output of a typical model-based clustering algorithm is usually quite complex: it includes, for instance, the estimated density of the underlying mixture distribution (e.g., in terms of some estimated parameters) and also the estimated probabilities (weights) of the sub-populations within the overall population. The given observations are then assigned to the clusters by calculating subject-specific posterior probabilities. The assignment to clusters is therefore not as clear-cut as for the distribution-free algorithms. On the other hand, one usually needs to impose stricter assumptions for the mixture models to work properly. In statistics, we usually consider some specific family of parametric distributions, for instance \[ \mathcal{F} = \{f_{\boldsymbol{\theta}};~\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^p\}, \] where \(f_{\boldsymbol{\theta}}(\boldsymbol{x}) \equiv f(\boldsymbol{x}, \boldsymbol{\theta})\) for some \(\boldsymbol{x} \in \mathbb{R}^k\). A random sample is assumed to be drawn from some unknown distribution (density function) \(f_{\boldsymbol{\theta}} \in \mathcal{F}\) and the unknown parameter is estimated from the given sample.
A specific parametric form of the underlying probabilistic model is assumed within the given family \(\mathcal{F}\), but the underlying distribution remains unknown due to the unknown value of the (vector) parameter \(\boldsymbol{\theta} \in \Theta\). For mixture models, somewhat more complex families of distributions are defined. For instance, one can define a set of parametric mixtures consisting of sub-population distributions of the same type: \[ \mathcal{M} = \Big\{p;~p(\boldsymbol{x}) = \sum_{j = 1}^J \pi_j f(\boldsymbol{x}, \boldsymbol{\theta}_j), \pi_j > 0, \sum_{j = 1}^J \pi_j = 1, J \in \mathbb{N}, f_{\boldsymbol{\theta}_j} \in \mathcal{F}, \forall j = 1, \dots, J\Big\}. \] Model-based clustering uses a dataset drawn from some (univariate/multivariate) distribution \(p \in \mathcal{M}\), and the output of such clustering is usually provided in terms of the estimated weights – probabilities \(\widehat{\pi}_j\), for \(j = 1, \dots, J\) – and also the set of estimated parameters \(\widehat{\boldsymbol{\theta}}_1, \dots, \widehat{\boldsymbol{\theta}}_J\). The number of clusters is usually given a priori; it can also be estimated by the model-based clustering algorithm, but additional regularization constraints are then usually needed to guarantee a proper solution. For illustration purposes we use the mtcars dataset available in R.
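A quick first look at the data (the mtcars dataset ships with base R; the commands below are a minimal sketch):

```r
# mtcars: 32 cars, fuel consumption (mpg), number of cylinders (cyl),
# and further design covariates
data(mtcars)
dim(mtcars)                      # 32 observations, 11 covariates
table(mtcars$cyl)                # counts of cars with 4, 6, and 8 cylinders
prop.table(table(mtcars$cyl))    # sample proportions (natural weight estimates)
```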
Using the known nature of the data (and the fact that there are three different types of automobiles with respect to the number of cylinders) we will assume that the observations are randomly drawn from a mixture model consisting of three different sub-populations (thus, \(K = 3\)), identified by the number of cylinders of the given car (in the given dataset there are only cars with either four, six, or eight cylinders – the information is provided in the covariate denoted cyl). For simplicity, we will further only be interested in a univariate sample representing the car consumption (miles per gallon), in terms of the random sample \(X_1, \dots, X_{32} \sim p(x)\), where \(p \in \mathcal{M}\) for \(\mathcal{M}\) being a set of Gaussian mixtures with exactly three sub-populations.
From the theoretical point of view, for each sub-population we can define specific qualitative and quantitative characteristics – such as, for instance, the sub-population (conditional) mean, variance, or median, or complex and complete characteristics such as the density, the distribution function, or the characteristic function (and many others, of course). Analogous characteristics can also be defined jointly – for the whole mixture population. From the empirical point of view, the available data can be used to calculate the corresponding empirical surrogate for each theoretical characteristic – the sample mean, sample variance, etc. (jointly or conditionally). Analogously, one can also estimate the proportions of the cars with four, six, and eight cylinders. Once such information is obtained, one can easily estimate the underlying model using the restrictions imposed by a specific parametric mixture model (in terms of the set \(\mathcal{M}\)). Various estimation techniques can be used to estimate the overall mixture model and the corresponding distributions of the mixture sub-populations (e.g., the method of moments, the generalized method of moments, or maximum likelihood theory). Iterative optimization algorithms can be used as well. Let us assume that the car consumption – the variable mpg – follows such a three-component Gaussian mixture.
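The empirical surrogates described above can be plugged directly into the mixture density. A minimal sketch (the common standard deviation below is our own crude homoscedasticity choice, not the only option):

```r
# Plug-in estimate of a homoscedastic three-component Gaussian mixture
# for the mpg variable of mtcars
data(mtcars)
x   <- mtcars$mpg
cyl <- mtcars$cyl

pi.hat <- prop.table(table(cyl))  # estimated weights (sample proportions)
mu.hat <- tapply(x, cyl, mean)    # conditional sample means within clusters
s.hat  <- sd(x)                   # one common sd (homoscedasticity assumption)

# mixture density p(t) = sum_j pi_j * N(t; mu_j, s^2)
mix.dens <- function(t)
  pi.hat[1] * dnorm(t, mu.hat[1], s.hat) +
  pi.hat[2] * dnorm(t, mu.hat[2], s.hat) +
  pi.hat[3] * dnorm(t, mu.hat[3], s.hat)

hist(x, freq = FALSE, breaks = 10, main = "mpg with plug-in mixture density")
curve(mix.dens, from = 5, to = 40, add = TRUE, lwd = 2)
```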
A very restricted model is used in the picture above (Gaussian mixture, homoscedasticity, sample proportions used as the weights); nevertheless, the estimated mixture model can still be compared with the underlying data:
This does not fit particularly well (the theoretical model does not seem to correspond to the empirical data), as the model in the figure is heavily restricted by the underlying assumptions. However, if we assume that this model is true and simulate data from the underlying mixture, much better agreement is obtained:
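One way to perform such a simulation: first draw the latent component label, then draw from the corresponding normal component. A sketch (the seed, the sample size, and the rounded parameter values are our own illustrative choices):

```r
set.seed(1234)                       # for reproducibility only
n      <- 500
pi.hat <- c(11, 7, 14) / 32          # sample proportions of 4/6/8 cylinders
mu.hat <- c(26.66, 19.74, 15.10)     # conditional sample means of mpg (rounded)

# step 1: draw the latent component label for each observation
z <- sample(1:3, size = n, replace = TRUE, prob = pi.hat)

# step 2: draw from the corresponding normal component (unit variance,
# as in the restricted model above)
x.sim <- rnorm(n, mean = mu.hat[z], sd = 1)

hist(x.sim, breaks = 30, freq = FALSE,
     main = "Data simulated from the fitted mixture")
```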
For simplicity, we assumed unit variance for all three mixture components – the three underlying clusters. However, for some observations it is not clear which cluster they are supposed to belong to. Note that the three clusters defined by the three underlying normal distributions are not disjoint – unlike the clusters produced by the distribution-free clustering algorithms.
The estimated mixture model – the estimated overall density – has several local maxima – one for each mixture component (sub-population). The local maxima are located at the estimated conditional means \(\widehat{\mu}_j \in \mathbb{R}\), for \(j = 1, 2, 3\). In other words, the estimated means are the empirical surrogates for \[ \mu_j = E[X | j], \quad j = 1, 2, 3, \] which stands for the expected consumption (miles per gallon) of a car from cluster \(j \in \{1, 2, 3\}\). On the other hand, for sub-populations which are “close to each other”, the final mixture model may well be unimodal (just one maximum). The non-unique assignment of some observations is also clear from the figure above: for instance, an automobile with a mileage (consumption) of 15 miles per gallon would intuitively be assigned to the sub-population of cars with 8 cylinders, while for another automobile with a mileage of 17.5 miles there would probably be some doubt whether the car should be assigned to the eight- or six-cylinder cars (the two component densities may appear to give the same value at this point; however, the posterior probabilities of inclusion in the first or the second sub-population are not the same).
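The posterior probabilities behind this reasoning follow from the Bayes rule, \(P(j \mid x_0) = \pi_j f_j(x_0) / \sum_l \pi_l f_l(x_0)\). A sketch with plug-in values (rounded sample means and proportions from mtcars, unit variances as in the simple homoscedastic model):

```r
# plug-in parameters: weights and conditional means for the
# 4-, 6-, and 8-cylinder sub-populations (rounded), unit variances
pi.hat <- c(11, 7, 14) / 32
mu.hat <- c(26.66, 19.74, 15.10)

# posterior probability of each sub-population given an observation x0
posterior <- function(x0, sd = 1) {
  f <- pi.hat * dnorm(x0, mu.hat, sd)
  f / sum(f)
}

round(posterior(15.0), 4)   # essentially all mass on the 8-cylinder cluster
round(posterior(17.5), 4)   # split between the 6- and 8-cylinder clusters
```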
In practical applications, the overall complexity of the mixture models used is even greater. For instance, the number of sub-populations can also be treated as an unknown parameter which must be estimated. Different parametric distributions (families) can be used to form the mixture sub-populations, or even non-parametric approaches can be applied. For parametric mixtures, maximum likelihood theory is usually used to estimate the mixture (thus, the underlying family of distributions must be imposed). The Expectation/Maximization (EM) algorithm is usually applied for more complex mixture models. Additional regularity assumptions are, however, always needed to obtain reasonable estimates.

Individual task
An alternative mixture model takes into account possible heteroscedasticity in the sub-populations (while still assuming three underlying mixture distributions).
How would you assign the car with the consumption of 17.5 miles per gallon now? Is it possible to take into account the heteroscedasticity – the different variability of the cars with eight and six cylinders? Can you use the estimated mixture to estimate the posterior probabilities?
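A possible starting point for this task (a sketch: the component standard deviations are simply estimated by the conditional sample standard deviations):

```r
# heteroscedastic Gaussian mixture: component-specific sds estimated
# within each cylinder group of mtcars
data(mtcars)
x   <- mtcars$mpg
cyl <- mtcars$cyl

pi.hat    <- prop.table(table(cyl))   # estimated weights
mu.hat    <- tapply(x, cyl, mean)     # conditional means
sigma.hat <- tapply(x, cyl, sd)       # conditional sds (heteroscedasticity)

# posterior probabilities under the heteroscedastic mixture
posterior <- function(x0) {
  f <- pi.hat * dnorm(x0, mu.hat, sigma.hat)
  f / sum(f)
}

round(posterior(17.5), 4)   # assignment for the 17.5 mpg car
```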
Recall that the sample proportions of the cars with four, six, and eight cylinders are:
2. General mixture models

A mixture model can actually contain sub-populations from different parametric families (which is also more common in practical applications). It is also possible to define a mixture model on a semi-parametric or even nonparametric basis. For parametric mixtures it is crucial to properly define the overall likelihood, which is later used to estimate the mixture model itself and the corresponding sub-populations. Semi-parametric and nonparametric mixtures are alternatively estimated by adapting the EM algorithm, for instance. Very common (regularization) assumptions require a given number of clusters (sub-populations) or, instead, the number of clusters may be unknown but some overall shape restrictions are formulated (e.g., unimodality, log-concavity, etc.). A relatively general class of mixture models can be fitted in R using the library mixtools. The most common function, which fits normal mixture models (with a predefined number of sub-populations) by means of the EM algorithm, is normalmixEM().
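To see what the EM algorithm does under the hood, here is a minimal base-R sketch for a univariate K-component normal mixture (a didactic simplification of what mixtools::normalmixEM automates; the initialization, fixed iteration count, and variance floor are our own choices):

```r
# Didactic EM for a univariate K-component Gaussian mixture
em.normal <- function(x, K = 3, n.iter = 200) {
  n <- length(x)
  # crude initialization: means at sample quantiles, common sd, equal weights
  mu    <- quantile(x, probs = (1:K) / (K + 1), names = FALSE)
  sigma <- rep(sd(x), K)
  lam   <- rep(1 / K, K)
  for (it in 1:n.iter) {
    # E-step: posterior membership probabilities (responsibilities)
    dens <- sapply(1:K, function(j) lam[j] * dnorm(x, mu[j], sigma[j]))
    w    <- dens / rowSums(dens)
    # M-step: weighted updates of weights, means, and standard deviations
    nk    <- colSums(w)
    lam   <- nk / n
    mu    <- colSums(w * x) / nk
    sigma <- sqrt(colSums(w * outer(x, mu, "-")^2) / nk)
    sigma <- pmax(sigma, 1e-3)   # guard against degenerate components
  }
  list(lambda = lam, mu = mu, sigma = sigma, posterior = w)
}

fit <- em.normal(mtcars$mpg, K = 3)
round(fit$mu, 2)      # estimated component means
round(fit$lambda, 2)  # estimated mixing proportions
```

Note that the likelihood of a normal mixture is unbounded (a component can collapse onto a single observation), which is why production implementations add safeguards and multiple restarts.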
Can you explain the differences that you observe between the figure above (EM algorithm) and one of the plots at the beginning (normal mixture with unit variances)? Compare the following outputs:
And also with the following:
Nonparametric mixture models can be fitted, for instance, with the function npEM() from the mixtools library:
Some other R libraries which allow one to fit mixture models are, for instance, mixR, mixture, rebmix, or FlexMix.

3. Mixture models in regression

Mixture models are also commonly applied in various regression models. For a brief illustration we refer to a standard GLM model (for Poisson counts, say, with the logarithmic link function). In real-life situations it often happens that the collected random sample contains observations which do not correspond well with the underlying theoretical model (e.g., many zeros – no cases – among the Poisson counts). For instance, we may be interested in modelling the number of positive Covid-19 cases (during a day, for instance, or within a specific geographical location). As long as the prevalence is relatively low, it is very likely that a random subject interviewed on the street is not infected (thus, no case). This results in a data sample which does not correspond with the theoretical Poisson model (too many zeros). This can, however, be improved by considering a relatively simple mixture model known as the “zero-inflated” Poisson model, which is a mixture of two components: \[ Y_i \,|\, \boldsymbol{x}_i \sim \left\{ \begin{array}{ll} 0 & \textrm{with probability } p \in (0,1);\\ Poiss(\lambda_i) & \textrm{with probability } 1 - p. \end{array} \right. \] The first component – a Dirac measure at zero – is used to model negative results (no cases) and takes care of the excessive number of zeros in the data. The second component – a standard Poisson model – is used to model the positive cases and their overall number in some given period of time or within some geographical location. Additional information may be recorded for each observation (so-called “subject-specific” covariates; note, however, that in this particular situation the “subject-specific” information does not refer to individual subjects).
Remember that each observation of the sample provides information summarized over some period of time or some geographical location (thus, over many subjects/patients/individuals). The subject-specific covariates may therefore reflect important information about the follow-up window, location parameters, etc. Considering the zero-inflated Poisson model, the probability of observing a “no case” can be expressed as \[P[Y_i = 0 | \boldsymbol{x}_i] = p + (1 - p) \cdot e^{- \lambda_i},\] while the probability of observing \(k > 0\) cases is given (by the theory of the Poisson distribution) as \[P[Y_i = k | \boldsymbol{x}_i] = (1 - p) \cdot \frac{\lambda_i^k}{k!}e^{- \lambda_i},\] where the Poisson intensity follows a standard GLM regression specification with the logarithmic link, \[\log \lambda_i = \boldsymbol{x}_i^\top \boldsymbol{\beta},\] with an unknown parameter vector \(\boldsymbol{\beta} \in \mathbb{R}^p\). The final mixture is again estimated by maximum likelihood. In the R software, such models can be fitted, for instance, with the function zeroinfl() from the pscl library.

Note

Zero-inflated mixture models are very common in mathematical statistics (see, for instance, the zero-inflated binomial model, the zero-inflated negative binomial model, or zero-inflated GLM models in general). From the overall point of view, this is a very rich and flexible class of regression models. Maximum likelihood theory allows for relatively simple estimation. For more theoretical details on zero-inflated models in statistics we refer to Zuur and Ieno (2016), Beginner’s Guide to Zero-Inflated Models with R.
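The maximum likelihood estimation of a zero-inflated Poisson model can be sketched in base R directly from the two probabilities above, maximizing the log-likelihood with optim() (the simulated data, seed, and parameter values below are our own illustrative choices; in practice one would use a ready-made implementation such as zeroinfl() from the pscl package):

```r
# Zero-inflated Poisson: simulate data and fit by maximum likelihood
set.seed(42)                      # illustrative seed
n  <- 1000
p  <- 0.30                        # true zero-inflation probability
x1 <- rnorm(n)                    # one subject-specific covariate
lambda <- exp(0.5 + 0.8 * x1)     # log link: log(lambda_i) = b0 + b1 * x1_i
y  <- ifelse(runif(n) < p, 0, rpois(n, lambda))

# negative log-likelihood; theta = (logit(p), b0, b1)
zip.nll <- function(theta) {
  pp  <- plogis(theta[1])
  lam <- exp(theta[2] + theta[3] * x1)
  ll  <- ifelse(y == 0,
                log(pp + (1 - pp) * exp(-lam)),          # P[Y = 0]
                log(1 - pp) + dpois(y, lam, log = TRUE)) # P[Y = k], k > 0
  -sum(ll)
}

fit <- optim(c(0, 0, 0), zip.nll, method = "BFGS")
c(p.hat = plogis(fit$par[1]), b0.hat = fit$par[2], b1.hat = fit$par[3])
```

The estimates should land close to the true values (0.30, 0.5, 0.8) for this sample size.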