NMST539 | Lab Session 4

Maximum Likelihood Estimation

(theoretical examples with applications in statistical tests)

Summer Term 2021/2022 | 22/03/22


Outline:

  • theoretical background for the maximum likelihood theory;
  • applications for multivariate parameter estimation;
  • matrix algebra for infinitesimal calculus;
  • theoretical examples – statistical tests;


Literature:
  • Härdle, W. and Simar, L.: Applied Multivariate Statistical Analysis. Springer, 2015.
  • Mardia, K., Kent, J., and Bibby, J.: Multivariate Analysis. Academic Press, 1979.




The maximum likelihood theory is a powerful tool for estimating an unknown population parameter \(\theta \in \Omega\) (or a vector of unknown parameters \(\boldsymbol{\theta} \in \Omega\), respectively). The main idea is to define a so-called likelihood function: for a random sample (in general of some dimension \(p \in \mathbb{N}\)), \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_{n}\), drawn from a distribution family given by the density function \(f(\boldsymbol{x}, \boldsymbol{\theta})\), we define the likelihood function as

\[ L(\boldsymbol{\theta}, \mathcal{X}) = \prod_{i = 1}^n f(\boldsymbol{X}_i, \boldsymbol{\theta}), \] which is a function of the unknown vector of parameters \(\boldsymbol{\theta} \in \Omega\) given the data \(\mathcal{X} = (\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_{n})^\top\). The corresponding estimate \(\widehat{\boldsymbol{\theta}}\) of \(\boldsymbol{\theta}\) is defined so that the likelihood function \(L(\boldsymbol{\theta}, \mathcal{X})\) is maximized, i.e.,

\[ \widehat{\boldsymbol{\theta}} = \mathop{Argmax}_{\boldsymbol{\theta} \in \Omega} L(\boldsymbol{\theta}, \mathcal{X}). \]

The maximum likelihood estimate \(\widehat{\boldsymbol{\theta}}\) of the unknown vector of parameters \(\boldsymbol{\theta} \in \Omega\) is, under some regularity assumptions, well known to have nice statistical properties; in particular, it holds that

\[ \sqrt{n} (\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \xrightarrow{\mathcal{D}} N_{p}(\boldsymbol{0}, \mathbb{I}_{p}^{-1}(\boldsymbol{\theta})), \quad \textrm{as} \quad n \to \infty, \] where \(\mathbb{I}_{p}(\boldsymbol{\theta})\) stands for the Fisher information matrix about the parameter vector \(\boldsymbol{\theta} \in \Omega\). Thus, the maximum likelihood estimate \(\widehat{\boldsymbol{\theta}}\) is asymptotically unbiased with an asymptotic normal distribution (\(p\)-dimensional in general) and, moreover, it is efficient (in the sense that the asymptotic variance-covariance matrix \(\mathbb{I}_{p}^{-1}(\boldsymbol{\theta})\) attains the Cramér-Rao lower bound).
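
For a quick numerical illustration of this asymptotic behaviour, consider a univariate sketch: for an exponential sample with mean \(\theta\), the maximum likelihood estimate is the sample mean and \(\mathbb{I}_{1}(\theta) = 1/\theta^2\), so \(\sqrt{n}(\widehat{\theta} - \theta)\) should be approximately \(N(0, \theta^2)\).

set.seed(539)
theta <- 2; n <- 500
thetaHat <- replicate(2000, mean(rexp(n, rate = 1 / theta)))  # MLE = sample mean
z <- sqrt(n) * (thetaHat - theta)
c(mean = mean(z), sd = sd(z), asympt.sd = theta)              # sd(z) should be close to theta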

1. Matrix algebra for multivariate infinitesimal calculus


For a multidimensional parameter \(\boldsymbol{\theta} = (\theta_1, \dots, \theta_p)^\top \in \Omega \subseteq \mathbb{R}^p\) it is quite convenient to use matrix notation for the calculations and, consequently, to use matrix derivatives correspondingly. Let us recall some standard formulae for matrix derivatives.

Verify on your own (using \(2 \times 2\) and \(3 \times 3\) symmetric and asymmetric matrices) that the provided formulae for the matrix derivatives are correct; a numerical check of the first formula in R is sketched right after the list.

  • Task 1: For some general square matrix \(\mathcal{X}\) of the type \(p \times p\) we have: \[\frac{\partial |\mathcal{X}|}{\partial \mathcal{X}} = \big(Adj(\mathcal{X})\big)^\top;\]

  • Task 2: For some symmetric square matrix \(\mathcal{X}\) of the type \(p \times p\) we have: \[\frac{\partial |\mathcal{X}|}{\partial \mathcal{X}} = 2 \big(Adj(\mathcal{X})\big) - Diag\big(Adj(\mathcal{X})\big);\]

  • Task 3: For some general square matrices \(\mathcal{X}\) and \(\mathbb{A}\) of the type \(p \times p\), we have: \[\frac{\partial tr(\mathcal{X} \mathbb{A})}{\partial \mathcal{X}} = \frac{\partial tr(\mathbb{A} \mathcal{X})}{\partial \mathcal{X}} = \mathbb{A}^\top;\]

  • Task 4: For some square matrices \(\mathcal{X}\) and \(\mathbb{A}\) of the type \(p \times p\), where both \(\mathcal{X}\) and \(\mathbb{A}\) are symmetric, we have: \[\frac{\partial tr(\mathcal{X} \mathbb{A})}{\partial \mathcal{X}} = \frac{\partial tr(\mathbb{A} \mathcal{X})}{\partial \mathcal{X}} = 2 \mathbb{A} - Diag(\mathbb{A});\]

  • Task 5: For some general square matrices \(\mathcal{X}\) and \(\mathbb{A}\) of the type \(p \times p\), we have: \[\frac{\partial tr(\mathcal{X}^{-1} \mathbb{A})}{\partial \mathcal{X}} = \frac{\partial tr(\mathbb{A} \mathcal{X}^{-1})}{\partial \mathcal{X}} = - \big(\mathcal{X}^{-1}\mathbb{A} \mathcal{X}^{-1}\big)^\top;\]

  • Task 6: For some matrices \(\mathbb{A}\) and \(\mathbb{B}\) of the types \(q \times p\) and \(p \times q\) and a square matrix \(\mathcal{X}\) of the type \(p \times p\), we have: \[\frac{\partial tr(\mathbb{A}\mathcal{X} \mathbb{B})}{\partial \mathcal{X}} = \frac{\partial tr(\mathbb{B} \mathbb{A} \mathcal{X})}{\partial \mathcal{X}} = \big(\mathbb{B}\mathbb{A}\big)^\top = \mathbb{A}^\top \mathbb{B}^\top;\]

  • Task 7: For a general square matrix \(\mathcal{X}\) of the type \(p \times p\) we have: \[\frac{\partial \ln |\mathcal{X}|}{\partial \mathcal{X}} = \frac{1}{det(\mathcal{X})} \big(Adj(\mathcal{X})\big)^\top = (\mathcal{X}^{-1})^\top \equiv \mathcal{X}^{{-\top}};\]

  • Task 8: For some symmetric square matrix \(\mathcal{X}\) of the type \(p \times p\) we have: \[\frac{\partial \ln |\mathcal{X}|}{\partial \mathcal{X}} = 2 \mathcal{X}^{-1} - Diag(\mathcal{X}^{-1});\]
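
A minimal numerical check of Task 1 (a sketch only; it assumes the CRAN package numDeriv for numerical differentiation, but any similar routine would do):

library(numDeriv)                     # grad() for numerical gradients

set.seed(539)
p <- 3
X <- matrix(rnorm(p^2), p, p)         # a general (asymmetric) matrix

# numerical gradient of the determinant, reshaped back into a p x p matrix
numGrad <- matrix(grad(function(x) det(matrix(x, p, p)), as.vector(X)), p, p)
adjX <- det(X) * solve(X)             # adjugate matrix: Adj(X) = det(X) * X^{-1}

max(abs(numGrad - t(adjX)))           # should be numerically zero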




The provided formulae are useful for deriving the maximum likelihood estimates for various random vector distributions. In particular, we focus on estimating the mean vector \(\boldsymbol{\mu} \in \mathbb{R}^p\) and the variance-covariance matrix \(\Sigma \in \mathbb{R}^{p \times p}\) (which is symmetric and positive definite) in a general multivariate normal distribution \(N_{p}(\boldsymbol{\mu}, \Sigma)\).

2. Maximum likelihood estimation for a multivariate parameter

Example 1

Consider a two-dimensional random sample \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_{n}\), for \(\boldsymbol{X}_{i} = (X_{i 1}, X_{i 2})^\top\), drawn from the bivariate distribution given by the density function \[ f_{\boldsymbol{\theta}}(x_{1}, x_{2}) = \frac{1}{\theta_1\theta_2} \cdot \exp\Big\{ -\Big( \frac{x_1}{\theta_1} + \frac{x_2}{\theta_2} \Big) \Big\} \cdot \mathbb{I}_{\{x_1 > 0\}} \mathbb{I}_{\{x_2 > 0\}}, \] where \(\boldsymbol{\theta} = (\theta_1, \theta_2)^\top \in (0, \infty) \times (0, \infty)\). Find the maximum likelihood estimate \(\widehat{\boldsymbol{\theta}}\) of the parameter vector \(\boldsymbol{\theta}\) and compute the Cramér-Rao lower bound.
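
As a hint for a quick check (not the full derivation asked for above): the density factorizes into two independent exponential densities, so the log-likelihood separates in \(\theta_1\) and \(\theta_2\), \[ \ell(\boldsymbol{\theta}) = - n \log \theta_1 - n \log \theta_2 - \frac{1}{\theta_1} \sum_{i = 1}^n X_{i1} - \frac{1}{\theta_2} \sum_{i = 1}^n X_{i2}, \] and coordinate-wise maximization yields \(\widehat{\theta}_j = \frac{1}{n}\sum_{i = 1}^n X_{ij}\), for \(j = 1, 2\).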

Example 2
Consider a random sample from the bivariate distribution given by the density function \[ f_{\boldsymbol{\theta}}(x_1, x_2) = \frac{1}{\theta_1^2 \theta_2} \frac{1}{x_2} \exp\Big\{ - \Big( \frac{x_1}{\theta_1 x_2} + \frac{x_2}{\theta_1 \theta_2} \Big) \Big\}, \] for \(x_1, x_2 > 0\) and zero otherwise. Find the maximum likelihood estimate for the vector of parameters \(\boldsymbol{\theta} = (\theta_1, \theta_2)^\top \in (0, \infty) \times (0, \infty)\) and calculate the Cramér-Rao lower bound for \(\widehat{\boldsymbol{\theta}}\).

Example 3
Let the random vector \(\boldsymbol{X}\) follow the multivariate normal distribution \(N_{p}(\boldsymbol{\mu}, \Sigma)\), where \(\boldsymbol{\mu} \in \mathbb{R}^p\) and \(\Sigma = Diag\{\sigma_{11}, \dots, \sigma_{pp}\}\), for some \(\sigma_{jj} > 0\), for any \(j = 1, \dots, p\). Let \(\boldsymbol{X}_{1}, \dots, \boldsymbol{X}_{n}\) be a random sample drawn from this distribution. Use the given sample to find the maximum likelihood estimates of the mean vector \(\boldsymbol{\mu}\) and the variance matrix \(\Sigma\). Derive also the Cramér-Rao lower bound for the parameter vector \(\boldsymbol{\theta} = (\mu_1, \dots, \mu_p, \sigma_{11}, \dots, \sigma_{pp})^\top\).
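
Since the coordinates are independent under a diagonal \(\Sigma\), the log-likelihood again separates, and the estimates to be derived take the familiar componentwise form \[ \widehat{\mu}_j = \frac{1}{n}\sum_{i = 1}^n X_{ij}, \qquad \widehat{\sigma}_{jj} = \frac{1}{n}\sum_{i = 1}^n \big(X_{ij} - \widehat{\mu}_j\big)^2, \qquad j = 1, \dots, p. \]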

Example 4
Consider a multivariate random sample \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_n\) from the multivariate normal distribution \(N_{p}(\boldsymbol{\mu}, \Sigma)\), for some mean vector \(\boldsymbol{\mu} \in \mathbb{R}^p\) and some symmetric and positive definite matrix \(\Sigma \in \mathbb{R}^{p \times p}\). Find the maximum likelihood estimates of the unknown mean vector \(\boldsymbol{\mu}\) and the variance-covariance matrix \(\Sigma\).
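
The closed-form answers are classical (\(\widehat{\boldsymbol{\mu}} = \overline{\boldsymbol{X}}_n\) and \(\widehat{\Sigma} = \frac{1}{n}\sum_{i = 1}^n (\boldsymbol{X}_i - \overline{\boldsymbol{X}}_n)(\boldsymbol{X}_i - \overline{\boldsymbol{X}}_n)^\top\)); the following is a small numerical cross-check in R (a sketch only; the data are simulated, and the parametrization of \(\Sigma\) through its Cholesky factor is just one convenient choice):

library(mvtnorm)

set.seed(539)
n <- 200; p <- 2
X <- rmvnorm(n, mean = c(1, 2), sigma = matrix(c(1, 0.5, 0.5, 2), 2, 2))

muHat    <- colMeans(X)               # closed-form MLE of the mean vector
SigmaHat <- (n - 1) / n * cov(X)      # closed-form MLE of the covariance matrix

# direct numerical maximization over mu and the Cholesky factor of Sigma
negLogLik <- function(par) {
  L <- diag(exp(par[(p + 1):(2 * p)]))      # positive diagonal of the factor
  L[lower.tri(L)] <- par[-(1:(2 * p))]
  -sum(dmvnorm(X, mean = par[1:p], sigma = L %*% t(L), log = TRUE))
}
opt <- optim(rep(0, 2 * p + p * (p - 1) / 2), negLogLik, method = "BFGS")
rbind(numerical = opt$par[1:p], closed.form = muHat)   # the two should agree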



3. Statistical tests based on likelihood ratio

Example 1

Consider a random sample \(\boldsymbol{X}_1, \dots, \boldsymbol{X}_n\) from a multivariate normal distribution \(N_{p}(\boldsymbol{\mu}, \Sigma)\), where \(\boldsymbol{\mu} \in \mathbb{R}^p\) is some unknown mean vector and \(\Sigma\) is an unknown variance-covariance matrix. For a specific vector \(\boldsymbol{\mu}_0 \in \mathbb{R}^p\) we would like to test the null hypothesis

\[ H_{0}: \boldsymbol{\mu} = \boldsymbol{\mu}_{0}, \] against a general alternative

\[ H_{1}: \boldsymbol{\mu} \neq \boldsymbol{\mu}_{0}. \]

  • Show that, under the null hypothesis, the maximized log-likelihood equals \(\ell_{0} = \ell\big(\mathcal{X}, \boldsymbol{\mu}_0, \mathcal{S} + (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_0)(\overline{\boldsymbol{X}}_{n} - \boldsymbol{\mu}_0)^\top\big)\), i.e., the constrained maximum likelihood estimate of \(\Sigma\) is \(\mathcal{S} + (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_0)(\overline{\boldsymbol{X}}_{n} - \boldsymbol{\mu}_0)^\top\).

  • Show that the corresponding test statistic based on the likelihood ratio is \[ - 2 \log \lambda = 2 (\ell_1 - \ell_0) = n \log\Big(1 + (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0}) ^\top \mathcal{S}^{-1} (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0})\Big), \] where \(\ell_1\) denotes the unconstrained maximum of the log-likelihood.

The test statistic depends on \((\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0})^\top \mathcal{S}^{-1} (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0})\) and, moreover, under the null hypothesis it holds that

\[ (n - 1)(\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0}) ^\top \mathcal{S}^{-1} (\overline{\boldsymbol{X}}_n - \boldsymbol{\mu}_{0}) \sim T_{p, n - 1}^2. \] Using the Hotelling \(T^2\) distribution and its relationship with the Fisher F distribution (see also the previous lab session), it is straightforward to perform the test in R; a sketch is given below. Example 2 afterwards shows a more general version.
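
A minimal sketch of this one-sample test in R (the simulated data and \(\boldsymbol{\mu}_0\) are illustrative; \(\mathcal{S}\) denotes the empirical covariance matrix with the \(1/n\) divisor, and \(\frac{n - p}{p(n - 1)} T^2 \sim F_{p, n - p}\) holds under the null hypothesis):

library(mvtnorm)

set.seed(539)
n <- 100; p <- 2
X   <- rmvnorm(n, mean = c(1, 2), sigma = matrix(c(1, 0.5, 0.5, 2), 2, 2))
mu0 <- c(1, 2)                        # H0: mu = mu0 (true for these data)

d  <- colMeans(X) - mu0
S  <- (n - 1) / n * cov(X)            # empirical covariance (1/n divisor)
T2 <- (n - 1) * drop(t(d) %*% solve(S) %*% d)

Fstat <- (n - p) / (p * (n - 1)) * T2 # F distributed under the null hypothesis
1 - pf(Fstat, p, n - p)               # p-value of the test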



Example 2
Simulate a random sample from the bivariate normal distribution \(N_{2}\Big(\Big(\begin{array}{c}1\\2\end{array}\Big), \Big( \begin{array}{cc}1 & 0.5\\0.5 & 2 \end{array} \Big)\Big)\), and test the null hypothesis \[ H_{0}: 2\mu_{1} - \mu_2 = 0.\]

Consider two situations: either the variance-covariance matrix \(\Sigma\) is known, or \(\Sigma\) is unknown. Compare the results.

library("mvtnorm")
n <- 100
mu <-  c(1, 2)
Sigma <- matrix(c(1, 0.5, 0.5, 2),2,2)

set.seed(1234)
sample <- rmvnorm(n, mu, Sigma)

The null hypothesis can also be expressed as \[ H_0: \mathbb{A}\boldsymbol{\mu} = a, \] where \(\boldsymbol{\mu} = (\mu_1, \mu_2)^\top = (1, 2)^\top\), \(\mathbb{A} = (2, -1)\), and \(a = 0\). Under the null hypothesis we therefore have that \[ T_{n} = n(\mathbb{A}\overline{\boldsymbol{X}}_{n} - a)^\top(\mathbb{A}\Sigma \mathbb{A}^\top)^{-1} (\mathbb{A} \overline{\boldsymbol{X}}_{n} - a) \sim \chi_{1}^{2}, \] and the null hypothesis is rejected for large values of the test statistic \(T_{n}\), namely if \(T_{n}\) exceeds the critical value, the \((1 - \alpha)\) quantile of the \(\chi^2\) distribution with one degree of freedom, for some given significance level \(\alpha \in (0, 1)\).

The value of the test statistic is

a <- 0
A <- matrix(c(2, -1), nrow = 1)       # H0: A mu = a
xbar <- colMeans(sample)              # sample mean vector
Tvalue <- n * t(A %*% xbar - a) %*% solve(A %*% Sigma %*% t(A)) %*% (A %*% xbar - a)

which should be compared with the corresponding quantile \(\chi_1^2(0.95) = 3.8415\). The null hypothesis is, therefore, not rejected (the value of the test statistic \(T_n\) is roughly 0.37).

However, if the variance-covariance matrix \(\Sigma\) is not known, the test statistic \(T_{n}\) cannot be calculated and, instead of \(\Sigma\), the corresponding estimate, the sample variance-covariance matrix, has to be plugged in. Then we obtain that

\[ \tilde{T}_{n} = (n - 1)(\mathbb{A}\overline{\boldsymbol{X}}_{n} - a)^\top(\mathbb{A}\mathcal{S} \mathbb{A}^\top)^{-1} (\mathbb{A} \overline{\boldsymbol{X}}_{n} - a) \sim T_{1, n - 1}^2. \]

S <- (n - 1) / n * cov(sample)        # empirical covariance matrix (1/n divisor)
TTvalue <- (n - 1) * t(A %*% xbar - a) %*% solve(A %*% S %*% t(A)) %*% (A %*% xbar - a)

which should now be compared with the corresponding quantile of the Fisher F distribution, \(F_{1, 99}(0.95) = 3.9371\). Both tests perform quite similarly, which is common especially for larger sample sizes. On the other hand, for small \(n \in \mathbb{N}\) the results of the two tests can differ considerably, as the second test also has to account for the uncertainty coming from the estimation of the variance matrix.

Individual task


  • Consider a bivariate random sample of size \(n = 10\) from the distribution \(N_{2}(\boldsymbol{\mu}, \Sigma)\), for some unknown mean vector \(\boldsymbol{\mu} \in \mathbb{R}^2\) and the variance-covariance matrix \(\Sigma = \Big( \begin{array}{cc}2 & -1\\ -1 & 2 \end{array}\Big)\). Assume that the sample mean vector \(\overline{\boldsymbol{X}}_{n}\) is equal to \((1, \frac{1}{2})^\top\). For the significance level \(\alpha = 0.05\) find the tests (critical regions and the corresponding decision rules) for the following null and alternative hypotheses:

    • \[ H_0: \boldsymbol{\mu} = \Big(\begin{array}{c} 2\\ 2/3\end{array}\Big) \quad \quad \quad H_1: \boldsymbol{\mu} \neq \Big(\begin{array}{c} 2\\ 2/3\end{array}\Big) \]
    • \[ H_0: \mu_1 + \mu_2 = \frac{7}{2} \quad \quad \quad H_1: \mu_1 + \mu_2 \neq \frac{7}{2} \]
    • \[ H_0: \mu_1 - \mu_2 = \frac{1}{2} \quad \quad \quad H_1: \mu_1 - \mu_2 \neq \frac{1}{2} \]

  • Repeat the previous tests, assuming now that the variance-covariance matrix is unknown and its empirical estimate, the sample variance-covariance matrix \(\mathcal{S}\), is equal to \(\Big( \begin{array}{cc}2 & -1\\ -1 & 2 \end{array}\Big)\). Compare the results with the scenarios where \(\Sigma\) is known.






Homework Assignment

(Deadline: 29.03.2022)

Design a small simulation study to investigate the empirical performance of the two tests mentioned above.
  • Firstly, simulate the data under the null hypothesis and perform both tests. Find out what the empirical probability of the type I error is (i.e., the overall proportion of rejections of the null hypothesis when the null hypothesis holds).
  • Then simulate data under the alternative hypothesis and investigate the empirical power of the test (i.e., the empirical probability of rejecting the null hypothesis when it does not hold). Use different sample sizes and different alternatives to get some meaningful insight into the performance of both tests.
  • Provide some graphical outputs and post the figures (together with some short comments) on your website no later than Monday, 28.03.2022, 23:59. A simulation skeleton is sketched below.
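
A minimal skeleton to start from (a sketch only; the hypothesis \(H_0: 2\mu_1 - \mu_2 = 0\), the parameter values, and the number of replications are illustrative and should be adapted to your own design):

library(mvtnorm)

one.replication <- function(n, mu, Sigma, A, a, alpha = 0.05) {
  X <- rmvnorm(n, mu, Sigma)
  q <- nrow(A)                        # number of linear restrictions
  d <- A %*% colMeans(X) - a
  T1 <- n * drop(t(d) %*% solve(A %*% Sigma %*% t(A)) %*% d)   # known Sigma
  T2 <- n * drop(t(d) %*% solve(A %*% cov(X) %*% t(A)) %*% d)  # estimated Sigma
  c(chi2 = T1 > qchisq(1 - alpha, q),
    F    = (n - q) / (q * (n - 1)) * T2 > qf(1 - alpha, q, n - q))
}

# empirical type I error: data generated under H0: 2 mu1 - mu2 = 0
A <- matrix(c(2, -1), nrow = 1); a <- 0
rejections <- replicate(1000, one.replication(100, c(1, 2),
                        matrix(c(1, 0.5, 0.5, 2), 2, 2), A, a))
rowMeans(rejections)                  # both proportions should be close to alpha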