The goal of this exercise is to introduce alternative MCMC samplers to JAGS. We will demonstrate their use on the following example.

AR(1)-GARCH(1,1) model

Consider a time series \[ Y_t = a_0 + a_1 Y_{t-1} +\varepsilon _t, \quad t = 1, \ldots, T, \] where \(\varepsilon_t | Y_1, \ldots, Y_{t-1} \sim \mathsf{N}(0, \sigma_t^2)\) and \[ \sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2. \] The parameters \(a_0, a_1 \in \mathbb{R}\) are unrestricted; however, \(\alpha_0, \alpha_1, \beta_1 \geq 0\) (so that the variance \(\sigma_t^2 \geq 0\)) and \(\alpha_1 + \beta_1 < 1\) (stationarity condition). The model is called \(\mathsf{AR}(1)\)-\(\mathsf{GARCH}(1,1)\), since the mean follows an Auto-Regressive process of order 1 and the variance follows a Generalized Auto-Regressive Conditional Heteroscedasticity process of orders 1 and 1.
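As an illustration, we can simulate one path from the model; the parameter values and the series length below are chosen arbitrarily, only to make the recursions concrete.

set.seed(1)
Tobs <- 500                                     # illustrative series length
a0 <- 0; a1 <- 0.01                             # illustrative AR(1) parameters
alpha0 <- 1e-5; alpha1 <- 0.1; beta1 <- 0.8     # illustrative GARCH(1,1) parameters
y <- eps <- sigma2 <- numeric(Tobs)
y_prev <- 0; eps_prev <- 0                      # Y_0 = 0, epsilon_0 = 0
sigma2_prev <- alpha0/(1 - alpha1 - beta1)      # start at the unconditional variance
for(t in 1:Tobs){
  sigma2[t] <- alpha0 + alpha1*eps_prev^2 + beta1*sigma2_prev
  eps[t] <- rnorm(1, mean = 0, sd = sqrt(sigma2[t]))
  y[t] <- a0 + a1*y_prev + eps[t]
  y_prev <- y[t]; eps_prev <- eps[t]; sigma2_prev <- sigma2[t]
}
plot(y, type = "l")   # clusters of high and low volatility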

The model is used for modelling log-returns of assets in financial markets when the volatility changes over time and exhibits clustering (large price changes tend to be followed by large changes, and small changes by small changes). We will work with the German DAX index data available in R:

data(EuStockMarkets)
dax <- diff(log(EuStockMarkets[,1]))

# Rolling ("local") standard deviation over windows of 2*step + 1 observations
LokSD <- function(x, step = 20){
  RET <- numeric(length(x) - 2*step)
  for(i in 1:(length(x) - 2*step)){
    RET[i] <- sd(x[i:(i + 2*step)])
  }
  return(RET)
}
step <- 15
loksds <- LokSD(dax, step)

par(mfrow = c(1,1), mar = c(4,4.2,0.5,0.5))
plot(dax, col = "slategrey", 
     xlim = c(1991.6, 2001), xaxt = "n", 
     ylim = c(-0.06, 0.06), ylab = expression(log(DAX[t]/DAX[t-1])))
axis(1, at = c(1992:1999), labels = c(1992:1999), las = 1)
lines(loksds ~ time(dax)[(step+1):(length(dax)-step)], col = "red")
lines(-loksds ~ time(dax)[(step+1):(length(dax)-step)], col = "red")
abline(h = 0, col = "darkviolet", lty = 2)
abline(v = 1999, col = "darkviolet", lty = 2)
dens <- density(dax)
koef <- 0.04
lines(x = 1999 + dens$y*koef, y = dens$x, col = "skyblue", lwd = 2)
lines(x = 1999 + koef*dnorm(dens$x, mean(dax), sd(dax)), y = dens$x, col = "orange", lty = 2)
legend("topright", c(expression(N(hat(mu),hat(sigma) ^2)),"Density"),
       col = c("orange","skyblue"), lty = c(2,1), lwd = c(1,2), bty = "n")
legend("bottomright", c("DAX", "Local SD"),
       col = c("slategrey","red"), lty = 1, lwd = 1, bty = "n")

Let us denote by \(\boldsymbol{\theta} = (a_0, a_1, \alpha_0, \alpha_1, \beta_1)^\top\) the vector of unknown parameters. By the chain rule, the likelihood of this model is \[ p(\mathbf{Y} \,|\, \boldsymbol{\theta}) = p(Y_1 | Y_0; \boldsymbol{\theta}) \, p(Y_2 | Y_1, Y_0; \boldsymbol{\theta}) \, \cdots \, p(Y_T | Y_{T-1}, \ldots, Y_0; \boldsymbol{\theta}) = \prod\limits_{t=1}^T (2\pi \sigma_t^2)^{-\frac{1}{2}} \exp \left\{- \frac{\varepsilon_t^2}{2\sigma_t^2}\right\}, \] given \(Y_0 = 0\), \(\varepsilon_0 = 0\) and \(\sigma_0^2 = 1\) (each factor conditions on the whole past, because \(\sigma_t^2\) depends recursively on all previous observations). The log-likelihood is also very simple to evaluate, \[ \ell(\mathbf{Y} | \boldsymbol{\theta}) = - \frac{T}{2}\log(2\pi) - \frac{1}{2} \sum\limits_{t=1}^T \log(\sigma_t^2) - \frac{1}{2} \sum\limits_{t=1}^T \frac{\varepsilon_t^2}{\sigma_t^2}, \] where \(\varepsilon_t = Y_t - a_0 - a_1 Y_{t-1}\) and \(\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2\) are used.

From experience with modelling financial markets we know that it is reasonable to assume the following independent prior distributions: \[ a_0 \sim \mathsf{N}(0,3), \quad a_1 \sim \mathsf{N}(0,3), \quad \alpha_0' = \log \alpha_0 \sim \mathsf{N}(-12.3,5), \quad \alpha_1' = \log \alpha_1 \sim \mathsf{N}(-2,5), \quad \beta_1' = \log \beta_1 \sim \mathsf{N}(-0.2,5), \] which yields an unrestricted parametrization through \(\boldsymbol{\theta}' = (a_0, a_1, \alpha_0', \alpha_1', \beta_1')^\top \in \mathbb{R}^5\). To fulfill the stationarity condition \(\alpha_1 + \beta_1 = \exp\{\alpha_1'\} + \exp\{\beta_1'\} < 1\) we include an indicator function in the prior density: \[ p(\boldsymbol{\theta}') \propto p(a_0) \, p(a_1) \, p(\alpha_0') \, p(\alpha_1') \, p(\beta_1') \, \mathbb{I}_{\mathcal{S}}(\alpha_1', \beta_1'), \] where \(\mathcal{S} = \left\{ (s_1, s_2)^\top\in \mathbb{R}^2: e^{s_1} + e^{s_2} < 1 \right\}.\)
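As a quick check, we can estimate by simulation how much prior mass the unrestricted normal priors place on the stationarity set \(\mathcal{S}\) (a simple Monte Carlo sketch; the variable names are arbitrary):

set.seed(1)
prior_la1 <- rnorm(1e5, mean = -2, sd = sqrt(5))     # draws of alpha_1' (variance 5)
prior_lb1 <- rnorm(1e5, mean = -0.2, sd = sqrt(5))   # draws of beta_1' (variance 5)
mean(exp(prior_la1) + exp(prior_lb1) < 1)            # estimated prior probability of S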

Unfortunately, the full-conditional distributions of the model parameters do not belong to any standard family and are difficult to work with. Hence, Gibbs sampling is very difficult to perform.

Metropolis-Hastings algorithm

We will then try to use the Metropolis-Hastings algorithm to

  1. sample a candidate \(\boldsymbol{\psi}\) for the new value of the model parameters \(\boldsymbol{\theta}_{m+1}'\) using a random walk: \(\boldsymbol{\psi} \sim \mathsf{N}_5(\boldsymbol{\theta}_m', \Sigma)\);
  2. accept the proposed value with probability \[ \alpha(\boldsymbol{\theta}_m', \boldsymbol{\psi}) = \min \left\{1, \, \exp\{\ell(\boldsymbol{\psi} \,|\, \mathbf{Y}) - \ell(\boldsymbol{\theta}_m' \,|\, \mathbf{Y})\}\right\}, \] where the posterior log-density of \(\boldsymbol{\theta}'\) is \[ \ell(\boldsymbol{\theta}' \,|\, \mathbf{Y}) = \mathrm{const.} + \log p(\boldsymbol{\theta}') - \frac{1}{2} \sum\limits_{t=1}^T \log(\sigma_t^2) - \frac{1}{2} \sum\limits_{t=1}^T \frac{\varepsilon_t^2}{\sigma_t^2}. \]

If the proposed value \(\boldsymbol{\psi}\) does not fulfill the stationarity condition, it cannot be accepted, since \(\alpha(\boldsymbol{\theta}_m', \boldsymbol{\psi}) = 0\), and \(\boldsymbol{\theta}_{m+1}' := \boldsymbol{\theta}_m'\). If the proposed value \(\boldsymbol{\psi}\) yields a higher posterior log-density, it is accepted with certainty, since \(\alpha(\boldsymbol{\theta}_m', \boldsymbol{\psi}) = 1\), and \(\boldsymbol{\theta}_{m+1}' := \boldsymbol{\psi}\).

Implementation in R

The Metropolis-Hastings algorithm has to be implemented manually. We start with a function evaluating the posterior log-density:

logposterior <- function(y, epsilon0 = 0, sigma20 = 1, y0 = 0,
                         state = c(a0 = 0, a1 = 0, lalpha0 = -12.3, lalpha1 = -2, lbeta1 = -0.2)){
  n <- length(y)
  alpha0 <- exp(state["lalpha0"])
  alpha1 <- exp(state["lalpha1"])
  beta1 <- exp(state["lbeta1"])
  if(alpha1 + beta1 < 1){
    # AR(1) residuals and GARCH(1,1) variance recursion
    epsilon <- y - state["a0"] - state["a1"] * c(y0, y[1:(n-1)])
    sigma2 <- numeric(n)
    sigma2[1] <- alpha0 + alpha1*epsilon0^2 + beta1*sigma20  # sigma20 is already the variance sigma_0^2
    for(t in 1:(n-1)){
      sigma2[t+1] <- alpha0 + alpha1*epsilon[t]^2 + beta1*sigma2[t]
    }
    # log-likelihood
    loglik <- sum(-0.5*log(2*pi*sigma2) - 0.5*(epsilon)^2/sigma2)
    # add log-prior densities (up to constants): N(0,3), N(0,3), N(-12.3,5), N(-2,5), N(-0.2,5)
    loglik <- loglik - state["a0"]^2/6 - state["a1"]^2/6 - 
      (state["lalpha0"]+12.3)^2/10 - (state["lalpha1"]+2)^2/10 - (state["lbeta1"]+0.2)^2/10
  }else{
    # stationarity condition violated: zero prior density
    loglik <- -Inf
  }
  
  return(loglik)
}
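
As a quick sanity check, the function can be evaluated at its default state, which is also the starting value used below; this yields the value of about 5927 quoted later in the text.

logposterior(dax)   # posterior log-density at the default (starting) state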

Let the covariance matrix \(\Sigma\), which defines the distribution of the random-walk step, be given. Then the algorithm can be written as follows:

library("mvtnorm") # for rmvnorm()

ARGARCH_MHalg <- function(y, M, Sigma, start, nprop = 50){
  if(missing(start)){
    start <- c(a0 = 0, a1 = 0, lalpha0 = -12.3, lalpha1 = -2, lbeta1 = -0.2)
  }
  chain <- proposal <- data.frame(a0 = numeric(M), a1 = numeric(M), lalpha0 = numeric(M),
                                  lalpha1 = numeric(M), lbeta1 = numeric(M))
  
  prob <- accepted <- logproposal <- numeric(M)
  laststate <- start
  loglast <- logposterior(y, state = laststate)
      
  for(m in 1:M){
    # propose a new state via a random-walk step
    step <- rmvnorm(1, sigma = Sigma)
    proposal[m,] <- step + laststate
    logproposal[m] <- logposterior(y, state = unlist(proposal[m,]))
    
    # accept with probability min{1, exp(difference in posterior log-density)}
    prob[m] <- min(1, exp(logproposal[m] - loglast))
    accepted[m] <- rbinom(1, 1, prob[m])
    if(accepted[m]){
      laststate <- chain[m,] <- proposal[m,]
      loglast <- logproposal[m]
    }else{
      # rejected: the chain stays at the current state
      chain[m,] <- laststate
    }
  }
  
  # Where could the next proposals take us? (used later to visualize candidate moves)
  next_prop <- as.data.frame(rmvnorm(nprop, sigma = Sigma) + 
                               matrix(rep(unlist(laststate), nprop), nrow = nprop, byrow = T))
  colnames(next_prop) <- names(start)
  next_logp <- apply(next_prop, 1, function(theta){logposterior(y, state = unlist(theta))})
  next_prob <- pmin(1, exp(next_logp - loglast))
  
  return(list(chain = chain,
              start = start,
              proposal = proposal,
              logproposal = logproposal,
              prob = prob,
              accepted = accepted,
              laststate = laststate,
              loglast = loglast,
              next_prop = next_prop,
              next_prob = next_prob))
}

For example, let \(\Sigma\) be the same as in the prior distribution:

Sigma <- diag(c(3,3,5,5,5))
M <- 1000
start <- c(a0 = 0, a1 = 0, lalpha0 = -12.3, lalpha1 = -2, lbeta1 = -0.2)

set.seed(123456789)
mcmc <- ARGARCH_MHalg(dax, M, Sigma, start)

Let us examine the trajectory of the sampled MCMC:

traceplots <- function(mcmc, B = 0, M = nrow(mcmc$chain), thin = 1, COL = "slategrey"){
  LABS <- c("a0", "a1", "log(alpha0)", "log(alpha1)", "log(beta1)")
  names(LABS) <- c("a0", "a1", "lalpha0", "lalpha1", "lbeta1")
  inds <- seq(B+1, M, by = thin)
  
  par(mfrow = c(2,3), mar = c(4,4,0.8,0.8))
  
  for(name in c("a0", "a1", "lalpha0", "lalpha1", "lbeta1")){
    plot(mcmc$chain[inds,name] ~ inds, 
         ylab = LABS[name], xlab = "Index", type= "l")
  }
  
  # probability acceptance plot
  plot(cumsum(mcmc$accepted[inds])/1:length(inds) ~ inds, type = "l", 
       xlab = "Index", ylab = "Probability", ylim = c(0,1), col = COL, lwd = 1)
  legend("topright", paste0(round(mean(mcmc$accepted), digits = 4)*100, " %"),
         col = COL, lty = 1, bty = "n", title = "Acceptance rate")
}

traceplots(mcmc)

None of the proposed states was accepted! Even though the covariance matrix \(\Sigma\) can theoretically be chosen almost arbitrarily for Metropolis-Hastings to work, in practice the choice is crucial for obtaining a reasonable MCMC approximation. The posterior log-density at the initial state is 5927, while for the proposed values we obtain the following summary:

sum(mcmc$logproposal == -Inf)
## [1] 645
summary(mcmc$logproposal[is.finite(mcmc$logproposal)])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1116794   -31502    -8917   -47254    -4506     3244

the difference between the initial state and the best proposal is \(5927 - 3244 = 2683\); thus, the highest observed acceptance probability is \(\exp(-2683)\), which is numerically zero.
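Indeed, this probability underflows to exactly zero in double precision:

exp(-2683)
## [1] 0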

Choice of proposal distribution

In order to increase the acceptance rate we need a much more appropriate matrix \(\Sigma\). The difference in posterior log-density between the current state and the proposal should be of order one, so that its exponential yields an acceptance probability with which we can actually succeed. This is achieved when the proposal distribution resembles the posterior. The problem is that we do not know what the posterior looks like; that is why we use MCMC in the first place.

The problem with our previous choice of \(\Sigma\) was that it proposed steps that were far too large (the prior is flat compared with the posterior). We need to reduce the scales appropriately.

diagS <- diag(c(0.005, 0.1, 0.5, 0.5, 0.05))
corrS <- diag(5)
Sigma <- diagS %*% corrS %*% diagS
mcmc <- ARGARCH_MHalg(dax, M, Sigma, start)
traceplots(mcmc)

image_partial_posterior <- function(data,
                                    mcmc, 
                                    vars = c("lalpha1", "lbeta1"),
                                    XLIM = c(-4, -0.02),
                                    YLIM = c(-0.7, -0.02),
                                    fixvalues = c(a0=0.000795, a1=0.00730, lalpha0=-11.98,
                                                  lalpha1=-2.05, lbeta1=-0.211, alpha0=exp(-11.98), 
                                                  alpha1=exp(-2.05), beta1 = exp(-0.211)),
                                    grid.size = 30,
                                    Ntercol = 100,
                                    add_trace = F,
                                    add_where_next = F){
  grid1 <- seq(from = XLIM[1], to = XLIM[2], length.out = grid.size)
  grid2 <- seq(from = YLIM[1], to = YLIM[2], length.out = grid.size)
  griddif1 <- (max(grid1)-min(grid1))/(grid.size-1)
  griddif2 <- (max(grid2)-min(grid2))/(grid.size-1)
  
  Z <- sapply(grid2, function(y){
    sapply(grid1, function(x){
      auxstate <- fixvalues
      auxstate[vars] <- c(x, y)
      logposterior(data, state = auxstate)
    })
  })
  Z[is.infinite(Z)] <- min(Z[is.finite(Z)]) - 1
  
  image(grid1, grid2, Z, col = terrain.colors(Ntercol), 
        xlab = vars[1], ylab = vars[2])
  contour(grid1, grid2, z = Z, add = T, drawlabels = T)
  
  # stationarity condition boundary for lalpha1 and lbeta1
  if(is.element("lalpha1", vars) & is.element("lbeta1", vars)){
    lines(x = grid1, y = log(1-exp(grid1)), col = "darkviolet", lty = 2, lwd = 2)
  }
  # other cases where only one of these two variables is plotted
  if(vars[1] == "lalpha1" & vars[2] != "lbeta1"){
    abline(v = log(1-exp(fixvalues["lbeta1"])), col = "darkviolet", lty = 2, lwd = 2)
  }
  if(vars[2] == "lalpha1" & vars[1] != "lbeta1"){
    abline(h = log(1-exp(fixvalues["lbeta1"])), col = "darkviolet", lty = 2, lwd = 2)
  }
  if(vars[1] != "lalpha1" & vars[2] == "lbeta1"){
    abline(h = log(1-exp(fixvalues["lalpha1"])), col = "darkviolet", lty = 2, lwd = 2)
  }
  if(vars[2] != "lalpha1" & vars[1] == "lbeta1"){
    abline(v = log(1-exp(fixvalues["lalpha1"])), col = "darkviolet", lty = 2, lwd = 2)
  }
  
  if(add_trace){
    states <- rbind(mcmc$start[vars],
                    mcmc$chain[as.logical(mcmc$accepted), vars])
    numstays <- diff(c(0, which(as.logical(mcmc$accepted))))
    fnumstays <- cut(numstays, breaks = c(0, 1, 3, 5, 10, 20, Inf),
                     labels = c("1", "2-3", "4-5", "6-10", "11-20", ">20"))
    CEX <- seq(0.4, 2, length.out = nlevels(fnumstays))
    names(CEX) <- levels(fnumstays)
    
    lines(states, type = "b", col = "black", lwd = 1, lty = 1, 
          cex = CEX[fnumstays], pch = 21, bg = "slategrey")
  }
  
  if(add_where_next){
    arrows(x0 = as.numeric(mcmc$laststate[vars[1]]),
           y0 = as.numeric(mcmc$laststate[vars[2]]),
           x1 = mcmc$next_prop[,vars[1]],
           y1 = mcmc$next_prop[,vars[2]],
           col = "orchid", lty = 2)
  }
  
}
par(mfrow = c(1,3), mar = c(4,4,1,0.5))
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("a0", "a1"),
                        XLIM = c(-0.01,0.015), YLIM = c(-1,1),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha0", "lalpha1"),
                        XLIM = c(-13,-11), YLIM = c(-3,-1),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha1", "lbeta1"),
                        XLIM = c(-3,-1), YLIM = c(-0.5,-0.05),
                        add_trace = T, add_where_next = T)

Finally, a couple of states were accepted! But this is still not enough.

diagS <- diag(c(0.001, 0.1, 0.1, 0.1, 0.01))
corrS <- diag(5)
Sigma <- diagS %*% corrS %*% diagS
mcmc <- ARGARCH_MHalg(dax, M, Sigma, start)
traceplots(mcmc)

par(mfrow = c(1,3), mar = c(4,4,1,0.5))
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("a0", "a1"),
                        XLIM = c(-0.001,0.0028), YLIM = c(-0.35,0.35),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha0", "lalpha1"),
                        XLIM = c(-12.5,-11.5), YLIM = c(-2.4,-1.7),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha1", "lbeta1"),
                        XLIM = c(-2.4,-1.7), YLIM = c(-0.245,-0.175),
                        add_trace = T, add_where_next = T)

Better, but still not great.

diagS <- diag(c(0.0005, 0.05, 0.15, 0.08, 0.012))
corrS <- diag(5)
Sigma <- diagS %*% corrS %*% diagS
mcmc <- ARGARCH_MHalg(dax, M, Sigma, start)
traceplots(mcmc)

par(mfrow = c(1,3), mar = c(4,4,1,0.5))
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("a0", "a1"),
                        XLIM = c(-0.0005,0.0018), YLIM = c(-0.2,0.22),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha0", "lalpha1"),
                        XLIM = c(-12.5,-11.5), YLIM = c(-2.4,-1.7),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha1", "lbeta1"),
                        XLIM = c(-2.4,-1.7), YLIM = c(-0.245,-0.175),
                        add_trace = T, add_where_next = T)

What would happen if we reduced the standard deviations too much?

diagS <- diag(c(0.000001, 0.001, 0.01, 0.005, 0.001))
corrS <- diag(5)
Sigma <- diagS %*% corrS %*% diagS
mcmc <- ARGARCH_MHalg(dax, M, Sigma, start=c(
  a0=0.000795, a1=0.00730, lalpha0=-11.98, lalpha1=-2.05, lbeta1=-0.211))
traceplots(mcmc)

par(mfrow = c(1,3), mar = c(4,4,1,0.5))
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("a0", "a1"),
                        XLIM = c(-0.0001,0.0015), YLIM = c(-0.1,0.1),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha0", "lalpha1"),
                        XLIM = c(-12.5,-11.5), YLIM = c(-2.4,-1.7),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha1", "lbeta1"),
                        XLIM = c(-2.4,-1.7), YLIM = c(-0.245,-0.175),
                        add_trace = T, add_where_next = T)

The acceptance rate is high; however, the chain would linger in certain parts of the posterior for a long time before moving elsewhere. The short steps also result in high autocorrelation, which complicates the inference even further.
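To quantify this, we can compute the effective sample size of the chain (using the coda package); with such short steps it will be only a small fraction of the M = 1000 draws.

library("coda")   # for effectiveSize()
effectiveSize(as.mcmc(as.matrix(mcmc$chain)))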

Finally, we arrive at a reasonable choice of standard deviations:

diagS <- diag(c(0.0003, 0.012, 0.13, 0.08, 0.012))
corrS <- diag(5)
Sigma <- diagS %*% corrS %*% diagS
mcmc <- ARGARCH_MHalg(dax, M, Sigma, start)
traceplots(mcmc)

par(mfrow = c(1,3), mar = c(4,4,1,0.5))
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("a0", "a1"),
                        XLIM = c(-0.0001,0.0015), YLIM = c(-0.1,0.1),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha0", "lalpha1"),
                        XLIM = c(-12.5,-11.5), YLIM = c(-2.4,-1.7),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha1", "lbeta1"),
                        XLIM = c(-2.4,-1.7), YLIM = c(-0.245,-0.175),
                        add_trace = T, add_where_next = T)

We also notice that the posterior draws of \(\alpha_0'\), \(\alpha_1'\) and \(\beta_1'\) are correlated:

cor(mcmc$chain)
##                  a0          a1     lalpha0     lalpha1      lbeta1
## a0       1.00000000 -0.15105622 -0.03846381  0.06948551 -0.09375933
## a1      -0.15105622  1.00000000 -0.03703793 -0.12431537  0.18815676
## lalpha0 -0.03846381 -0.03703793  1.00000000 -0.47330453 -0.33768792
## lalpha1  0.06948551 -0.12431537 -0.47330453  1.00000000 -0.49061351
## lbeta1  -0.09375933  0.18815676 -0.33768792 -0.49061351  1.00000000

This is why we also add some correlation to the proposal distribution:

diagS <- diag(c(0.0003, 0.012, 0.13, 0.08, 0.012))
corrS <- matrix(c(1, 0.0, 0.0, 0.0, 0.0,
                  0.0, 1, 0.0, 0.0, 0.0,
                  0.0, 0.0, 1, -0.4, -0.5,
                  0.0, 0.0, -0.4, 1, -0.3, 
                  0.0, 0.0, -0.5, -0.3, 1), nrow = 5, byrow = T)
Sigma <- diagS %*% corrS %*% diagS
mcmc <- ARGARCH_MHalg(dax, M, Sigma, start)
traceplots(mcmc)

par(mfrow = c(1,3), mar = c(4,4,1,0.5))
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("a0", "a1"),
                        XLIM = c(-0.0001,0.0015), YLIM = c(-0.1,0.1),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha0", "lalpha1"),
                        XLIM = c(-12.5,-11.5), YLIM = c(-2.4,-1.7),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha1", "lbeta1"),
                        XLIM = c(-2.4,-1.7), YLIM = c(-0.245,-0.175),
                        add_trace = T, add_where_next = T)

This seems to improve both the acceptance probability and the exploration of the posterior. But there is also a downside to having correlations in the proposal density. Imagine we started from a point far away from where the posterior is concentrated.

diagS <- diag(c(0.0003, 0.012, 0.13, 0.08, 0.012))
corrS <- matrix(c(1, 0.0, 0.0, 0.0, 0.0,
                  0.0, 1, 0.0, 0.0, 0.0,
                  0.0, 0.0, 1, -0.7, -0.7,
                  0.0, 0.0, -0.7, 1, -0.7, 
                  0.0, 0.0, -0.7, -0.7, 1), nrow = 5, byrow = T)
Sigma <- diagS %*% corrS %*% diagS
mcmc <- ARGARCH_MHalg(dax, M, Sigma, start=c(
  a0=-0.001, a1=-0.01, lalpha0=-13, lalpha1=-3, lbeta1=-0.3))
traceplots(mcmc)

par(mfrow = c(1,3), mar = c(4,4,1,0.5))
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("a0", "a1"),
                        XLIM = c(-0.0012,0.002), YLIM = c(-0.1,0.1),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha0", "lalpha1"),
                        XLIM = c(-16,-11.5), YLIM = c(-3.5,-1.7),
                        add_trace = T, add_where_next = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha1", "lbeta1"),
                        XLIM = c(-3.5,-1.7), YLIM = c(-0.35,-0.05),
                        add_trace = T, add_where_next = T)

The correlations make it more difficult to move in the direction of steepest ascent. The chain is then more likely to zig-zag, which prolongs the burn-in phase.

Idea: find a reasonable starting value first, and only then add correlations to the proposal; see the sketch below.
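For illustration, a short pilot run with the uncorrelated proposal can provide such a starting value, after which we switch to the correlated proposal (reusing the correlation structure introduced above); the run below simply restarts from the original starting value instead.

## Stage 1: short pilot run with the diagonal proposal to locate the posterior
diagS <- diag(c(0.0003, 0.012, 0.13, 0.08, 0.012))
Sigma_pilot <- diagS %*% diag(5) %*% diagS
pilot <- ARGARCH_MHalg(dax, M = 500, Sigma_pilot, start)

## Stage 2: restart from the pilot's last state with the correlated proposal
corrS <- matrix(c(1, 0.0, 0.0, 0.0, 0.0,
                  0.0, 1, 0.0, 0.0, 0.0,
                  0.0, 0.0, 1, -0.4, -0.5,
                  0.0, 0.0, -0.4, 1, -0.3,
                  0.0, 0.0, -0.5, -0.3, 1), nrow = 5, byrow = TRUE)
Sigma_corr <- diagS %*% corrS %*% diagS
mcmc2 <- ARGARCH_MHalg(dax, M = 2000, Sigma_corr, start = unlist(pilot$laststate))
traceplots(mcmc2)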

Finally, we arrive at the following setting:

diagS <- diag(c(0.0003, 0.012, 0.13, 0.08, 0.012))
corrS <- matrix(c(1, 0.0, 0.0, 0.0, 0.0,
                  0.0, 1, 0.0, 0.0, 0.0,
                  0.0, 0.0, 1, -0.5, -0.45,
                  0.0, 0.0, -0.5, 1, -0.3, 
                  0.0, 0.0, -0.45, -0.3, 1), nrow = 5, byrow = T)
Sigma <- diagS %*% corrS %*% diagS
mcmc <- ARGARCH_MHalg(dax, M = 10000, Sigma, start)
traceplots(mcmc)

par(mfrow = c(1,3), mar = c(4,4,1,0.5))
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("a0", "a1"),
                        XLIM = c(-0.0001,0.0015), YLIM = c(-0.1,0.1),
                        add_trace = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha0", "lalpha1"),
                        XLIM = c(-12.5,-11.5), YLIM = c(-2.4,-1.7),
                        add_trace = T)
image_partial_posterior(data = dax, mcmc = mcmc, vars = c("lalpha1", "lbeta1"),
                        XLIM = c(-2.4,-1.7), YLIM = c(-0.245,-0.175),
                        add_trace = T)

summary(mcmc$chain)
##        a0                   a1               lalpha0          lalpha1           lbeta1       
##  Min.   :-5.327e-05   Min.   :-0.087638   Min.   :-12.46   Min.   :-2.397   Min.   :-0.2449  
##  1st Qu.: 6.552e-04   1st Qu.:-0.011471   1st Qu.:-12.07   1st Qu.:-2.112   1st Qu.:-0.2177  
##  Median : 7.986e-04   Median : 0.006885   Median :-11.98   Median :-2.050   Median :-0.2111  
##  Mean   : 7.960e-04   Mean   : 0.007507   Mean   :-11.98   Mean   :-2.054   Mean   :-0.2110  
##  3rd Qu.: 9.359e-04   3rd Qu.: 0.027635   3rd Qu.:-11.88   3rd Qu.:-1.997   3rd Qu.:-0.2044  
##  Max.   : 1.536e-03   Max.   : 0.104214   Max.   :-11.53   Max.   :-1.773   Max.   :-0.1781
apply(mcmc$chain, 2, sd)
##           a0           a1      lalpha0      lalpha1       lbeta1 
## 0.0002095183 0.0280057539 0.1365690414 0.0912294258 0.0096818757
par(mfrow = c(1,1), mar = c(4,4,4,4))
psych::pairs.panels(mcmc$chain)

par(mfrow = c(2,3), mar = c(4,4,4,0.5))
for(i in 1:5){
  acf(mcmc$chain[,i], main = colnames(mcmc$chain)[i])
}

Conclusion

It would be great if the proposal distribution could be tuned automatically.
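One common heuristic is to base \(\Sigma\) on a pilot run: take the empirical covariance of an earlier chain and scale it by \(2.38^2/d\), the classical scaling for a \(d\)-dimensional random-walk proposal. A sketch reusing the last chain above (this is not the approach taken in this exercise):

## adapt the proposal to the empirical covariance of a pilot run
d <- ncol(mcmc$chain)
Sigma_adapt <- (2.38^2 / d) * cov(as.matrix(mcmc$chain))
mcmc_adapt <- ARGARCH_MHalg(dax, M = 10000, Sigma_adapt, start = unlist(mcmc$laststate))
traceplots(mcmc_adapt)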

One idea comes from Laplace approximation. Locally, posterior log-density \(\ell(\boldsymbol{\theta}' \,|\, \mathbf{Y})\) can be approximated by a quadratic form (Taylor expansion): \[ \ell(\boldsymbol{\theta}' \,|\, \mathbf{Y}) = \ell(\widehat{\boldsymbol{\theta}}' \,|\, \mathbf{Y}) + \left. \dfrac{\partial \ell(\boldsymbol{\theta}' \,|\, \mathbf{Y})}{\partial \boldsymbol{\theta}'} \right|_{\boldsymbol{\theta}' = \widehat{\boldsymbol{\theta}}'} \left(\boldsymbol{\theta}' - \widehat{\boldsymbol{\theta}}'\right) + \frac{1}{2} \left(\boldsymbol{\theta}' - \widehat{\boldsymbol{\theta}}'\right)^\top \left.\dfrac{\partial^2 \ell(\boldsymbol{\theta}' \,|\, \mathbf{Y})}{\partial \boldsymbol{\theta}' \partial \boldsymbol{\theta}'^\top} \right|_{\boldsymbol{\theta}' = \widehat{\boldsymbol{\theta}}'} \left(\boldsymbol{\theta}' - \widehat{\boldsymbol{\theta}}'\right) + \cdots \]

If \(\widehat{\boldsymbol{\theta}}'\) maximizes \(\ell(\boldsymbol{\theta}' \,|\, \mathbf{Y})\), then, as a stationary point, the gradient has to be zero: \(\left.\dfrac{\partial \ell(\boldsymbol{\theta}' \,|\, \mathbf{Y})}{\partial \boldsymbol{\theta}'} \right|_{\boldsymbol{\theta}' = \widehat{\boldsymbol{\theta}}'} = \mathbf{0}\), and the approximation can be written as \[ \ell(\boldsymbol{\theta}' \,|\, \mathbf{Y}) \approx \mathrm{const.} - \frac{1}{2} \left(\boldsymbol{\theta}' - \widehat{\boldsymbol{\theta}}'\right)^\top \left[-\left.\dfrac{\partial^2 \ell(\boldsymbol{\theta}' \,|\, \mathbf{Y})}{\partial \boldsymbol{\theta}' \partial \boldsymbol{\theta}'^\top} \right|_{\boldsymbol{\theta}' = \widehat{\boldsymbol{\theta}}'}\right] \left(\boldsymbol{\theta}' - \widehat{\boldsymbol{\theta}}'\right), \] which resembles the log-pdf of a multivariate normal distribution. Hence, the negative Hessian matrix evaluated at \(\widehat{\boldsymbol{\theta}}'\) is a reasonable choice for \(\Sigma^{-1}\) if proposals are made close to the posterior mode \(\widehat{\boldsymbol{\theta}}'\).
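A minimal sketch of this idea with base R's optim, assuming that the optimizer (Nelder-Mead with an increased iteration limit) converges from our starting value and that the numerical Hessian is positive definite:

## maximize the posterior log-density by minimizing its negative
neglogpost <- function(theta){
  names(theta) <- names(start)
  -logposterior(dax, state = theta)
}
fit <- optim(start, neglogpost, control = list(maxit = 5000), hessian = TRUE)
theta_hat <- setNames(fit$par, names(start))   # approximate posterior mode
Sigma_laplace <- solve(fit$hessian)            # inverse negative Hessian of the log-posterior
mcmc_laplace <- ARGARCH_MHalg(dax, M = 10000, Sigma_laplace, start = theta_hat)
traceplots(mcmc_laplace)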

Hamiltonian Monte Carlo, as used by RStan, pushes this idea even further.