MCMC

Construction of a Markov chain

Suppose we follow the Metropolis-Hastings algorithm, which guarantees that the target PDF satisfies the detailed balance equations (and hence the global balance equations) under the adjusted transition density \(\tilde{q}(y|x)\). The remaining task is then to construct a base transition density \(q(y|x)\) that is ergodic.

Constructing some ergodic transition density \(q(y|x)\) is easy (e.g., \(\text{N}(x,\sigma^2I)\)). Constructing a good, sample-efficient one is hard, especially in high dimensions. If you are interested, look up Hamiltonian Monte Carlo.
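For instance, a random-walk Metropolis sampler uses exactly this normal proposal. Here is a minimal sketch; the standard normal target and \(\sigma=1\) are assumptions chosen purely for illustration.

# Random-walk Metropolis with proposal N(x, sigma^2).
# The target is assumed (for illustration) to be the standard normal,
# known only up to its normalising constant.
f <- function(x) exp(-x^2/2)
sigma <- 1
n <- 1e4
x <- numeric(n)                    # x[1] = 0 is the starting state
for (i in 2:n) {
  y <- rnorm(1, x[i-1], sigma)     # propose from N(x, sigma^2)
  if (runif(1) < f(y)/f(x[i-1])) { # symmetric proposal, so q(y|x) cancels
    x[i] <- y
  } else {
    x[i] <- x[i-1]
  }
}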

We saw some examples of continuous-state MCMC in the Week 11 lab. Here are two finite-state examples. (In practice, no one should use MCMC to sample from a finite PMF.)

# Target
pi <- 9:16
pi <- pi/sum(pi)
pi
## [1] 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16
# Uniformly random
m <- length(pi)
Q1 <- matrix(0, m, m) 
for (i in 1:m) {
  p <- runif(m)
  p <- p/sum(p)
  Q1[i,] <- p
}
round(Q1,3)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,] 0.109 0.202 0.077 0.054 0.095 0.028 0.256 0.178
## [2,] 0.109 0.085 0.104 0.095 0.182 0.178 0.163 0.085
## [3,] 0.048 0.159 0.065 0.214 0.087 0.221 0.087 0.119
## [4,] 0.208 0.035 0.065 0.161 0.034 0.218 0.071 0.208
## [5,] 0.205 0.254 0.022 0.136 0.192 0.087 0.021 0.082
## [6,] 0.246 0.196 0.107 0.022 0.099 0.050 0.073 0.206
## [7,] 0.210 0.212 0.193 0.004 0.040 0.064 0.202 0.074
## [8,] 0.144 0.138 0.055 0.115 0.182 0.183 0.007 0.176
# Random walk on a complete graph
Q2 <- matrix(0, m, m)
for (i in 1:m) {
  Q2[i,-i] <- 1/(m-1)
}
round(Q2,3)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,] 0.000 0.143 0.143 0.143 0.143 0.143 0.143 0.143
## [2,] 0.143 0.000 0.143 0.143 0.143 0.143 0.143 0.143
## [3,] 0.143 0.143 0.000 0.143 0.143 0.143 0.143 0.143
## [4,] 0.143 0.143 0.143 0.000 0.143 0.143 0.143 0.143
## [5,] 0.143 0.143 0.143 0.143 0.000 0.143 0.143 0.143
## [6,] 0.143 0.143 0.143 0.143 0.143 0.000 0.143 0.143
## [7,] 0.143 0.143 0.143 0.143 0.143 0.143 0.000 0.143
## [8,] 0.143 0.143 0.143 0.143 0.143 0.143 0.143 0.000

\(Q_1\) is almost surely ergodic because its entries are randomly picked (hence, strictly between \(0\) and \(1\) with probability \(1\)), making the chain irreducible and aperiodic.

An MCMC estimate under \(Q_1\).

alpha1 <- function(x, y) min(pi[y]*Q1[y,x]/(pi[x]*Q1[x,y]), 1)

n <- 1e6
x <- rep(0, n)
x[1] <- sample(1:m, 1)
cnt <- 0
for(i in 2:n){
  y <- sample(1:m, 1, prob=Q1[x[i-1],])
  if(runif(1) < alpha1(x[i-1],y)){
    x[i] <- y
    cnt <- cnt + 1
  } else {
    x[i] <- x[i-1]
  }
}

burn_in <- 0.01*n
round(table(x[(burn_in+1):n])/(n-burn_in), 2)
## 
##    1    2    3    4    5    6    7    8 
## 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16
paste("Acceptance rate of Q1 is", cnt/n)
## [1] "Acceptance rate of Q1 is 0.697752"

An MCMC estimate under \(Q_2\).

alpha2 <- function(x, y) min(pi[y]*Q2[y,x]/(pi[x]*Q2[x,y]), 1)

n <- 1e6
x <- rep(0, n)
x[1] <- sample(1:m, 1)
cnt <- 0
for(i in 2:n){
  y <- sample(1:m, 1, prob=Q2[x[i-1],])
  if(runif(1) < alpha2(x[i-1],y)){
    x[i] <- y
    cnt <- cnt + 1
  } else {
    x[i] <- x[i-1]
  }
}

burn_in <- 0.01*n
round(table(x[(burn_in+1):n])/(n-burn_in), 2)
## 
##    1    2    3    4    5    6    7    8 
## 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16
paste("Acceptance rate is", cnt/n)
## [1] "Acceptance rate is 0.880191"

Gibbs sampling as a special case of Metropolis-Hastings

Suppose \(x=(x^{(1)},\dots,x^{(d)})\). Then, in a transition step from \(X_{i}=x_i\) to \(X_{i+1}=x_{i+1}\), Gibbs sampling performs \(d\) sub-steps:

  1. Sample \(x_{i+1}^{(1)}\) from \(f^{(1)}(x^{(1)}|x_i^{(2)},\dots,x_i^{(d)})\)
  2. Sample \(x_{i+1}^{(2)}\) from \(f^{(2)}(x^{(2)}|x_{i+1}^{(1)}, x_i^{(3)},\dots,x_i^{(d)})\)
  3. Sample \(x_{i+1}^{(3)}\) from \(f^{(3)}(x^{(3)}|x_{i+1}^{(1)},x_{i+1}^{(2)},x_i^{(4)},\dots,x_i^{(d)})\)
    \(\vdots\)
  d. Sample \(x_{i+1}^{(d)}\) from \(f^{(d)}(x^{(d)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(d-1)})\)

Looking at a sample \(x_{i+1}^{(k)}\) drawn from \[f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)}),\] we may embed \(x_{i+1}^{(k)}\) in a full-length vector and treat each sub-step as one transition of a chain indexed by \(j\). That is, \[\begin{align} \vdots&\\ x_{j} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ x_{j+1} &= (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i+1}^{(k)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ \vdots& \end{align}\] Clearly, the joint density factorises as \[f(x) = f^{(k)}(x^{(k)}|x^{(-k)}) f^{(-k)}(x^{(-k)}),\] where we write \(x^{(-k)} = (x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\).

Note that the transition densities of \(x_j\to x_{j+1}\) and \(x_{j+1}\to x_j\) are both given by the conditional PDF for \(x^{(k)}\), \[\begin{align} q(x_{j+1}|x_j) &= f^{(k)}(x_{i+1}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})\\ q(x_j|x_{j+1}) &= f^{(k)}(x_{i}^{(k)}|x_{i+1}^{(1)},\dots,x_{i+1}^{(k-1)},x_{i}^{(k+1)},\dots,x_{i}^{(d)})\\ &= f^{(k)}(x_{i}^{(k)}|x^{(-k)}). \end{align}\]

Now, it follows that \[\begin{align} \alpha(x_{j},x_{j+1}) &= \min\left(1,\; \frac{f(x_{j+1})q(x_{j}|x_{j+1})}{f(x_{j})q(x_{j+1}|x_{j})}\right)\\ &= \min\left(1,\; \frac{f^{(k)}(x_{i+1}^{(k)}|x^{(-k)}) f^{(-k)}(x^{(-k)})f^{(k)}(x_{i}^{(k)}|x^{(-k)})}{f^{(k)}(x_{i}^{(k)}|x^{(-k)}) f^{(-k)}(x^{(-k)})f^{(k)}(x_{i+1}^{(k)}|x^{(-k)})}\right)\\ &= \min(1, 1)\\ &= 1 \end{align}\]
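Every Gibbs sub-step is therefore a Metropolis-Hastings move that is always accepted. As a minimal sketch (this bivariate normal target is an illustrative assumption, not an example from the lab), recall that the full conditionals of a standard bivariate normal with correlation \(\rho\) are \(X^{(1)}|X^{(2)}=x^{(2)} \sim \text{N}(\rho x^{(2)}, 1-\rho^2)\) and vice versa.

# Gibbs sampler for a standard bivariate normal with correlation rho.
rho <- 0.8
n <- 1e4
x <- matrix(0, n, 2)
for (i in 2:n) {
  x[i,1] <- rnorm(1, rho*x[i-1,2], sqrt(1-rho^2)) # sub-step 1: update x^(1)
  x[i,2] <- rnorm(1, rho*x[i,1], sqrt(1-rho^2))   # sub-step 2: uses the new x^(1)
}
cor(x[,1], x[,2]) # close to rho; no draw is ever rejected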

Bayesian statistics

(If you are interested in being a Bayesian, consider taking MAST90125 Bayesian Statistical Learning in Semester 2.)

In statistics, we are interested in learning about unknown quantities of interest (e.g., system parameters) from data. Bayesians approach the problem by modelling those unknown quantities as random variables.

Suppose we model that some system of interest is governed by parameters \(\alpha\) and that we observe system outputs \(x\). First, to make the problem learnable, we restrict the distribution of \(\alpha\) to a certain class of probability distributions. In addition, even given \(\alpha\), we do not think our model completely captures the underlying mechanism, so we assume \(x\) follows some probability distribution parameterised by \(\alpha\).

In Bayesian statistics, the central object to learn about is the posterior, which is given by Bayes' formula: \[f(\alpha|x) = \frac{f(x,\alpha)}{f(x)} = \frac{f(x|\alpha)f(\alpha)}{\int f(x|\alpha)f(\alpha) d\alpha} \propto f(x|\alpha)f(\alpha).\]

In classical (a.k.a. frequentist) statistics, the same mathematical expression \(f(x;\alpha) = f(x|\alpha)\) is viewed as the likelihood: a function of \(\alpha\) for fixed \(x\). For example, \(\hat{\alpha}_{MLE} = \underset{\alpha}{\operatorname{argmax}} f(x;\alpha)\).
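As a standard conjugate illustration (an example added here, not one from the notes above): with a binomial likelihood and a beta prior, \[x|\alpha \sim \text{Bin}(n,\alpha), \quad \alpha \sim \text{Beta}(a,b) \quad\Rightarrow\quad f(\alpha|x) \propto \alpha^{x}(1-\alpha)^{n-x}\,\alpha^{a-1}(1-\alpha)^{b-1},\] so the posterior is \(\alpha|x \sim \text{Beta}(a+x,\, b+n-x)\), whereas the frequentist MLE is \(\hat{\alpha}_{MLE} = x/n\).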

Hierarchical models

In practice, assuming \(\alpha \sim f(\alpha)\) is often too strong/rigid to flexibly capture complex phenomena. As stated in the second paragraph above, we may want to restrict \(\alpha\) only to a class of distributions instead of to a single distribution. One of the easiest approaches is to introduce another conditional PDF for \(\alpha\) given, say, \(\beta\). Now, we have a richer model: \[f(\alpha,\beta|x) = \frac{f(x,\alpha,\beta)}{f(x)} \propto f(x|\alpha)f(\alpha|\beta)f(\beta).\]
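For instance (an illustrative specification, not a model from the notes), a normal hierarchy \[x|\alpha \sim \text{N}(\alpha,\sigma^2),\quad \alpha|\beta \sim \text{N}(\beta,\tau^2),\quad \beta \sim \text{N}(0,1)\] replaces the single prior for \(\alpha\) with a class of normal priors indexed by the hyperparameter \(\beta\).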

If our interest is the marginal posterior PDF for \(\alpha\), we can integrate out hyperparameter \(\beta\): \[f(\alpha|x) = \int f(\alpha,\beta|x) d\beta.\]

There are often multiple integrals involved in Bayesian hierarchical models. Aside from \(f(x)=\int f(x|\alpha)f(\alpha) d\alpha\), a key one is \[\begin{align} \mathbb{E}[\alpha|x] &= \int \alpha f(\alpha|x) d\alpha\\ &= \int \alpha \int f(\alpha,\beta|x) d\beta\, d\alpha. \end{align}\] Modellers have (and create) choices about which method to use for each integral. For example,

  • Use conjugate priors to make some integrals analytically tractable or amenable to the Gibbs sampler.
  • Keep the model faithful to the existing domain knowledge and deal with difficult integrals using MCMC.
  • Blend MCMC and acceptance-rejection sampling within Gibbs sampling: some sub-vectors are sampled directly by off-the-shelf samplers while others are handled by MCMC or acceptance-rejection.

The choice really depends on the modeller and the problem situation.

Of course, nothing prevents us from going even deeper and introducing another level to the hierarchy. In practice, however, we often give up being completely Bayesian at some level and simply fix a prior at the end of the hierarchy.

Empirical Bayes

At other times in practice, we give up assuming a prior for \(\beta\) altogether. In this case, \[f(\alpha|\beta,x) = \frac{f(x,\alpha|\beta)}{f(x|\beta)} \propto f(x|\alpha)f(\alpha|\beta).\] Clearly, to use the knowledge \(f(\alpha|\beta,x)\) obtained from data \(x\), we need a value of \(\beta\). What value, and how do we get it?

Type II maximum likelihood is an approach to obtaining a reasonable value of \(\beta\). To proceed, first, integrate out \(\alpha\) from \(f(x,\alpha|\beta)\) to get \(f(x|\beta)\) — the marginal likelihood of \(\beta\). Then, maximise \(f(x|\beta)\) to obtain \[\hat{\beta} = \underset{\beta}{\operatorname{argmax}} f(x|\beta).\]

Now, with this \(\hat{\beta}\), make inferences on \(\alpha\) based on \(f(\alpha|\hat{\beta},x)\).
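As a small worked instance (the model is assumed purely for illustration): if \(x|\alpha \sim \text{N}(\alpha,1)\) and \(\alpha|\beta \sim \text{N}(\beta,1)\), integrating out \(\alpha\) gives the marginal likelihood \(x|\beta \sim \text{N}(\beta,2)\), which is maximised at \(\hat{\beta}=x\). Inference on \(\alpha\) is then based on \(f(\alpha|\hat{\beta},x)\), which here is \(\text{N}\left(\frac{x+\hat{\beta}}{2},\frac{1}{2}\right) = \text{N}(x,\frac{1}{2})\).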

I believe this is the reason for the name Type II maximum likelihood: in contrast to standard maximum likelihood, which maximises the likelihood \(f(x|\alpha)\) with respect to the direct parameter \(\alpha\), here \(\alpha\) is integrated out and the marginal likelihood \(f(x|\beta)\) is maximised with respect to the hyperparameter \(\beta\).

Note that the use of the point estimate \(\hat{\beta}\) is conceptually quite different from obtaining \(f(\alpha|x)\) by integrating out \(\beta\) from \(f(\alpha,\beta|x)\), which averages the effect of the hyperparameter over all possible values of \(\beta\). Although it starts with a prior assumption on \(\alpha\), empirical Bayes is much less Bayesian.

Bayesian optimisation

The performance of Bayesian optimisation depends on the quality of both the surrogate model and the acquisition function. While some researchers emphasise the latter, to me the former tends to be more important in practice: how well the GP captures key characteristics of the objective function.

GP hyperparameters

Assume \(m(x)=0, \forall x\). Then, the performance of the GP depends solely on the quality of the covariance function \(k(x,x')\). Following the Bayesian hierarchical approach, we can introduce hyperparameters to \(k\) to increase the flexibility of the covariance function.

Previously, we had \(k(x,x')=\exp(-\|x-x'\|^2)\). Now, introduce two hyperparameters \(\sigma^2\) and \(l\), which may be interpreted as a variance and a length-scale respectively, in the following way: \[k(x,x';\sigma^2,l) = \sigma^2\exp\left(-\left\|\frac{x-x'}{l}\right\|^2\right).\] \(\sigma^2\) is easy to interpret: \(k(x,x)=\sigma^2\) is the variance of \(Y(x)\). Now, we have flexibility in the spread of the normal random variables.

Next, remember that \(k(x,x')\) measures the similarity of \(Y(x)\) and \(Y(x')\) based on the distance between \(x\) and \(x'\). There are fixed bounds at the extremes: \(\|x-x'\|=0\) implies \(\text{Cov}(Y(x),Y(x'))=\sigma^2\), and \(\|x-x'\|\to\infty\) implies \(\text{Cov}(Y(x),Y(x'))\to0\). However, it would be nice to have some flexibility in between, because the correspondence between distance in \(x\) and similarity in \(Y(x)\) may not be so obvious. The length-scale hyperparameter \(l\) lets us adjust this correspondence: the larger \(l\) is, the more similar \(Y(x)\) and \(Y(x')\) are at a given distance between \(x\) and \(x'\).
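A quick numerical illustration of this effect (the distance \(\|x-x'\|=1\), \(\sigma^2=1\), and the three length-scales are arbitrary choices):

k <- function(d, sigma2, l) sigma2 * exp(-(d/l)^2) # squared-exponential kernel
sapply(c(0.5, 1, 2), function(l) k(1, 1, l))       # covariance at distance 1
## [1] 0.01831564 0.36787944 0.77880078

The larger the length-scale, the larger the covariance at the same distance.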

The kernel above is the most basic kind, classified as isotropic: \(k(x,x')=k_1(\|x-x'\|)\). There are many other variations in covariance functions, for example, stationary but anisotropic kernels \(k(x,x')=k_2(x-x')\), or even non-stationary ones.

Prior, likelihood, and marginal likelihood

In practice, Type II maximum likelihood is commonly used to fit GP hyperparameters to data. What is the marginal likelihood here?

Given the fixed inputs \(x\) to a GP, the prior is \(\text{N}(0,K)\). If we assume a noise-less objective function, the responses \(Y(x)\) are deterministic given the function values, so there is no separate likelihood. Hence, the joint distribution formed by the prior and the likelihood is just the prior.

To see this, let’s write \(g\sim\mathcal{GP}(0,k)\). The noise-less data implies \(Y(x)=g(x)\) for all input \(x\), which is why we may write \(Y\sim\mathcal{GP}(0,k)\). Consequently, the prior PDF already provides the marginal likelihood (i.e., the conditional PDF for \(y\) given \(\sigma^2\) and \(l\)): \[f(y|\sigma^2,l) = f(g|\sigma^2,l) = (2\pi)^{-\frac{n}{2}}|K|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}y'K^{-1}y\right),\] where \(K\) is a function of \(\sigma^2\) and \(l\).

If the noise is assumed to be IID normal \(\epsilon \sim \text{N}(0,\sigma^2_nI)\), then we have \(Y(x) = g(x)+\epsilon\) and the likelihood is \(Y(x)|g(x) \sim \text{N}(g(x),\sigma^2_nI)\). In this case, the marginal likelihood is, \[\begin{align} f(y|\sigma^2,l) &= \int f(y|g)f(g|\sigma^2,l)dg\\ &= (2\pi)^{-\frac{n}{2}}|K+\sigma^2_nI|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}y'(K+\sigma^2_nI)^{-1}y\right). \end{align}\]

Thus, whether \(\sigma^2_n=0\) or not, \[(\hat{\sigma}^2,\hat{l}) = \underset{\sigma^2,l}{\operatorname{argmax}} f(y|\sigma^2,l).\]

Or, if \(\sigma^2_n\neq0\) is also optimised as a hyperparameter, \[(\hat{\sigma}^2,\hat{l},\hat{\sigma}^2_n) = \underset{\sigma^2,l,\sigma^2_n}{\operatorname{argmax}} f(y|\sigma^2,l,\sigma^2_n).\]
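To make this concrete, here is a minimal Type II maximum likelihood sketch in R. The toy data, the log-parameterisation (to keep the hyperparameters positive), and the function names are all assumptions for illustration.

# Squared-exponential kernel matrix for 1-D inputs.
sq_exp <- function(x1, x2, sigma2, l) {
  sigma2 * exp(-outer(x1, x2, function(a, b) ((a - b)/l)^2))
}

# Negative log marginal likelihood -log f(y|sigma2, l, sigma2_n),
# dropping the constant (n/2)*log(2*pi).
neg_log_marg_lik <- function(par, x, y) {
  sigma2 <- exp(par[1]); l <- exp(par[2]); sigma2_n <- exp(par[3])
  K <- sq_exp(x, x, sigma2, l) + sigma2_n * diag(length(x))
  0.5 * as.numeric(determinant(K)$modulus) + 0.5 * sum(y * solve(K, y))
}

set.seed(1)
x <- seq(0, 1, length.out = 20)
y <- sin(2*pi*x) + rnorm(20, sd = 0.1) # toy noisy responses
fit <- optim(c(0, 0, -2), neg_log_marg_lik, x = x, y = y)
exp(fit$par) # fitted (sigma2, l, sigma2_n)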

Styles of exam questions

The best advice is, unsurprisingly, to focus on your understanding and be ready to apply each method to simple problems to demonstrate that understanding.

1

The following code simulates a random variable \(Y\). What is the probability mass/density function of \(Y\)? Briefly explain your answer.

x <- rexp(1,2)
y <- 1 - exp(-2*x)

Solution

rexp(1,2) simulates the exponential distribution with rate 2, and \(1-\exp(-2x)\) is its CDF for all \(x\) in its support. Applying a continuous CDF to a draw from that same distribution yields a uniform draw (the probability integral transform, i.e., the inverse transform method run in reverse). So, the sample y is drawn from the uniform distribution over \([0,1]\); its PDF is \(1\) for \(y\in[0,1]\) and \(0\) otherwise.
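A quick empirical check in R (illustrative, not part of the expected answer):

y <- 1 - exp(-2*rexp(1e5, 2))
c(mean(y), var(y)) # should be close to 1/2 and 1/12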

2

est1 and est2 in the following code are MC estimates of some probability mass/density functions. Name each of the distributions with specific parameter value(s). Briefly explain your answer.

a <- 10
b <- 0.6
c <- 1e5

x1 <- c()
x2 <- c()
for (i in 1:c) {
  x <- rpois(1, a)
  if (runif(1) < b) {
    x1 <- c(x1, x)
  } else {
    x2 <- c(x2, x)
  }
}
est1 <- table(x1)/length(x1)
est2 <- table(x2)/length(x2)

Solution

Both estimate the PMF of pois(10): whether each draw goes to x1 or x2 is decided by an independent uniform, not by the value of the draw, so x1 and x2 are both i.i.d. samples from the Poisson distribution with mean a = 10.

par(mfrow=c(1,3))
barplot(est1,
        ylim=c(0,.15),
        main="est1",
        xlab="x",
        ylab=expression(P(X==x)))

barplot(est2,
        ylim=c(0,.15),
        main="est2",
        xlab="x",
        ylab=expression(P(X==x)))

n <- max(length(est1),length(est2)) - 1
x <- 0:n
barplot(dpois(x, a),
        ylim=c(0,.15),
        main=paste("pois(", a, ")", sep=""),
        xlab="x",
        ylab=expression(P(X==x)),
        names.arg = x)

3

What is the output of the following code? No partial mark will be given, so be careful.

v <- c(0, 1)
for (i in 3:9) {
  v <- c(v, v[i-1] + v[i-2])
}
v
## [1]  0  1  1  2  3  5  8 13 21

Solution

\((0, 1, 1, 2, 3, 5, 8, 13, 21)\). (If you noted that these are the Fibonacci numbers, that would impress graders.)

4

State the basic idea of gradient descent in a few sentences.

Suppose \(f(x,y) = 2x^2 + y^2\). First, write the general formula for minimisation. Then, given \((x_0,y_0) = (1,1)\) and step size \(\alpha=0.1\), compute the first two steps of gradient descent for minimisation.

Solution

Since the gradient points in the direction of fastest increase of the objective function, gradient descent iteratively takes a small step in the opposite direction, the direction of fastest decrease, to reach a local minimum. In general, the step size should be small because the gradient, and hence the direction of fastest decrease, may change after each step.

\[\begin{gather} (x_{n+1},y_{n+1}) = (x_{n},y_{n}) - \alpha \nabla f(x_{n},y_{n})\\ \nabla f(x,y) = (4x, 2y)\\ (x_1,y_1) = (1,1) - 0.1\times(4,2) = (0.6,0.8)\\ (x_2,y_2) = (0.6,0.8) - 0.1\times(2.4,1.6) = (0.36,0.64) \end{gather}\]

(Of course, you would want to add a few more words to explain your work.)
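A quick numerical check of these two steps in R (illustrative, not required in an answer):

grad <- function(p) c(4*p[1], 2*p[2]) # gradient of f(x,y) = 2x^2 + y^2
p <- c(1, 1)
p <- p - 0.1*grad(p); p
## [1] 0.6 0.8
p <- p - 0.1*grad(p); p
## [1] 0.36 0.64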

5

Apply the line search to the same \(f\) and \((x_0,y_0)\) above to compute \((x_1,y_1)\). How about \(f(x,y)=x^2+y^2\)?

Solution

\[\begin{gather} \alpha\nabla f(1,1) = (4\alpha, 2\alpha)\\ \underset{\alpha}{\operatorname{argmin}} f(1-4\alpha, 1-2\alpha) = \underset{\alpha}{\operatorname{argmin}} 2(1-4\alpha)^2 + (1-2\alpha)^2 = \frac{5}{18}.\\ (x_1,y_1) = (1,1) - \frac{5}{18}(4,2) = (-1/9,4/9). \end{gather}\]

In the second case, notice \(\nabla f(x,y) = 2(x,y)\): the gradient is a scalar multiple of the coordinates. So, we can reach the minimum \(f(0,0)=0\) from any point \((x,y)\) in a single step with \(\alpha = 0.5\). Hence, \((x_1,y_1)=(0,0)\).
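A quick numerical confirmation of the exact step size in the first case (illustrative):

f <- function(x, y) 2*x^2 + y^2
optimize(function(a) f(1 - 4*a, 1 - 2*a), c(0, 1))$minimum # approx 5/18 = 0.2778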