Introduction

Monte Carlo

The Monte Carlo method is a class of algorithms that draw random numbers and compute their statistics to approximate quantities of interest. The method uses randomness even to solve deterministic problems for which deterministic algorithms are too inefficient to be practical.

Stanislaw Ulam was one of the contributors who developed the method in the mid-20th century. The method was named after the Monte Carlo Casino in Monaco, where Ulam’s uncle would go to gamble.

Monte Carlo integration is the most common application of the method.

Key ideas

Recall that, for many functions, it is impossible to carry out analytical integration, which is the motivation for Numerical Integration.

Deterministic methods such as the trapezoidal rule and Simpson’s rule have their pros and cons. They work well in one dimension, but their performance rapidly deteriorates as the dimension increases.

Monte Carlo (MC) integration is an alternative approach suited to high-dimensional problems. Despite its relatively poor performance in one dimension, it tends to work better than deterministic methods in high dimensions.

The foundation of MC integration is the law of large numbers. In words, for reasonable \(X\) and \(\phi\), the sample average of \(\{\phi(X_i)\}\) converges to \(\mathbb{E}\phi(X)\) as \(n\to\infty\).
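
In symbols, if \(X_1,\dots,X_n\) are i.i.d. samples from the distribution of \(X\), \[\frac{1}{n}\sum_{i=1}^n \phi(X_i) \longrightarrow \mathbb{E}\phi(X) \quad\text{as } n\to\infty .\]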

How can this help with difficult integration? Remember that the mean of a continuous random variable \(X\) is defined as an integral: \[\mathbb{E}X = \int_{-\infty}^\infty xf(x)dx .\] Therefore, if we can express the difficult integral as the expectation of some random variable \(X\), from which we can collect lots of samples, then we can use the sample average to approximate that integral.

“We can collect lots of samples” is the catch, and it is addressed in Sampling Methods. For now, let’s assume we can.
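
As a quick illustration of the idea (a toy example of my own, not from the text): the mean of \(X\sim\text{exp}(1)\) is \(1\), and a sample average recovers it.

mean(rexp(1e5))   # sample average; should be close to E[X] = 1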

Suppose we want to integrate \(\phi(x)\) over the range \(\mathcal{X}\). Let \(f\) be the PDF of \(X\) with the support \(\mathcal{S}\) such that \(\mathcal{X}\subseteq\mathcal{S}\). Then,

\[\begin{align} \int_\mathcal{X}\phi(x)dx &= \int_\mathcal{S}\phi(x)\mathbb{1}_{\mathcal{X}}(x)dx\\ &= \int_\mathcal{S} \frac{\phi(x)\mathbb{1}_{\mathcal{X}}(x)}{f(x)}f(x)dx\\ &= \mathbb{E}\left[\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right] \end{align}\]

We can approximate this expectation by the sample mean estimator \(\hat{M}\): \[\hat{M} = \frac{1}{n}\sum_{i=1}^n \frac{\phi(X_i)\mathbb{1}_{\mathcal{X}}(X_i)}{f(X_i)} \approx \mathbb{E}\left[\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right].\]
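
To make this concrete, here is a minimal sketch of \(\hat{M}\) as an R function (the name mc_integrate and its arguments are illustrative, not a standard API):

# Sketch of the estimator Mhat.
#   phi      : integrand
#   rsampler : function(n) drawing n samples from f
#   pdf      : density f of the sampling distribution
#   inside   : indicator of membership in the integration range X
mc_integrate <- function(phi, rsampler, pdf, inside, n = 1e5) {
  x <- rsampler(n)
  mean(inside(x) * phi(x) / pdf(x))
}

# Example: integrate x^2 over [0, 1] (exact value 1/3) with X ~ U(0, 1), f(x) = 1
mc_integrate(function(x) x^2,
             function(n) runif(n),
             function(x) rep(1, length(x)),
             function(x) x >= 0 & x <= 1)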

Note that \(\hat{M}\) is an unbiased estimator, meaning \[\mathbb{E}\hat{M} = \mathbb{E}\left[\frac{\phi(X)\mathbb{1}_{\mathcal{X}}(X)}{f(X)}\right].\]

If \(\mathcal{X}=\mathcal{S}\), of course, we can drop \(\mathbb{1}_{\mathcal{X}}\) everywhere and obtain cleaner expressions:
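
\[\int_\mathcal{S}\phi(x)dx = \mathbb{E}\left[\frac{\phi(X)}{f(X)}\right] \approx \frac{1}{n}\sum_{i=1}^n \frac{\phi(X_i)}{f(X_i)} .\]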

As you may notice, the indicator function \(\mathbb{1}_{\mathcal{X}}(x)\) simply checks \(x\)’s membership in \(\mathcal{X}\): it equals \(1\) if \(x\) is a member and \(0\) otherwise.

The judicious use of the indicator function \(\mathbb{1}_{\mathcal{X}}\) is powerful, allowing us to formulate many different quantities as expectations. Even a probability can be expressed as an expectation: \[ P(X\in \mathcal{X}) = \int_\mathcal{X}f(x)dx = \int_{-\infty}^\infty \mathbb{1}_{\mathcal{X}}(x)f(x)dx = \mathbb{E}\left[\mathbb{1}_{\mathcal{X}}(X)\right] .\]
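
For instance (an illustrative example of my own, not from the text), we can estimate \(P(1\leq Z\leq 2)\) for \(Z\sim N(0,1)\) by the sample average of the indicator and compare it with the exact value:

z <- rnorm(1e5)
mean(z >= 1 & z <= 2)   # MC estimate of P(1 <= Z <= 2)
pnorm(2) - pnorm(1)     # exact value, about 0.1359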

Inferior performance in 1-D

Let’s compare MC integration with some of the deterministic methods introduced in Numerical Integration.

The example problem is to integrate \(\phi(x) = x\sin(x)+5\) over \(\mathcal{X}=[0,3\pi]\), for which we know the answer is \(18\pi \approx 56.5486677646\). Remember that, with only \(n=100\) equally-spaced evaluation points, the trapezoidal rule gives \(56.54169\) and Simpson’s rule gives \(56.54867\).
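
As a quick sanity check, R’s built-in adaptive quadrature (separate from the rules above) agrees with \(18\pi\):

integrate(function(x) x*sin(x)+5, 0, 3*pi)   # approximately 56.5487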

We can, in principle, use any continuous distribution as long as its support covers \(\mathcal{X}\). Here we experiment with \(X\sim U(0,10)\) and \(X\sim\text{exp}(1)\), using \(n=10^5\) samples.

phi <- function(x) return(x*sin(x)+5)   # integrand
n <- 1e5                                # number of MC samples

First, \(X\sim U(0,10)\) for which the PDF is \(f(x)=1/10\) for \(x\in[0,10]\) and \(0\) otherwise.

x <- runif(n,0,10)   # samples from U(0,10), so f(x) = 1/10 on [0,10]
x <- x[x<3*pi]       # keep only samples inside the integration range [0, 3*pi]
10*sum(phi(x))/n     # Mhat: (1/n) * sum of phi(x_i) * 1_X(x_i) / f(x_i), with 1/f = 10
## [1] 56.59012

Next, \(X\sim\text{exp}(1)\) for which the PDF is \(f(x)=e^{-x}\) for \(x\geq0\) and \(0\) otherwise.

x <- rexp(n)           # samples from exp(1), so f(x) = exp(-x) on [0, Inf)
x <- x[x<3*pi]         # keep only samples inside [0, 3*pi]
sum(phi(x)*exp(x))/n   # Mhat with 1/f(x_i) = exp(x_i)
## [1] 55.32569

As you can see, the results are inferior to the deterministic rules above even with \(n=10^5\) samples, especially for \(\text{exp}(1)\). Well, this is expected: what is the point of using an \(X\) supported on \([0,\infty)\) to compute the integral of \(\phi(x)\) over \([0,3\pi]\)? We will come back to this inefficiency to motivate Importance sampling below.

2-D problems

To simplify problems and highlight key features, let’s consider the uniform distribution over some square \([a,b]^2\subset\mathbb{R}^2\), which is also the integration range, i.e., \(\mathcal{X} = \mathcal{S} = [a,b]^2\).

By the uniform distribution over \(\mathcal{S}\subset\mathbb{R}^d\), I mean the PDF of \(X\) is \[ f(x) = \begin{cases} \frac{1}{|\mathcal{S}|} & x\in\mathcal{S}\\ 0 & \text{otherwise} \end{cases} \] where \(|\mathcal{S}| = \int_\mathcal{S}1dx\) is the volume of \(\mathcal{S}\). It is a normalisation constant that makes the PDF integrate to \(1\). For the square \([a,b]^2\), for instance, \(|\mathcal{S}|=(b-a)^2\).

For a quick review, see Double Integrals over General Regions
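
As a small 2-D illustration (an example of my own choosing), consider integrating \(\phi(x) = x_1 x_2\) over \([0,1]^2\), whose exact value is \(1/4\). With \(X\sim U([0,1]^2)\) we have \(|\mathcal{S}|=1\) and \(f(x)=1\), so the estimator reduces to the sample average of \(\phi\).

n  <- 1e5
x1 <- runif(n)   # first coordinate
x2 <- runif(n)   # second coordinate
mean(x1*x2)      # should be close to 1/4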

Estimating \(\pi\)

A canonical problem is to use MC integration to estimate \(\pi\). To this end, define \(\phi(x) = \mathbb{1}_\mathcal{D}(x)\) where \[ \mathcal{D} = \{x\in\mathbb{R}^2 : \|x\|\leq1, x\geq0\} ,\] which is a quarter disk in the positive orthant whose volume (i.e., area) is \(\pi/4\). Using the support \(\mathcal{S} = \mathcal{X} = [0,1]^2\) with \(|\mathcal{S}|=1\), we have \[\begin{align} \frac{\pi}{4} &= \int_\mathcal{X}\phi(x)dx\\ &= \int_\mathcal{S} \mathbb{1}_\mathcal{D}(x)dx\\ &= \int_\mathcal{S} \frac{\mathbb{1}_\mathcal{D}(x)}{f(x)}f(x)dx\\ &= \int_\mathcal{S} |\mathcal{S}|\mathbb{1}_\mathcal{D}(x)\frac{1}{|\mathcal{S}|}dx\\ &= \mathbb{E}\left[|\mathcal{S}|\mathbb{1}_{\mathcal{D}}(X)\right]\\ &= \mathbb{E}\left[\mathbb{1}_{\mathcal{D}}(X)\right] \end{align}\]

The idea is to use MC integration to approximate the last expectation, which in turn tells us \(\pi\).

n <- 20000
x <- c()   # x-coordinates of accepted points (inside the quarter disk)
y <- c()   # y-coordinates of accepted points
for (i in 1:n) {
  xy <- runif(2)                  # draw a point uniformly from [0,1]^2
  if (sum(xy^2)<=1) {             # inside the quarter disk?
    x <- c(x, xy[1])
    y <- c(y, xy[2])
  }

  if (i%%1e3==0) {                # every 1000 draws, plot the progress
    plot(x, y, xlim=c(0,1), ylim=c(0,1), ann=FALSE, pch=19, cex=0.1)
    title(paste("n = ", i, ", ", "π = ", round(4*length(x)/i,5), sep=""))
  }
}
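
For the final estimate alone (without the intermediate plots), a vectorised sketch is simpler and faster:

xy <- matrix(runif(2*n), ncol=2)   # n points uniform on [0,1]^2
inside <- rowSums(xy^2) <= 1       # indicator of the quarter disk
4*mean(inside)                     # MC estimate of pi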