Introduction

Bayesian optimisation (BO) is a sequential sampling technique for global optimisation. It has two appealing features: the ability to handle complex objective functions and sample efficiency. The basic machinery of BO consists of:

  • a surrogate model, which summarises what the evaluations made so far tell us about the objective function, and
  • an acquisition function, which uses the surrogate to decide where to evaluate next.

These two components dynamically interact with each other to achieve sample-efficient global optimisation. Consequently, BO is particularly useful when function evaluations are expensive and gradient information is not available.

This sequential “sampling technique” is different from Monte Carlo sampling methods, which are concerned with how to draw samples from a given distribution. Here, some stochastic process is given, and we are assumed to have access to a sampler for that process at whatever input we choose; the question is which input from the index set to use. Imagine a noisy objective function modelled as a stochastic process: the problem is then where to evaluate the objective function.

Global optimisation

Recall from the optimisation notes, global optimisation can be formulated as solving the following problem \[\min_{x\in\mathcal{D}} f(x)\] where \(f\) is the objective function and \(\mathcal{D}\) is the choice set in which we try to find a minimiser \(x^*\) such that \(f(x^*)\leq f(x)\) for all \(x\in\mathcal{D}\).

In these notes, we use minimisation to formulate optimisation problems.

The fundamental challenge arises from our limited knowledge of the shape of \(f\). Unless \(f\) is fairly simple, such as a low-degree polynomial or a sinusoid, it is usually hard to get a global picture from the expression of \(f\), even in the 1-D case.

Most of the time, therefore, we evaluate \(f\) at a finite number of points and interpolate in between. Using the same test function as in the optimisation notes, for example, the following graph is produced by first evaluating \(f\) at 101 evenly spaced points from 0.26 to 3.6 and then connecting the dots. The global minimum is approximately \(-1.025\) at \(x=2.5\).

With this graph, we can easily pick the best one (the 68th point) among the 101 points as an approximate minimiser \(\hat{x}^*=2.498\), yielding \(f(\hat{x}^*)=-1.025\), which is essentially the global minimum.
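For concreteness, this grid search takes only a few lines of R. The sketch below assumes the test function from the optimisation notes is available as an R function f (the name f is ours, used as a stand-in).

xx <- seq(0.26, 3.6, length.out=101)   # 101 evenly spaced points
yy <- f(xx)                            # evaluate the objective on the grid (f assumed available)
plot(xx, yy, type="o", pch=20, cex=0.5, xlab="x", ylab="y")
i <- which.min(yy)                     # index of the best grid point
c(xhat=xx[i], fhat=yy[i])              # approximate minimiser and its value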

Compared to this simple yet effective approach, we notice something naïve in gradient descent: it does not utilise the information that can be extracted from visited/evaluated points. Instead, it ignores the history, uses only local information (i.e., the gradient), and blindly follows that direction at each step. As a result, if we use random initial points to find the global minimum in the above case, we must repeat the process at least five times (and that only if we are very lucky).

Surrogate models

Seeing those evaluated points as data, don’t you think the above interpolation is akin to statistical regression? In regression, we collect data \((X,\mathbf{y})\) and estimate the underlying data-generating process by fitting some models to the data. Above, using the evenly-spaced points \(X\) and the corresponding objective values \(\mathbf{y}=f(X)\), we essentially estimate the underlying objective function by connecting dots.

When we do regression, our models are more sophisticated than connecting data points (e.g., linear, polynomial, neural network, etc.). So, why not use statistical models for optimisation as well? It is reasonable to expect that suitable models could extract more information from evaluated points about the objective function and, in particular, about promising regions for minimisers, leading to more efficient search. Such models are commonly called surrogates.

Connecting dots can be seen as a special case of spline or local regression.

Gaussian process

Definition

A Gaussian process (GP) is a collection of random variables (i.e., a stochastic process) such that any finite sub-collection follows a multivariate normal distribution. The GP is the most popular surrogate model in Bayesian optimisation.

Remember that the Poisson process is a stochastic process specified by a rate \(\lambda\). For a counting process, we use the set of nonnegative real numbers \(\mathbb{R}_+\) as the index set, so that for any epoch \(t\in\mathbb{R}_+\) of interest, we have a counting random variable \(N(t)\sim\text{pois}(\lambda t)\).

You may also find GP classified as Bayesian nonparametric in statistics and machine learning.

To satisfy the above condition, a GP is specified by a mean function \(m\) and covariance function \(k\). We write \(Y\sim\mathcal{GP}(m,k)\) to denote that a stochastic process \(Y\) follows a GP specified by \(m\) and \(k\). As the names suggest, the roles of \(m\) and \(k\) are as follows. For inputs \(x\) and \(x'\) of the index set, \[\begin{align} m(x) &= \mathbb{E}[Y(x)]\\ k(x,x') &= \mathbb{E}[(Y(x)-m(x))(Y(x')-m(x'))]. \end{align}\]

Intuitively, \(k(x,x')\) measures the similarity between \(Y(x)\) and \(Y(x')\), typically as a decreasing function of the “distance” between \(x\) and \(x'\): if two inputs \(x\) and \(x'\) are close, the corresponding random variables \(Y(x)\) and \(Y(x')\) should be strongly correlated.

For concreteness, suppose the index set is the real line \(\mathbb{R}\). Then, if we are interested in a collection of two random variables at \(\mathbf{x}=(x_1,x_2)\), then a random vector \[Y = \begin{bmatrix} Y(x_1)\\ Y(x_2) \end{bmatrix} \] is jointly distributed as \(\mathrm{N}(\boldsymbol{\mu},\Sigma)\) where \[ \boldsymbol{\mu} = \begin{bmatrix} m(x_1)\\ m(x_2) \end{bmatrix} \quad\text{and}\quad \Sigma = \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2)\\ k(x_2,x_1) & k(x_2,x_2) \end{bmatrix}. \]

Remember the properties of the multivariate normal distribution: for the joint PDF to exist, \(\Sigma\) must be non-degenerate, so in practice we define \(k\) such that \(\Sigma\) is positive definite.
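As a quick numeric check (a minimal sketch; the two inputs 0.5 and 0.7 are arbitrary, and the kernel is the squared-exponential one used in the demo below), we can build the \(2\times 2\) covariance matrix and verify positive definiteness via the Cholesky factorisation.

k <- function(x1,x2) exp(-(x1-x2)^2)   # squared-exponential kernel (also used below)
x <- c(0.5, 0.7)                       # two arbitrary inputs from the index set
Sigma <- outer(x, x, k)                # 2×2 covariance matrix with entries k(x_i, x_j)
Sigma
chol(Sigma)                            # succeeds only because Sigma is positive definite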

Demo

Let’s see what GP realisations look like, using one of the most basic configurations: \(m(x) = 0\) and \(k(x,x') = \exp(-\|x-x'\|^2)\).

library(MASS)  # provides mvrnorm for sampling from a multivariate normal

k <- function(x1,x2) exp(-(x1-x2)^2)  # squared-exponential covariance function

xx <- seq(.26, 3.6, length.out=101)
K <- outer(xx, xx, k) # 101×101 covariance matrix
Y <- mvrnorm(3, rep(0, length(xx)), K)  # three draws from N(0, K); zero mean function
matplot(xx, t(Y), type="o", lwd=1.2, pch=20, cex=0.5, xlab="x", ylab="y")
title("Random functions from GP")

As clearly indicated in the above code, we drew three length-101 random vectors from the multivariate normal and plotted each set of 101 realisations separately. Since the points in each series are connected, they look like (random) functions.

GP regression

Suppose that a normal random vector is partitioned into \(Y_1\) and \(Y_2\) as follows: \[ \begin{bmatrix} Y_1\\ Y_2 \end{bmatrix} \sim \mathrm{N}(0,\Sigma), \] where \[\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} . \] Then, the conditional distribution of \(Y_1\) given \(Y_2=\mathbf{y_2}\) is also normal \[Y_1|\mathbf{y_2} \sim \mathrm{N}(\bar{\boldsymbol{\mu}},\bar{\Sigma}),\] where \[ \bar{\boldsymbol{\mu}} = \Sigma_{12}\Sigma_{22}^{-1}\mathbf{y_2} \quad\text{and}\quad \bar{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}. \]
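As a minimal numeric sketch of these formulas in the bivariate case (the covariance entries and the observed value below are arbitrary, chosen only for illustration):

Sigma <- matrix(c(1.0, 0.6,
                  0.6, 1.0), nrow=2)                   # arbitrary positive-definite covariance
y2 <- 0.8                                              # observed value of Y_2
mu_bar    <- Sigma[1,2] / Sigma[2,2] * y2              # conditional mean of Y_1 given Y_2 = y2
Sigma_bar <- Sigma[1,1] - Sigma[1,2]^2 / Sigma[2,2]    # conditional variance of Y_1
c(mean=mu_bar, var=Sigma_bar)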

Based on the above property of the multivariate normal distribution, in GP regression we predict the distribution of \(Y(x)\) at location \(x\) using \(n\) data points \(D_n = (X_n,\mathbf{y}_n) = \{(x_i,y_i)\}_{i=1}^n\). That is, first, the joint distribution of \(Y(x)\) and \(Y_n\) is \[ \begin{bmatrix} Y(x)\\ Y_n \end{bmatrix} \sim \mathrm{N}(0,K), \] where \[K = \begin{bmatrix} k(x,x) & \mathbf{k}(x,X_n)\\ \mathbf{k}(X_n,x) & \mathbf{k}(X_n,X_n) \end{bmatrix}, \] and \(\mathbf{k}\) denotes a matrix of appropriate size whose elements are obtained by applying \(k\) to each pair of inputs.

\(\mathbf{k}(x,X_n)\) is \(1\times n\), \(\mathbf{k}(X_n,x)\) is \(n\times 1\), and \(\mathbf{k}(X_n,X_n)\) is \(n\times n\).

Then, the conditional distribution of \(Y(x)\) given \(Y_n=\mathbf{y}_n\) sampled at \(X_n\) is \[Y(x)|D_n \sim \mathrm{N}(\hat{\mu},\hat{K}),\] where \[\begin{align} \hat{\mu}(x) &= \mathbf{k}(x,X_n)\mathbf{k}(X_n,X_n)^{-1}\mathbf{y}_n\\ \hat{K}(x) &= k(x,x) - \mathbf{k}(x,X_n) \mathbf{k}(X_n,X_n)^{-1} \mathbf{k}(X_n,x). \end{align}\]

Observe that both \(\hat{\mu}\) and \(\hat{K}\) are functions of \(x\). So, we may treat \(\hat{\mu}\) as prediction of \(Y\) at \(x\) (i.e., a regression function) and \(\hat{K}\) as prediction variance.
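These formulas translate directly into a few lines of R. The following is a minimal sketch (the function gp_posterior and its interface are ours, not from any library), assuming a zero mean function, noise-free observations, and a covariance function k that is vectorised over its arguments.

gp_posterior <- function(x, Xn, yn, k) {
  Knn <- outer(Xn, Xn, k)                                # k(X_n, X_n), n×n
  Kxn <- outer(x, Xn, k)                                 # k(x, X_n), one row per prediction point
  mu  <- as.vector(Kxn %*% solve(Knn, yn))               # posterior mean \hat{mu}(x)
  s2  <- k(x, x) - rowSums(Kxn * t(solve(Knn, t(Kxn))))  # posterior variance \hat{K}(x)
  list(mean=mu, var=s2)
}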

Demo

Suppose we have evaluated the above objective function at 15 evenly spaced points from 0.26 to 3.5. The following graph traces \(\hat{\mu}\) computed conditional on these data points.
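A minimal sketch of how such a figure could be produced, reusing the hypothetical gp_posterior helper above and again assuming the test function is available as f:

Xn <- seq(0.26, 3.5, length.out=15)    # 15 evenly spaced design points
yn <- f(Xn)                            # objective values (f assumed available)
xx <- seq(0.26, 3.6, length.out=101)   # prediction grid
post <- gp_posterior(xx, Xn, yn, k)    # k as defined in the earlier demo
plot(xx, post$mean, type="l", xlab="x", ylab="y")  # trace of the posterior mean
points(Xn, yn, pch=20)                 # the 15 evaluated points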

Under this specific setting, with only 15 samples, we can estimate the function remarkably well.

Acquisition function

Efficient sequential sampling

BO is often used when function evaluations are expensive; otherwise, we could simply search over a fine grid, just as we did at the beginning. In these cases, then, we want to locate a good minimiser within an evaluation budget, or simply as quickly as possible.

A motivating story is oil drilling. Before full production takes place, a company has to discover a good spot through test drilling. Suppose drilling a single hole and examining the potential oil content under the ground costs millions of dollars. The company therefore wants to identify a good location while conducting as few tests as possible. We may model the problem as follows:

  • Index set. A testing site as a subset of \(\mathbb{R}^2\).
  • Objective function. A probability that a location \((x,y)\) contains a profitable amount of oil.
  • Evaluation of the objective. Drilling.

To understand the task of efficient sampling, let’s sequentially plot both the estimated mean function and a 2-standard-deviation band for increasing sample sizes.
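A rough sketch of how each frame of such a sequence could be generated, reusing the hypothetical f, k, xx, and gp_posterior from above and increasing the sample size n step by step:

n    <- 5                                      # try n = 5, 10, 15, ...
Xn   <- seq(0.26, 3.5, length.out=n)
yn   <- f(Xn)
post <- gp_posterior(xx, Xn, yn, k)
band <- 2 * sqrt(pmax(post$var, 0))            # guard against tiny negative variances
plot(xx, post$mean, type="l", xlab="x", ylab="y",
     ylim=range(post$mean - band, post$mean + band))
lines(xx, post$mean + band, lty=2)             # upper edge of the 2-sd band
lines(xx, post$mean - band, lty=2)             # lower edge of the 2-sd band
points(Xn, yn, pch=20)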