Computation of the gradient is often a non-trivial task in modern optimisation problems.

Recall the gradient descent algorithm for minimising an objective function \(f:\mathbb{R}^d\to\mathbb{R}\): \[ x_{n+1} = x_n - \alpha \nabla f(x_n).\] To implement the algorithm, we need to compute the gradient \(\nabla f(x_n)\) every time we take a step \(\alpha \nabla f(x_n)\) from \(x_n\) to \(x_{n+1}\).
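The update rule above can be sketched in R as follows. The objective, its gradient, the step size, and the starting point here are illustrative choices, not fixed by the text:

```r
# Gradient descent on the illustrative objective f(x) = x1^2 + x2^2,
# whose gradient is 2 * x; any differentiable f would do.
grad_f <- function(x) 2 * x

alpha <- 0.1    # step size
x <- c(3, -4)   # initial point x0
for (n in 1:100) {
  x <- x - alpha * grad_f(x)  # x_{n+1} = x_n - alpha * grad f(x_n)
}
x  # close to the minimiser c(0, 0)
```

Each iteration shrinks `x` by the factor `1 - 2 * alpha`, so after 100 steps the iterate is numerically indistinguishable from the minimiser.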

Mathematically, the \(i\)-th component of the gradient is simply the partial derivative \(\frac{\partial f}{\partial x_i}(x_n)\) for \(i\in\{1,\dots,d\}\). As long as we have these partial derivative functions implemented in R, we might just call each of the \(d\) functions at every step \(n\in\{1,2,\dots,N\}\), and we reach a local minimum after \(N\) steps. In practice, this naïve idea can easily become **infeasible**.

First, the required number of steps \(N\) depends on the model, the step size \(\alpha\), and the quality of the initial point \(x_0\). Even in simple 2-D problems \((d=2)\), \(N\) can be in the thousands.

Second, in modern optimisation problems, it is often the case that \(d\) is quite large, and \(\nabla f\) comprises \(d\) complicated functions. Even a single evaluation of such a \(\nabla f\) may take non-trivial time.

To get a sense of large-scale problems, repeat a simple operation `sum <- sum + (-1)^i` a billion times (`i in 1:1e9`).
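A minimal sketch of that experiment in R, with the iteration count reduced so it finishes quickly (scale `n` up towards `1e9` to feel the cost, and rename `s` back to `sum` at your own risk, since `sum` shadows the base function):

```r
n <- 1e6  # the text suggests 1e9; a smaller n keeps this run short
s <- 0
for (i in 1:n) {
  s <- s + (-1)^i  # alternating -1, +1; consecutive terms cancel
}
s  # 0 when n is even
```

Even this trivial update, repeated a billion times in a plain R loop, takes on the order of minutes; a gradient evaluation at each step of a large problem is far more expensive than `(-1)^i`.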

In statistics and machine learning, optimisation problems often arise from estimation of model parameters. For example, in deep learning, it is common to have billions of parameters \((d>10^9)\) that define a deep neural network. Estimation of those parameters takes days or even weeks.

In a regression problem, we often have the following minimisation problem: \[\min_{w\in\mathbb{R}^d} \mathcal{C}(w) := \min_{w\in\mathbb{R}^d} \sum_{j=1}^m(y_j-f(x_j;w))^2,\] where \(f\) is a statistical model parameterised by \(w\), and \(\{(x_j,y_j)\}_{j=1}^m\) is data to which we fit \(f\) by solving the above minimisation problem.

To use the gradient descent \[ w_{n+1} = w_n - \alpha \nabla \mathcal{C}(w_n),\] we need to compute \(\nabla \mathcal{C}(w_n)\) at every step \(n\in\{1,2,\dots,N\}\).
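The regression version of the algorithm can be sketched in R. The linear model \(f(x;w) = w_1 + w_2 x\), the synthetic data, the step size, and the iteration count below are illustrative assumptions, not choices made by the text:

```r
# Least-squares regression by gradient descent, with the simplest
# model f(x; w) = w[1] + w[2] * x (an illustrative choice of f).
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.1)  # synthetic data (m = 100)

grad_C <- function(w) {
  r <- y - (w[1] + w[2] * x)   # residuals y_j - f(x_j; w)
  -2 * c(sum(r), sum(r * x))   # partials of C w.r.t. w1 and w2
}

alpha <- 1e-3  # step size
w <- c(0, 0)   # initial point w0
for (n in 1:5000) {
  w <- w - alpha * grad_C(w)  # w_{n+1} = w_n - alpha * grad C(w_n)
}
w  # near the true parameters c(2, 3)
```

Here the gradient is available in closed form because \(f\) is linear in \(w\); for a deep neural network, each component of \(\nabla\mathcal{C}\) is a far more complicated function, which is exactly the difficulty discussed next.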

Let's use a simple neural network to understand common challenges of computing the gradient and a workaround. In standard regression problems, a neural network is a real-valued function \(f:\mathbb{R}^k\to\mathbb{R}\) **composed in a special way**. The composition is typically illustrated using a network diagram as below.