Gradient descent is a general technique used in many algorithms. It's premised on the idea that the derivative of a function tells us how to minimise or maximise it. For a small $\epsilon$, we approximately have:

$$f(x + \epsilon) \approx f(x) + \epsilon f'(x)$$
The implication of this is that we can reduce $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative. Gradient descent converges when every element of the gradient is 0. Note that gradient descent is greedy by nature, so it will likely find a local optimum rather than a global one.
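As a quick worked instance of this (numbers chosen purely for illustration): take

$$f(x) = x^2, \qquad f'(x) = 2x.$$

At $x = 3$ the derivative is $f'(3) = 6 > 0$, so we take a small step in the opposite direction, say of size $0.1$:

$$x' = 3 - 0.1 \cdot 6 = 2.4, \qquad f(2.4) = 5.76 < f(3) = 9.$$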

In the multivariable case, we rely on the directional derivative to minimise the function, since it gives the “direction” of the rate of change; stepping against the gradient gives a new point:

$$x' = x - \epsilon \nabla_x f(x)$$
We define $\epsilon$ (sometimes also denoted $\eta$) as the learning rate, which is usually a small constant in practical applications.
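A minimal sketch of this update rule in Python (the quadratic objective, starting point, and step count below are illustrative assumptions, not from the text):

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative objective f(x) = ||x||^2.
    return 2 * x

x = np.array([3.0, -4.0])  # arbitrary starting point
epsilon = 0.1              # learning rate

for _ in range(100):
    x = x - epsilon * grad_f(x)  # x' = x - epsilon * grad f(x)

print(x)  # ends up very close to the minimiser [0, 0]
```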

Some different variants of gradient descent:

  • Line search evaluates multiple values of $\epsilon$ and chooses the one that results in the smallest objective.
  • Hill climbing is gradient descent/ascent in a discrete space.
    • Simulated annealing is a variant of stochastic hill climbing, where bad perturbations are accepted with some probability (see the sketch after this list).
  • Stochastic gradient descent estimates the gradient from a single randomly chosen example (or a small random subset) of the training data rather than the full set.
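A rough sketch of the simulated annealing acceptance rule mentioned above (the objective, perturbation, and cooling schedule are all assumptions made up for illustration):

```python
import math
import random

def objective(x):
    # Illustrative 1-D objective with several local minima.
    return x * x + 10 * math.sin(x)

x = 5.0          # current solution
temp = 10.0      # initial temperature
cooling = 0.99   # geometric cooling schedule

for _ in range(1000):
    candidate = x + random.uniform(-1, 1)  # random perturbation
    delta = objective(candidate) - objective(x)
    # Always accept improvements; accept bad perturbations with
    # probability exp(-delta / temp), which shrinks as temp cools.
    if delta < 0 or random.random() < math.exp(-delta / temp):
        x = candidate
    temp *= cooling

print(x, objective(x))
```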

Machine learning

In machine learning models, we minimise the cost function in small steps controlled by the learning rate. In many cases, we want to reach a point where the gradient vector is close to zero.

Basically, if we adjust the weights according to the slope of the cost function, we end up at the minimum (or, for ascent, the maximum) error. We use this equation to adjust each weight:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$

where $\eta$ is the learning rate (or step size) and $L$ is the loss. The partial derivative $\partial L / \partial w$ essentially comes from backprop.
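A minimal sketch of this weight update for a one-weight linear model (the data and model are made up for illustration, and the gradient is written out by hand rather than obtained from an autodiff library):

```python
import numpy as np

# Toy data: y = 3x plus a little noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 0.1 * rng.normal(size=100)

w = 0.0    # single weight
eta = 0.1  # learning rate

for _ in range(200):
    y_hat = w * x
    # Squared-error loss L = mean((y_hat - y)^2), so dL/dw = mean(2 * (y_hat - y) * x).
    grad = np.mean(2 * (y_hat - y) * x)
    w = w - eta * grad  # w <- w - eta * dL/dw

print(w)  # approximately 3
```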

Because deep neural networks have millions or billions of parameters, gradient descent takes place in a space with millions (or billions) of dimensions.

Batching

We can also apply “batching” instead of working with one sample at a time. In this process, we do the following steps:

  • Use our network to make predictions for a batch of samples.
  • Compute the average loss over those samples.
  • Take a “step” to optimise that average loss.

Some parameters of this process include the “batch size” (the number of training examples used per update), the number of “epochs” (one epoch is a full pass in which all the training data is used once to update the parameters, including the weights), and the number of “iterations” (one iteration is a single update of all the parameters, using one batch).

For example, if there are 1000 training samples and we set batch size to 10, then 1 epoch will have 100 iterations.
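A sketch of that loop in Python, using the same numbers as the example above (the data, model, and loss are placeholders I've assumed for illustration):

```python
import numpy as np

n_samples, batch_size, n_epochs = 1000, 10, 3
iters_per_epoch = n_samples // batch_size  # 100 iterations per epoch, as above

X = np.random.randn(n_samples, 4)  # placeholder inputs
y = np.random.randn(n_samples)     # placeholder targets
w = np.zeros(4)                    # linear model weights
eta = 0.01

for epoch in range(n_epochs):
    order = np.random.permutation(n_samples)  # reshuffle each epoch
    for i in range(iters_per_epoch):          # one iteration = one update
        idx = order[i * batch_size:(i + 1) * batch_size]
        xb, yb = X[idx], y[idx]
        residual = xb @ w - yb
        grad = 2 * (xb.T @ residual) / batch_size  # gradient of the average loss
        w -= eta * grad
```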

A key problem: how do we pick the batch size?

  • If the batch size is too small, we optimise a possibly very different loss function at each iteration. This introduces noise.
  • If the batch size is too large, each step is expensive, and the estimate of the average loss improves very little as the batch size grows further.

Mathematically

In a multidimensional space, many points where the gradient evaluates to zero are saddle points, where there is a local maximum wrt one dimension and a local minimum wrt another. Sometimes the search space can also include plateaus. These issues can be mitigated with some specialised variants of gradient descent.
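A standard concrete example of a saddle point (not from the text): the function

$$f(x, y) = x^2 - y^2, \qquad \nabla f(x, y) = (2x,\; -2y)$$

has zero gradient at the origin, where $f$ has a local minimum along the $x$-axis but a local maximum along the $y$-axis, so $(0, 0)$ is a saddle point rather than an optimum.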

One problem with the gradient is that its time complexity for each iteration is $O(m)$, where $m$ is the number of training examples, which can be prohibitively expensive for larger training data sets. This motivates the idea of stochastic gradient descent as a way of reducing computational cost.
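A minimal sketch of that idea (same placeholder linear model as in the batching sketch; the point is that each update touches only $b \ll m$ examples, so the per-iteration cost drops from $O(m)$ to $O(b)$):

```python
import numpy as np

m, b = 100_000, 32          # dataset size vs. minibatch size
X = np.random.randn(m, 4)   # placeholder inputs
y = np.random.randn(m)      # placeholder targets
w = np.zeros(4)
eta = 0.01

for _ in range(1000):
    idx = np.random.randint(0, m, size=b)  # sample a random minibatch
    xb, yb = X[idx], y[idx]
    residual = xb @ w - yb
    grad = 2 * (xb.T @ residual) / b       # O(b) per iteration instead of O(m)
    w -= eta * grad
```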