In deep learning, two core problems many model architectures face are the vanishing gradient problem and the exploding gradient problem. The core idea: gradients are backpropagated through the model, and because of numerical instabilities they either grow exponentially large (explode) or shrink exponentially towards zero (vanish) as they travel back through the layers.

This is a fundamental architectural problem: when gradients vanish or explode, the earlier layers of the model stop receiving a usable learning signal.

Problems: very deep models often fail to train because of exploding and vanishing gradients. Backpropagation multiplies the gradient by one factor per layer, so depth means a long chain of multiplications: a product of many factors slightly greater than 1 blows up exponentially, and a product of many factors slightly less than 1 collapses to 0. A small numeric sketch of this follows below.
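A minimal numpy sketch of the effect (the depth, dimension, and weight scales are illustrative, not tuned values): pushing a gradient back through a stack of random linear layers shrinks it towards 0 when the weights are slightly too small, and blows it up when they are slightly too large.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_after(depth, scale, dim=64):
    """Push a gradient back through `depth` random linear layers
    whose weights are drawn with standard deviation `scale`."""
    grad = np.ones(dim)
    for _ in range(depth):
        W = rng.normal(0.0, scale, size=(dim, dim))
        grad = W.T @ grad  # backprop through one linear layer
    return np.linalg.norm(grad)

# Each step multiplies the norm by roughly scale * sqrt(dim),
# so anything away from 1 compounds exponentially over 50 layers.
print(gradient_norm_after(depth=50, scale=0.05))  # vanishes towards 0
print(gradient_norm_after(depth=50, scale=0.5))   # explodes to a huge value
```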

Mitigation steps

Some key ways we can avoid exploding/vanishing gradients are:

  • Parameter initialisation: carefully choosing initial values for the network’s weights can determine how deep the model can go; schemes such as Xavier/Glorot and He initialisation scale the initial weights to the layer size (see the first sketch after this list).
  • Activation function: different activation functions propagate gradients differently. For instance, the sigmoid effectively sets the gradient to 0 if the input is outside a narrow range around 0; ReLU is intended as a non-linear activation that avoids this (second sketch below).
  • Normalisations: prevent explosions by keeping intermediate values in a stable range, e.g. with batch normalisation and layer normalisation (last sketch below).
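A sketch of the initialisation point, assuming a plain PyTorch MLP (`make_mlp` and the layer sizes are illustrative): He/Kaiming initialisation matches each weight matrix’s variance to the layer’s fan-in, which keeps gradient magnitudes roughly constant across ReLU layers.

```python
import torch.nn as nn

def make_mlp(sizes):
    """Build an MLP whose linear layers use He (Kaiming) initialisation,
    the scheme matched to ReLU activations."""
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        linear = nn.Linear(fan_in, fan_out)
        nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the ReLU after the output layer

model = make_mlp([128, 256, 256, 10])
```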
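A small check of the activation point, again in PyTorch: the sigmoid’s gradient collapses for inputs away from 0, while ReLU passes the gradient through unchanged wherever the input is positive.

```python
import torch

x = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)

# Sigmoid: gradient is near 0 outside a narrow band around 0
torch.sigmoid(x).sum().backward()
print(x.grad)  # ~[0.0000, 0.2500, 0.0000]

x.grad = None
# ReLU: gradient is exactly 1 for positive inputs, 0 elsewhere
torch.relu(x).sum().backward()
print(x.grad)  # [0., 0., 1.]
```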
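And a minimal sketch of the normalisation point (the batch size, feature count, and input scale are illustrative): BatchNorm1d normalises each feature across the batch, LayerNorm normalises each sample across its features, and both pull wildly scaled activations back to roughly unit scale.

```python
import torch
import torch.nn as nn

x = 1000.0 * torch.randn(32, 64)  # intermediate activations with a huge scale

bn = nn.BatchNorm1d(64)  # normalises each feature over the batch
ln = nn.LayerNorm(64)    # normalises each sample over its features

print(x.std().item())      # ~1000
print(bn(x).std().item())  # ~1
print(ln(x).std().item())  # ~1
```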