In deep learning, two core problems many model architectures face are the vanishing gradient problem and the exploding gradient problem. The core idea: gradients are backpropagated through the model layer by layer, and along the way numerical instabilities can make their magnitudes blow up (explode) or shrink towards zero (vanish).
This is a major architectural problem, and it bites hardest in very deep models: backpropagation multiplies together one factor per layer (weights and activation derivatives), so a factor consistently a bit below 1 shrinks the gradient exponentially with depth, while a factor a bit above 1 grows it exponentially. A quick numerical sketch of this is shown below.
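As a minimal sketch (NumPy-based, not from the original text; the depth, width, and weight scales are illustrative), the loop below pushes a gradient-like vector back through 50 random linear layers. With weights scaled slightly small the norm collapses towards zero; scaled slightly large it blows up.

```python
import numpy as np

# Toy illustration: push a gradient-like vector back through many layers
# by repeated multiplication with random weight matrices.
rng = np.random.default_rng(0)
depth, width = 50, 64

for scale in (0.5, 1.5):  # per-layer factor below 1 vanishes, above 1 explodes
    grad = np.ones(width)
    for _ in range(depth):
        W = rng.normal(0.0, scale / np.sqrt(width), size=(width, width))
        grad = W.T @ grad  # one backprop step through a linear layer
    print(f"scale={scale}: gradient norm after {depth} layers = {np.linalg.norm(grad):.3e}")
```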
Mitigation steps
Some key ways to avoid exploding/vanishing gradients are:
- Parameter initialisation: carefully choosing the initial values of the network’s weights (e.g. Xavier/Glorot or Kaiming/He initialisation) keeps gradient magnitudes roughly stable and can determine how deep the model can go (see the combined sketch after this list).
- Activation function: different activation functions propagate gradients differently. For instance, the sigmoid effectively sets the gradient to 0 if its input is outside a narrow range around 0; ReLU is intended as a non-linear activation that avoids this saturation (compared numerically after this list).
- Normalisation: batch normalisation and layer normalisation keep intermediate activations in a stable range, which helps prevent gradients from exploding or vanishing.
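To make the activation point concrete, here is a small sketch (PyTorch assumed; the input value 5.0 is just an illustrative point deep in the sigmoid’s saturated region) comparing the gradient that flows through a sigmoid versus a ReLU:

```python
import torch

# Compare gradient magnitudes of sigmoid and ReLU at a large input,
# where sigmoid saturates and ReLU does not.
x = torch.tensor([5.0], requires_grad=True)
torch.sigmoid(x).backward()
print(f"sigmoid grad at 5.0: {x.grad.item():.6f}")  # ~0.0066, nearly vanished

x = torch.tensor([5.0], requires_grad=True)
torch.relu(x).backward()
print(f"ReLU grad at 5.0:    {x.grad.item():.6f}")  # 1.0
```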
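Putting the three mitigations together, the following is a minimal sketch (PyTorch assumed; the width and the specific choices of Kaiming initialisation and LayerNorm are illustrative, not prescribed above) of a hidden block that combines careful initialisation, a non-saturating activation, and normalisation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One hidden block combining the three mitigations listed above."""

    def __init__(self, width: int = 256):
        super().__init__()
        self.linear = nn.Linear(width, width)
        # Parameter initialisation: Kaiming (He) init is matched to ReLU.
        nn.init.kaiming_normal_(self.linear.weight, nonlinearity="relu")
        nn.init.zeros_(self.linear.bias)
        # Normalisation: layer norm keeps activations in a stable range.
        self.norm = nn.LayerNorm(width)
        # Activation function: ReLU does not saturate for positive inputs.
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.norm(self.linear(x)))

x = torch.randn(8, 256)
print(Block()(x).shape)  # torch.Size([8, 256])
```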