In deep learning, two core problems many model architectures face are the vanishing gradient problem and the exploding gradient problem. The core idea: gradients are backpropagated through the model, and because of numerical instabilities they either grow exponentially large (explode) or shrink exponentially towards zero (vanish) as they travel back through the layers.

This is a fundamental architectural problem: when gradients vanish or explode, the earlier layers of the model stop receiving a usable learning signal.

Problems: very deep models often fail to train because of exploding and vanishing gradients. Backpropagation multiplies the gradient by one factor per layer, so depth means a long chain of multiplications: a product of many factors slightly greater than 1 blows up exponentially, and a product of many factors slightly less than 1 collapses to 0. A small numeric sketch of this follows below.
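A minimal numpy sketch of the effect (the depth, dimension, and weight scales are illustrative, not tuned values): pushing a gradient back through a stack of random linear layers shrinks it towards 0 when the weights are slightly too small, and blows it up when they are slightly too large.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_after(depth, scale, dim=64):
    """Push a gradient back through `depth` random linear layers
    whose weights are drawn with standard deviation `scale`."""
    grad = np.ones(dim)
    for _ in range(depth):
        W = rng.normal(0.0, scale, size=(dim, dim))
        grad = W.T @ grad  # backprop through one linear layer
    return np.linalg.norm(grad)

# Each step multiplies the norm by roughly scale * sqrt(dim),
# so anything away from 1 compounds exponentially over 50 layers.
print(gradient_norm_after(depth=50, scale=0.05))  # vanishes towards 0
print(gradient_norm_after(depth=50, scale=0.5))   # explodes to a huge value
```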

Mitigation steps

Some key ways we can avoid exploding/vanishing gradients are:

  • Parameter initialisation: carefully choosing initial values for the network’s weights can determine how deep the model can go; schemes such as Xavier/Glorot and He initialisation scale the initial weights to the layer size (see the first sketch after this list).
  • Activation function: different activation functions propagate gradients differently. For instance, the sigmoid effectively sets the gradient to 0 if the input is outside a narrow range around 0; ReLU is intended as a non-linear activation that avoids this (second sketch below).
  • Normalisations: prevent explosions by keeping intermediate values in a stable range, e.g. with batch normalisation and layer normalisation (last sketch below).
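A sketch of the initialisation point, assuming a plain PyTorch MLP (`make_mlp` and the layer sizes are illustrative): He/Kaiming initialisation matches each weight matrix’s variance to the layer’s fan-in, which keeps gradient magnitudes roughly constant across ReLU layers.

```python
import torch.nn as nn

def make_mlp(sizes):
    """Build an MLP whose linear layers use He (Kaiming) initialisation,
    the scheme matched to ReLU activations."""
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        linear = nn.Linear(fan_in, fan_out)
        nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the ReLU after the output layer

model = make_mlp([128, 256, 256, 10])
```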
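A small check of the activation point, again in PyTorch: the sigmoid’s gradient collapses for inputs away from 0, while ReLU passes the gradient through unchanged wherever the input is positive.

```python
import torch

x = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)

# Sigmoid: gradient is near 0 outside a narrow band around 0
torch.sigmoid(x).sum().backward()
print(x.grad)  # ~[0.0000, 0.2500, 0.0000]

x.grad = None
# ReLU: gradient is exactly 1 for positive inputs, 0 elsewhere
torch.relu(x).sum().backward()
print(x.grad)  # [0., 0., 1.]
```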
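And a minimal sketch of the normalisation point (the batch size, feature count, and input scale are illustrative): BatchNorm1d normalises each feature across the batch, LayerNorm normalises each sample across its features, and both pull wildly scaled activations back to roughly unit scale.

```python
import torch
import torch.nn as nn

x = 1000.0 * torch.randn(32, 64)  # intermediate activations with a huge scale

bn = nn.BatchNorm1d(64)  # normalises each feature over the batch
ln = nn.LayerNorm(64)    # normalises each sample over its features

print(x.std().item())      # ~1000
print(bn(x).std().item())  # ~1
print(ln(x).std().item())  # ~1
```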