Layer normalisation (LN) is a variant of batch normalisation that behaves like BN applied to a minibatch of size 1: the statistics are computed over the features of each example rather than over the batch. This works well for convolutional neural networks, because they operate on data with a grid topology, so LN can still average across all locations of the grid within a single example.
We define LN as:
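$$
\mathrm{LN}(x) = \frac{x - \mu}{\sigma}
$$

where $x \in \mathbb{R}^d$ is the vector of activations of a single example (the learnable gain and bias that are often added on top of this are omitted here).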
And we define the mean as:
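$$
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
$$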
And the standard deviation as (with an offset to prevent division by 0):
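$$
\sigma = \sqrt{\frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2 + \epsilon}
$$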
We use LN because it helps prevent the model from diverging: the output of LN is independent of the scale of its input. It also does not depend on the minibatch size, or on whether we are training or testing.
It’s just a transformation that standardises the activations to a given scale.
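To make this concrete, here is a minimal NumPy sketch of the transformation above (the function name `layer_norm` and the `eps` value are illustrative choices, not from the text); note how scaling the input leaves the output essentially unchanged:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standardise the activations of a single example (no learnable gain/bias)."""
    mu = x.mean()                                   # mean over this example's features
    sigma = np.sqrt(((x - mu) ** 2).mean() + eps)   # std with an epsilon offset
    return (x - mu) / sigma

x = np.array([1.0, 2.0, 4.0, 8.0])
print(layer_norm(x))         # standardised activations
print(layer_norm(10.0 * x))  # (almost) identical output: LN is scale-independent
```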