Layer normalisation (LN) is a variant of batch normalisation (BN) that behaves much like BN applied to a minibatch of size 1: instead of averaging over the batch, it averages over the units of the layer for each example separately. This still works well for convolutional neural networks, which operate on data with a grid topology, because the statistics can be averaged across all locations of the grid even though only a single example is used.
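
Concretely, the two differ only in which axes the statistics are averaged over. Here is a minimal NumPy sketch (the tensor shape and variable names are illustrative choices, not from the original text) contrasting the axes BN and LN average over for a convolutional feature map:

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)  # (batch, channels, height, width)

# Batch norm: one mean/std per channel, averaged over the batch and the grid.
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, 16, 1, 1)
bn_std = x.std(axis=(0, 2, 3), keepdims=True)

# Layer norm: one mean/std per example, averaged over channels and the grid,
# so the statistics are still well defined when there is only one example.
ln_mean = x.mean(axis=(1, 2, 3), keepdims=True)  # shape (8, 1, 1, 1)
ln_std = x.std(axis=(1, 2, 3), keepdims=True)
```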

We define LN as:

$$\mathrm{LN}(x) = \frac{x - \mu}{\sigma}$$

where $x = (x_1, \dots, x_H)$ are the activations of a layer with $H$ units for a single example.

And we define the mean as:

$$\mu = \frac{1}{H} \sum_{i=1}^{H} x_i$$

And the standard deviation as (with a small offset $\epsilon$ to prevent division by 0):

$$\sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2 + \epsilon}$$

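Putting the three definitions together, here is a minimal sketch in NumPy (the function name and the default eps value are my assumptions, and the learned elementwise gain and bias that practical LN implementations add on top are omitted for clarity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # x holds the activations of one layer for a single example.
    mu = x.mean()                                  # the mean defined above
    sigma = np.sqrt(((x - mu) ** 2).mean() + eps)  # the std, with the offset eps
    return (x - mu) / sigma                        # LN(x)
```
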
We use LN because it helps prevent the model from diverging: the output of LN is invariant to rescaling of its inputs. It also does not depend on the minibatch size, nor on whether we are training or testing (unlike BN, which needs running statistics at test time).
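
As a quick check of the scale-independence claim (a self-contained toy example; the helper ln and the factor 1000 are arbitrary choices of mine), rescaling every input by a constant leaves the output essentially unchanged, up to the small epsilon offset:

```python
import numpy as np

eps = 1e-5
x = np.random.randn(64)

def ln(v):
    # Normalise one example to zero mean and (roughly) unit standard deviation.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

# The two outputs agree up to a tiny error introduced by eps, so blowing up
# the incoming activations does not blow up the layer's output.
print(np.abs(ln(x) - ln(1000.0 * x)).max())  # close to 0
```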

It’s just a transformation that standardises the activations of each layer to zero mean and unit variance.