Batch normalisation is a popular and effective technique for accelerating the convergence of deep neural networks. The premise of batch normalisation is to rescale the inputs of a layer so that they have zero mean and unit variance across the minibatch. It is applied to individual layers (or optionally to all of them) and can enable the training of deeper models (with more layers).
In each iteration, the inputs are normalised by subtracting their minibatch mean and dividing by their minibatch standard deviation. We also multiply by a scale parameter $\gamma$ and add a shift parameter $\beta$. For a minibatch $\mathcal{B}$ and an input $x \in \mathcal{B}$, we define BN mathematically as:

$$\mathrm{BN}(x) = \gamma \odot \frac{x - \hat{\mu}_\mathcal{B}}{\hat{\sigma}_\mathcal{B}} + \beta$$

where $\hat{\mu}_\mathcal{B}$ and $\hat{\sigma}_\mathcal{B}$ are the sample mean and sample standard deviation of the minibatch $\mathcal{B}$.
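As a concrete illustration, here is a minimal NumPy sketch of this training-time computation. The small constant `eps` added to the variance for numerical stability is our assumption, following common practice, and the shapes are illustrative:

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Training-time batch normalisation for a 2-D input.

    X:     minibatch of shape (batch_size, num_features)
    gamma: learned scale, shape (num_features,)
    beta:  learned shift, shape (num_features,)
    """
    mu = X.mean(axis=0)                     # per-feature minibatch mean
    var = X.var(axis=0)                     # per-feature minibatch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * X_hat + beta             # rescale and shift

X = np.random.randn(32, 4)                  # hypothetical minibatch of 32 examples
out = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))    # roughly 0 and 1 per feature
```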
Note: the parameters $\gamma$ and $\beta$ are distinct from the parameters of the rest of the model. They are specific to the batch normalisation layer, and are learned jointly with the other model parameters during training.
The minibatches in BN are subsets of the data points that we sample at random.
In model architectures
In fully-connected networks, we apply BN right after the affine transformation and before the activation function. Note that we cannot apply BN with a minibatch of size 1: each hidden unit would become 0 after subtracting the mean, so the network would not learn anything.
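As a sketch of this placement, the following hypothetical PyTorch block inserts BN between the affine transformation and the activation (the layer sizes are illustrative assumptions):

```python
import torch
from torch import nn

# Hypothetical fully-connected block: affine transformation -> BN -> activation.
net = nn.Sequential(
    nn.Linear(20, 64),      # affine transformation
    nn.BatchNorm1d(64),     # normalise each of the 64 features over the minibatch
    nn.ReLU(),              # activation applied to the normalised values
    nn.Linear(64, 10),
)

X = torch.randn(32, 20)     # minibatch of 32; size 1 would fail in training mode
print(net(X).shape)         # torch.Size([32, 10])
```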
In convolutional neural networks, we apply BN right after the convolution but before the non-linear activation function. The difference from a fully-connected model is that we apply BN on a per-channel basis, across all spatial locations and all examples in the minibatch (because CNNs are built on translation invariance, the specific location of a pattern is not critical, so every location within a channel shares the same scale and shift parameters).
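A corresponding hypothetical PyTorch sketch for the convolutional case (the channel counts and image size are illustrative assumptions); note that `BatchNorm2d` keeps one scale/shift pair per channel:

```python
import torch
from torch import nn

# Hypothetical convolutional block: convolution -> BN -> activation.
# BatchNorm2d averages over the batch and all spatial locations of each channel.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),     # 16 scale/shift pairs, one per output channel
    nn.ReLU(),
)

X = torch.randn(8, 3, 28, 28)   # minibatch of 8 RGB images
print(net(X).shape)             # torch.Size([8, 16, 28, 28])
```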
In CNNs, BN remains well-defined even for minibatches of size 1, because each channel still has many spatial locations to average over. This observation motivates layer normalisation, which normalises over the features of a single example rather than over the minibatch.
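A small PyTorch sketch of this point (the shapes are illustrative assumptions): with a single example, per-channel BN still has spatial locations to average over, while layer normalisation does not depend on the batch size at all:

```python
import torch
from torch import nn

x = torch.randn(1, 16, 28, 28)  # a single example: minibatch of size 1

# BatchNorm2d in training mode still works here, since each channel
# has 28*28 spatial locations to average over.
bn = nn.BatchNorm2d(16)
print(bn(x).shape)              # torch.Size([1, 16, 28, 28])

# Layer normalisation instead normalises over the features of each
# example independently, so it never depends on the batch size.
ln = nn.LayerNorm([16, 28, 28])
print(ln(x).shape)              # torch.Size([1, 16, 28, 28])
```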