Cross entropy (CE) is a loss function commonly used for classification problems. It’s given by:

$$\mathrm{CE} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$

where:

  • $N$ is the number of training samples.
  • $C$ is the number of classes.
  • $y_{i,c}$ is the ground truth label (1 if sample $i$ belongs to class $c$, 0 otherwise).
  • $\hat{y}_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$.
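
To make the formula concrete, here is a minimal sketch that evaluates it directly; the one-hot labels and predicted probabilities are made-up values for illustration:

```python
import torch

# N=3 samples, C=4 classes.
# y holds one-hot ground-truth labels, y_hat holds predicted probabilities.
y = torch.tensor([[1., 0., 0., 0.],
                  [0., 0., 1., 0.],
                  [0., 1., 0., 0.]])
y_hat = torch.tensor([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.2, 0.5, 0.1],
                      [0.1, 0.6, 0.2, 0.1]])

# Sum over classes and samples, exactly as in the equation above.
ce = -(y * torch.log(y_hat)).sum()
print(ce)  # tensor(1.5606) -- total cross entropy over the batch
```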

Does the base of the logarithm matter? Not really: since $\log_b(x) = \ln(x)/\ln(b)$, changing the base only scales the loss by a constant factor, so it doesn’t change which model minimises it. Historically base 2 has been used because of cross entropy’s origins in information theory, where information is measured in bits; in practice, deep learning libraries use the natural logarithm.

With one-hot labels, this is actually just the negative log likelihood of the data; dividing by the number of samples $N$ gives the average per-sample loss.

We can compute this in PyTorch with torch.nn.CrossEntropyLoss(), which expects raw logits rather than probabilities (it applies log-softmax internally) and averages over the batch by default.
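
A minimal usage sketch (the logits and targets are random placeholder values): CrossEntropyLoss takes raw logits of shape (N, C) and integer class indices, and the manual computation at the end shows it matches log-softmax followed by negative log likelihood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Raw, unnormalised scores (logits) for N=3 samples and C=4 classes,
# plus integer class indices as targets -- the format CrossEntropyLoss expects.
logits = torch.randn(3, 4)
targets = torch.tensor([0, 2, 1])

criterion = nn.CrossEntropyLoss()  # applies log-softmax internally
loss = criterion(logits, targets)

# Equivalent manual computation: log-softmax, then pick out the
# log-probability of the true class and average (negative log likelihood).
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(3), targets].mean()

print(loss, manual)  # the two values match
```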

Binary cross-entropy

For a Bernoulli distribution, i.e., where we have a binary classification problem, we can use a specialised case of CE called binary cross entropy (BCE):

$$\mathrm{BCE} = -\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where the terms $y_i \log(\hat{y}_i)$ and $(1 - y_i) \log(1 - \hat{y}_i)$ relate to the positive and negative class of the Bernoulli distribution, respectively.

In PyTorch, we use torch.nn.BCELoss(), which expects probabilities (i.e., outputs already passed through a sigmoid). A variation on this is torch.nn.BCEWithLogitsLoss(), which takes raw logits instead, combining a sigmoid layer with BCE in a single, more numerically stable operation.
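
A short sketch comparing the two (the logits and targets are placeholder values): BCELoss is applied to probabilities after an explicit sigmoid, while BCEWithLogitsLoss consumes the raw logits directly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Raw scores for N=4 binary-classification samples, and float targets in {0, 1}.
logits = torch.randn(4)
targets = torch.tensor([1., 0., 0., 1.])

# Option 1: apply a sigmoid ourselves, then BCELoss on the probabilities.
probs = torch.sigmoid(logits)
loss_bce = nn.BCELoss()(probs, targets)

# Option 2: BCEWithLogitsLoss works on the raw logits directly,
# which is more numerically stable.
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_bce, loss_bce_logits)  # the two values match (up to floating point)
```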