Softmax is the activation function typically used on the final output layer in a multiclass classification problem.

What does softmax do? It normalises the logits (the raw outputs of the NN) into a discrete probability distribution over all possible classes.

i.e., for a logit vector with n possible classes, softmax outputs n probabilities (one per class) that sum to 1.
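As a rough sketch of that computation (plain NumPy, with a hypothetical `softmax` helper name):

```python
import numpy as np

def softmax(logits):
    """Convert raw logits into a probability distribution over classes."""
    # Subtract the max logit for numerical stability (doesn't change the result).
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Example: 3-class logits -> probabilities that sum to 1.
print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]
```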

Temperature scaling

Softmax temperature scaling helps address over-confidence in neural networks by dividing the logits by a temperature T before applying the softmax (see the sketch after the list below).

  • A low temperature (T < 1) scales the logits up, giving a sharper, more confident distribution. It generates higher-quality samples with less variety.
  • A high temperature (T > 1) scales the logits down, giving a flatter, less confident distribution, with the opposite effect: more variety, lower-quality samples.
  • Conceptually, temperature is similar to the idea in simulated annealing: a high temperature corresponds to more exploration, while a low temperature corresponds to more stable, exploitative behaviour.
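A minimal sketch of the scaling, assuming a NumPy setting and a hypothetical `softmax_with_temperature` helper:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    scaled = logits / temperature
    scaled -= np.max(scaled)          # numerical stability
    exps = np.exp(scaled)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, 0.5))  # low T: sharper, more confident
print(softmax_with_temperature(logits, 1.0))  # plain softmax
print(softmax_with_temperature(logits, 2.0))  # high T: flatter, less confident
```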

For a generative RNN, a high temperature means a larger chance of nonsensical outputs, while a low temperature gives relatively stable, sensible outputs.
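To make that concrete, here is a small, self-contained sketch of temperature-based sampling (the `sample_next_token` name and the toy logits are illustrative, not from any particular RNN):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0):
    """Sample one class index from the temperature-scaled softmax distribution."""
    scaled = logits / temperature
    scaled -= np.max(scaled)                  # numerical stability
    probs = np.exp(scaled) / np.sum(np.exp(scaled))
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1])
print([sample_next_token(logits, 0.2) for _ in range(10)])  # mostly the top class
print([sample_next_token(logits, 5.0) for _ in range(10)])  # far more varied
```

With a low temperature the samples collapse onto the most likely class (stable but repetitive); with a high temperature unlikely classes are drawn more often, which is where the nonsense outputs come from.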