A linear activation function is described as:

$$f(x) = cx + b$$

i.e., for $c \approx 1$, the input of a neuron is nearly fully passed on to the next layer (if the bias $b$ is 0). What this means is that for an $n$-dimensional input space, our linear function produces an $n$-dimensional hyperplane.
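
As a minimal sketch (NumPy is used here only for illustration; the slope `c` and bias `b` follow the formula above), a linear activation just rescales and shifts a neuron's pre-activation:

```python
import numpy as np

def linear_activation(z, c=1.0, b=0.0):
    """Linear activation: f(z) = c * z + b."""
    return c * z + b

# Pre-activations of a single neuron over a batch of inputs.
z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# With c = 1 and b = 0, the inputs pass through unchanged.
print(linear_activation(z))                 # [-2.  -0.5  0.   1.5  3. ]

# A different slope/bias still traces a straight line (a hyperplane in n-D).
print(linear_activation(z, c=0.5, b=1.0))   # [ 0.    0.75  1.    1.75  2.5 ]
```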

Visually, we can think of a linear function as dividing the input space into two half-spaces, e.g., acting as the decision boundary in a binary classification problem.
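
To make the "two half-spaces" picture concrete, here is a small sketch (the weights `w` and bias `b` are hypothetical values chosen for illustration) that classifies 2D points by which side of the line $w \cdot x + b = 0$ they fall on:

```python
import numpy as np

# Hypothetical weights and bias defining the line w . x + b = 0 in 2D.
w = np.array([1.0, -1.0])
b = 0.0

points = np.array([
    [2.0, 1.0],   # below the line y = x  ->  score > 0
    [1.0, 2.0],   # above the line y = x  ->  score < 0
    [0.0, 0.0],   # exactly on the boundary
])

# Each point is assigned to one of the two half-spaces by the sign of its score.
scores = points @ w + b
labels = (scores > 0).astype(int)
print(scores)   # [ 1. -1.  0.]
print(labels)   # [1 0 0]
```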

Why don’t we want linear functions?

  • Most datasets aren't linearly separable, i.e., there is no single line (or hyperplane, in higher dimensions) that can separate the classes in the input space.
  • We want multiple non-linear transformations across our layers to rectify this.
  • There's no advantage to stacking multiple linear layers.
    • All of the layers collapse into a single, equivalent linear layer (see the sketch after this list).
    • Think about it this way: every input is essentially passed straight through with few changes (aside from a small shift from the bias).
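
To see why stacked linear layers collapse, note that $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, which is itself just a single linear layer. A minimal NumPy sketch (the layer shapes and random values here are arbitrary, chosen only to demonstrate the identity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with purely linear activations (shapes chosen arbitrarily).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Passing x through both layers...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...is identical to one collapsed layer with W = W2 W1 and b = W2 b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))  # True
```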