A linear activation function is described as:

$$f(x) = ax + b$$

i.e., for $a = 1$, the input of a neuron is passed on to the next layer essentially unchanged (if the bias $b$ is 0). What this means is that for an $n$-dimensional input space, our linear function produces an $n$-dimensional hyperplane.
Visually, we can think of a linear activation function as dividing the input space into two half-spaces, e.g., in a binary classification problem.
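As a concrete illustration, here is a minimal NumPy sketch (the weights, bias, and points are made up for the example): an element-wise linear activation only scales and shifts its input, and the zero level set of a single linear neuron, $w \cdot x + b = 0$, is a hyperplane that splits the input space into two half-spaces.

```python
import numpy as np

def linear_activation(x, a=1.0, b=0.0):
    """Linear activation: the input is only scaled by a and shifted by b."""
    return a * x + b

# A single linear neuron on a 2-D input: its output is w . x + b.
# The set where the output equals 0 is a line (a hyperplane in 2-D)
# that divides the input plane into two half-spaces.
w = np.array([1.0, -2.0])   # hypothetical weights
b = 0.5                     # hypothetical bias

points = np.array([[0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
outputs = points @ w + b
sides = np.sign(outputs)    # which half-space each point falls in
print(outputs, sides)
```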
Why don’t we want linear functions?
- Most datasets aren’t linearly separable, i.e., there’s no single hyperplane that can separate the classes in the input space.
- We want non-linear transformations between our layers to address this.
- There’s no advantage to stacking multiple linear layers.
- The whole stack collapses into a single, equivalent linear layer (see the sketch after this list).
- Think about it: every input is essentially passed straight through, changed only by a scaling and a bias shift.
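To see the collapse concretely, here is a minimal NumPy sketch (layer sizes and weights are chosen arbitrarily for illustration): composing two linear layers with no non-linearity in between is algebraically identical to a single linear layer with weights $W_2 W_1$ and bias $W_2 b_1 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no non-linearity in between (hypothetical sizes).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Passing x through both layers...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...gives exactly the same result as one combined linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers we stack, the network can only ever represent a single affine map, which is why non-linear activations are needed between layers.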