In machine learning applications, we typically want to divide our dataset into three separate subsets: a training set, a validation set, and a testing set.

  • Training sets are used to train the model and learn parameters (like neuron weights).
  • Validation sets are used to tune hyperparameters; they are carved out of the training set from examples that the training algorithm never observes.
  • Testing sets are used to assess the performance of the trained model on data it has never seen.

One key idea is that the testing set should only be used sparingly. If we keep evaluating our model against the testing set and adjusting it in response, we have essentially folded the testing set into the training set and will overfit to it.

The validation set is distinct from the training set because hyperparameters tuned directly on the training data would effectively “always choose the maximum possible model capacity”1, leading to overfitting on the training set. It’s also constructed from the training data rather than the test data because the test data shouldn’t be used to make any choices about the model (the same problem as above).
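
For a concrete sketch of how the validation set gets used, here is what hyperparameter selection might look like (a minimal example, assuming scikit-learn's KNeighborsClassifier; the toy data and the candidate k values are purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data standing in for a real train/validation split
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(640, 4)), rng.integers(0, 2, size=640)
x_val, y_val = rng.normal(size=(160, 4)), rng.integers(0, 2, size=160)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9]:                # candidate hyperparameter values
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)          # parameters learned on the training set
    acc = model.score(x_val, y_val)      # hyperparameter judged on the validation set
    if acc > best_acc:
        best_k, best_acc = k, acc
# only after fixing best_k would we evaluate once on the testing set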

Numbers

We try to have an 80/20 split between training and testing. 20% of the training data is then used for validation.
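
With 1,000 examples, that works out to 200 for testing, 160 for validation, and 640 for training. A quick sketch of the arithmetic as a shuffle-and-slice split (plain NumPy; the sizes are just the example above):

import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(1000)      # shuffle indices before slicing

test_idx = idx[:200]             # 20% of 1,000 for testing
val_idx = idx[200:360]           # 20% of the remaining 800 for validation
train_idx = idx[360:]            # the other 640 for training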

In code

For traditional statistical models in scikit-learn, we can use:

from sklearn.model_selection import train_test_split

It takes the following inputs (a usage sketch follows the list):

  • Any number of indexable arrays passed in are split in parallel, including lists, NumPy arrays, and pandas DataFrames. Say we have x as the data and y as the labels.
  • test_size is a value between 0 and 1 that specifies the proportion of the data to place in the testing set.
  • stratify takes an array of labels and makes the split preserve each class's proportions in both subsets. The array we pass in should be y (or whatever the labels array is).
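
Putting the pieces together, a minimal usage sketch (the toy x and y are stand-ins for real data; random_state just makes the split reproducible):

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(2000).reshape(1000, 2)   # toy features
y = np.tile([0, 1], 500)               # toy binary labels

# first split off the testing set, stratified on the labels
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42)

# then carve the validation set out of the training data only
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.2, stratify=y_train, random_state=42)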

Footnotes

  1. From Deep Learning by Goodfellow, Bengio, and Courville.