Today I want to use cross-validation, a technique that gives a more reliable estimate of model performance than a single train/test split and helps guard against overfitting to one fixed test set.
Cross-validation is a technique used to protect against overfitting in a predictive model, particularly in a case where the amount of data may be limited. In cross-validation, you make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the overall error estimate. (Shane, 2010)
- Reference: Cross-validation: evaluating estimator performance (Scikit-learn)
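To see the "average the overall error estimate" part in action, scikit-learn's cross_val_score runs the whole fold loop in one call. Here is a minimal sketch; the iris data and logistic-regression model are just placeholder choices for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)  # placeholder model for illustration

# One accuracy score per fold; their mean is the cross-validated estimate
scores = cross_val_score(model, X, y, cv=5)
print("Mean: {:.3f}, Std: {:.3f}".format(scores.mean(), scores.std()))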
Two common types of cross-validation (CV):
- k-fold CV
- leave-one-out CV
The following figure shows 5-fold CV:
- Image from the ProClassify User's Guide, "Cross-Validation Explained". Available at: http://genome.tugraz.at/proclassify/help/pages/XV.html. Accessed March 17, 2017.
k-fold
In k-fold CV, the data is divided into k folds. The model is trained on k-1 folds and tested on the remaining fold; this is repeated k times so that each fold serves as the test set exactly once, and the k error estimates are averaged. The split indices look like this (see the training sketch after the output below):
from sklearn.model_selection import KFold

data = list(range(9))  # 9 samples, so each of the 3 folds holds 3 samples
kf = KFold(n_splits=3)
for train_index, test_index in kf.split(data):
    print("TRAIN: {}\tTEST: {}".format(train_index, test_index))
# Index of samples for the train and test set
TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
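To make the roles of those index arrays concrete, here is a minimal sketch of training and testing inside the fold loop; the iris data and nearest-neighbors classifier are arbitrary choices for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=3, shuffle=True, random_state=0)  # shuffle because iris is sorted by class

scores = []
for train_index, test_index in kf.split(X):
    model = KNeighborsClassifier()
    model.fit(X[train_index], y[train_index])  # train on k-1 folds
    scores.append(model.score(X[test_index], y[test_index]))  # test on the held-out fold
print(np.mean(scores))  # average of the k error estimates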
leave-one-out (LOO)
LOO CV is k-fold CV taken to the extreme: each fold contains exactly one sample, so k equals the number of samples n and the model is trained n times on n-1 samples each.
from sklearn.model_selection import LeaveOneOut

data = list(range(9))  # 9 samples, so 9 splits with a single test sample each
loo = LeaveOneOut()
for train_index, test_index in loo.split(data):
    print("TRAIN: {}\tTEST: {}".format(train_index, test_index))
# Index of samples for the train and test set
TRAIN: [1 2 3 4 5 6 7 8] TEST: [0]
TRAIN: [0 2 3 4 5 6 7 8] TEST: [1]
TRAIN: [0 1 3 4 5 6 7 8] TEST: [2]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3]
TRAIN: [0 1 2 3 5 6 7 8] TEST: [4]
TRAIN: [0 1 2 3 4 6 7 8] TEST: [5]
TRAIN: [0 1 2 3 4 5 7 8] TEST: [6]
TRAIN: [0 1 2 3 4 5 6 8] TEST: [7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8]
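One way to see the relationship between the two methods: LeaveOneOut() produces exactly the same splits as KFold with n_splits set to the number of samples. A small sketch to verify this:

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

data = np.arange(9)
kf_splits = list(KFold(n_splits=len(data)).split(data))
loo_splits = list(LeaveOneOut().split(data))

# Each (train, test) pair of index arrays matches exactly
assert all(np.array_equal(tr1, tr2) and np.array_equal(te1, te2)
           for (tr1, te1), (tr2, te2) in zip(kf_splits, loo_splits))

Note the cost, though: with n samples, LOO fits the model n times, which gets expensive for large datasets.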