Today I want to use cross-validation, a technique that gives a more reliable estimate of model performance than a single train/test split and helps guard against overfitting to one fixed test set.
Cross-validation is a technique used to protect against overfitting in a predictive model, particularly in a case where the amount of data may be limited. In cross-validation, you make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the overall error estimate. (Shane, 2010)
- Reference: Cross-validation: evaluating estimator performance (Scikit-learn)
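To see the "average the overall error estimate" part in action, scikit-learn's cross_val_score runs the whole fold loop in one call. Here is a minimal sketch; the iris data and logistic-regression model are just placeholder choices for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)  # placeholder model for illustration

# One accuracy score per fold; their mean is the cross-validated estimate
scores = cross_val_score(model, X, y, cv=5)
print("Mean: {:.3f}, Std: {:.3f}".format(scores.mean(), scores.std()))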
Two common types of cross-validation (CV):
- k-fold CV
- leave-one-out CV
The following figure shows 5-fold CV:
- Image from the ProClassify User's Guide, "Cross-Validation Explained". Available at: http://genome.tugraz.at/proclassify/help/pages/XV.html. Accessed March 17, 2017.
k-fold
In k-fold CV, the data is divided into k folds. The model is trained on k-1 folds and tested on the remaining fold; this is repeated k times so that each fold serves as the test set exactly once, and the k error estimates are averaged. The split indices look like this (see the training sketch after the output below):
from sklearn.model_selection import KFold

data = list(range(9))  # 9 samples, so each of the 3 folds holds 3 samples
kf = KFold(n_splits=3)
for train_index, test_index in kf.split(data):
    print("TRAIN: {}\tTEST: {}".format(train_index, test_index))
# Index of samples for the train and test set
TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
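To make the roles of those index arrays concrete, here is a minimal sketch of training and testing inside the fold loop; the iris data and nearest-neighbors classifier are arbitrary choices for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=3, shuffle=True, random_state=0)  # shuffle because iris is sorted by class

scores = []
for train_index, test_index in kf.split(X):
    model = KNeighborsClassifier()
    model.fit(X[train_index], y[train_index])  # train on k-1 folds
    scores.append(model.score(X[test_index], y[test_index]))  # test on the held-out fold
print(np.mean(scores))  # average of the k error estimates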
leave-one-out (LOO)
LOO CV is k-fold CV taken to the extreme: each fold contains exactly one sample, so k equals the number of samples n and the model is trained n times on n-1 samples each.
from sklearn.model_selection import LeaveOneOut

data = list(range(9))  # 9 samples, so 9 splits with a single test sample each
loo = LeaveOneOut()
for train_index, test_index in loo.split(data):
    print("TRAIN: {}\tTEST: {}".format(train_index, test_index))
# Index of samples for the train and test set
TRAIN: [1 2 3 4 5 6 7 8] TEST: [0]
TRAIN: [0 2 3 4 5 6 7 8] TEST: [1]
TRAIN: [0 1 3 4 5 6 7 8] TEST: [2]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3]
TRAIN: [0 1 2 3 5 6 7 8] TEST: [4]
TRAIN: [0 1 2 3 4 6 7 8] TEST: [5]
TRAIN: [0 1 2 3 4 5 7 8] TEST: [6]
TRAIN: [0 1 2 3 4 5 6 8] TEST: [7]
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8]
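One way to see the relationship between the two methods: LeaveOneOut() produces exactly the same splits as KFold with n_splits set to the number of samples. A small sketch to verify this:

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

data = np.arange(9)
kf_splits = list(KFold(n_splits=len(data)).split(data))
loo_splits = list(LeaveOneOut().split(data))

# Each (train, test) pair of index arrays matches exactly
assert all(np.array_equal(tr1, tr2) and np.array_equal(te1, te2)
           for (tr1, te1), (tr2, te2) in zip(kf_splits, loo_splits))

Note the cost, though: with n samples, LOO fits the model n times, which gets expensive for large datasets.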