Day 22: Splitting data, calculating metrics, cross-validation

Today I want to showcase “better” and more tidy ways of using Python.

The Jupyter Notebook for this little project is found here.

Splitting the data into testing and training sets

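Originally, I would slice the data by index. The problem is that the slicing is not random: if the rows are ordered (for example, sorted by class), the test set will not be representative of the whole dataset.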

train_num = int(len(X) * 0.8)

X_train = X[:train_num]
X_test  = X[train_num:]

y_train = y[:train_num]
y_test  = y[train_num:]
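To see why the ordering matters, here is a minimal sketch. It assumes the iris dataset bundled with Scikit-learn as a stand-in for X and y (the notebook's own data may differ). The iris rows are sorted by class, so slicing off the last 20% leaves a test set with a single class.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: iris stands in for the notebook's X and y
X, y = load_iris(return_X_y=True)

train_num = int(len(X) * 0.8)
y_train, y_test = y[:train_num], y[train_num:]

# Compare which class labels end up in each slice
print("Classes in training slice:", np.unique(y_train))
print("Classes in test slice:    ", np.unique(y_test))
```

Any accuracy computed on such a test slice says nothing about the other two classes.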

Scikit-learn has a function train_test_split() to randomly split the data into test and training sets.

from sklearn.model_selection import train_test_split

# test_size is the fraction held out for testing (0.2 keeps 80% for training, as above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
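Two optional parameters are worth knowing: random_state makes the shuffle reproducible, and stratify keeps the class proportions the same in both sets. A small sketch, again assuming the iris dataset as placeholder data and an arbitrary seed of 0:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: iris stands in for the notebook's X and y
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% of the rows for testing
    random_state=0,   # arbitrary seed so the split is repeatable
    stratify=y,       # preserve class proportions in both sets
)

print(X_train.shape, X_test.shape)
print(Counter(y_test.tolist()))  # class counts in the test set
```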

Calculating prediction accuracy

Originally, I would manually compute accuracy. The problem is that it takes more thought to read.

# Number of correct predictions divided by total predictions
sum(clf.predict(X_test) == y_test)/len(y_test)

Using Scikit-learn, there are alternative ways to do the same calculation.

# Directly from the trained classifier
clf.score(X_test, y_test)

# Using the metrics module of Scikit-learn
from sklearn import metrics
metrics.accuracy_score(y_test, clf.predict(X_test))
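As a quick sanity check, the three approaches should give the same number. This sketch assumes the iris dataset and a linear SVC as the classifier, which may not match the notebook's setup.

```python
from sklearn import metrics, svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: iris and a linear SVC stand in for the notebook's data and clf
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

manual = sum(y_pred == y_test) / len(y_test)           # by hand
from_score = clf.score(X_test, y_test)                 # classifier method
from_metrics = metrics.accuracy_score(y_test, y_pred)  # metrics module

print(manual, from_score, from_metrics)
```

Note that accuracy_score() takes the true labels first and the predictions second.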

Computing CV metrics

Yesterday we showed how to split the data into folds for cross-validation using KFold() and LeaveOneOut(). Alternatively, we could use the cross_val_score() function to “evaluate a score by cross-validation”.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y.values.ravel(), cv=3)
# clf is an unfitted estimator, e.g. clf = svm.SVC(kernel='linear', C=1)

# Print scores
print("The scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
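To connect this with the manual fold-splitting approach, the sketch below runs the loop by hand and compares it with cross_val_score(). It assumes the iris dataset and a linear SVC, and uses StratifiedKFold, which is what Scikit-learn picks for a classifier when cv is given as an integer.

```python
import numpy as np
from sklearn import svm
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: iris and a linear SVC stand in for the notebook's data and clf
X, y = load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)
cv = StratifiedKFold(n_splits=3)

# Manual loop: fit a fresh copy of the classifier on each training fold
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_clf = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_clf.score(X[test_idx], y[test_idx]))

# Same folds, one call
helper_scores = cross_val_score(clf, X, y, cv=cv)

print("Manual:", np.round(manual_scores, 3))
print("Helper:", np.round(helper_scores, 3))
```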

NB: Different metrics can be used via the “scoring” parameter.
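For instance, passing scoring="f1_macro" reports the macro-averaged F1 score per fold instead of accuracy. A minimal sketch, assuming the iris dataset and a linear SVC as placeholders:

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

# Assumption: iris and a linear SVC stand in for the notebook's data and clf
X, y = load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)

# One macro-averaged F1 score per fold
f1_scores = cross_val_score(clf, X, y, cv=3, scoring="f1_macro")

print("F1 (macro) per fold:", f1_scores)
print("Mean F1: %0.2f" % f1_scores.mean())
```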

Getting CV predictions

Similarly, we can use cross_val_predict() to “generate cross-validated estimates for each input data point”. In other words, it returns the prediction made for each sample while that sample was in the test fold.

from sklearn.model_selection import cross_val_predict

predicted = cross_val_predict(clf, X, y.values.ravel(), cv=3)
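Because every sample gets exactly one prediction, the result lines up with y and can be fed into any metric. A sketch that builds a confusion matrix from the cross-validated predictions, assuming the iris dataset and a linear SVC:

```python
from sklearn import metrics, svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict

# Assumption: iris and a linear SVC stand in for the notebook's data and clf
X, y = load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)

# One out-of-fold prediction per sample, in the same order as y
predicted = cross_val_predict(clf, X, y, cv=3)

conf_matrix = metrics.confusion_matrix(y, predicted)
print(predicted.shape)
print(conf_matrix)
```

Rows of the confusion matrix are the true classes and columns are the predicted classes.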

For an example, see here.