Today I want to showcase some “better”, tidier ways of using Python.
The Jupyter Notebook for this little project is found here.
Splitting the data into testing and training sets
Originally, I would slice the data by index. The problem is that this split is not random: whichever rows happen to come last always end up in the test set.
# Use the first 80% of the rows for training and the rest for testing
train_num = int(len(X) * 0.8)
X_train = X[:train_num]
X_test = X[train_num:]
y_train = y[:train_num]
y_test = y[train_num:]
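To see why this matters, here is a minimal sketch of the pitfall. It assumes the iris dataset as a stand-in for the notebook’s data; since its rows are ordered by class, an index-based split produces a test set containing only one class.
import numpy as np
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
train_num = int(len(X) * 0.8)
y_train, y_test = y[:train_num], y[train_num:]
print(np.unique(y_train))  # [0 1 2]
print(np.unique(y_test))   # [2] -- every test label is the same class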
Scikit-learn has a function train_test_split() to randomly split the data into training and test sets.
from sklearn.model_selection import train_test_split
# test_size is the held-out fraction, so 0.2 reproduces the 80/20 split above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
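As a quick self-contained sketch (again assuming iris in place of the notebook’s data), passing random_state makes the shuffle reproducible:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
# 20% of the 150 rows are held out at random, reproducibly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)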
Calculating prediction accuracy
Originally, I would compute accuracy manually. The problem is that it takes more thought to read.
# Number of correct predictions divided by the total number of predictions
sum(clf.predict(X_test) == y_test)/len(y_test)
Scikit-learn offers alternative ways to do the same calculation.
# Directly from the trained classifier
clf.score(X_test, y_test)
# Using Scikit-learn's metrics module
from sklearn import metrics
metrics.accuracy_score(y_test, clf.predict(X_test))
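Here is a rough end-to-end sketch tying the three together; the SVC classifier and the iris data are illustrative assumptions, not the notebook’s actual setup:
from sklearn import metrics, svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
# All three report the same fraction of correct predictions
print(sum(clf.predict(X_test) == y_test) / len(y_test))
print(clf.score(X_test, y_test))
print(metrics.accuracy_score(y_test, clf.predict(X_test)))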
Computing CV metrics
Yesterday we showed how to split the data into folds for cross-validation using KFold() and LeaveOneOut(). Alternatively, we could use the cross_val_score() function to “evaluate a score by cross-validation”.
from sklearn.model_selection import cross_val_score
# clf is an unfitted estimator, e.g. clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y.values.ravel(), cv=3)
# Print scores
print("The scores:", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
- NB: Different scoring metrics can be selected with the “scoring” parameter, as sketched below
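For instance, a sketch of scoring by macro-averaged F1 instead of accuracy (f1_macro is one of several scorer names scikit-learn accepts):
from sklearn.model_selection import cross_val_score
# Same call as above, but with a different scoring metric
scores = cross_val_score(clf, X, y.values.ravel(), cv=3, scoring='f1_macro')
print("F1 (macro): %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))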
Getting CV predictions
Similarly, we can use the cross_val_predict() function to “generate cross-validated estimates for each input data point”. In other words, to get the predicted result for each test set of the cross-validation.
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(clf, X, y.values.ravel(), cv=3)
- For an example, see here.
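As a minimal follow-up sketch, the per-sample predictions can then be scored directly (note that an accuracy computed this way can differ slightly from cross_val_score, since it pools all folds into a single calculation):
from sklearn import metrics
# Overall accuracy across all cross-validated predictions
print(metrics.accuracy_score(y.values.ravel(), predicted))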