Day05: Recognition with random forest

Day05 brings another Kaggle competition, “Digit Recognizer”. The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is.

The Jupyter Notebook for this little project is found here.

Random Forest

Today, I think I’ll use a random forest classifier. A random forest classifier is like an election where the outcome with the most votes (from the ensemble of decision trees) is the predicted classification. Here is a minimal example of how this is done in Python:

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc = fit(X_train, y_train)

rfc.predict(X_test)

Number of trees?

I split the train data into 2 sets: 80% for training and 20% for validation. I then (crudely) optimized for the number of trees (default = 10 trees) and decided that 30 trees is as good as 100 trees.

With more trees, comes greater training time.

Evaluation

Submitting the results to kaggle, we get a public score of “0.95871”.