Day94: Bagging

Posted by csiu on May 29, 2017 | with: 100daysofcode, Kaggle, Machine Learning

Another procedure I heard about is bagging.

The Jupyter Notebook for this little project is found here.

Bagging

According to wikipedia, bagging stands for stands for “bootstrap aggregating”, a special case of model averaging designed to improve stability and accuracy & reduce reduce variability and overfitting. Bagging is applied by:

  1. Generating a “bootstrap sample” by sampling with replacement (where the number of samples in the bootstrap sample is equal to the number of samples in the original dataset)
  2. Constructing a regression tree from the bootstrap sample
  3. Repeating steps 1-2 to get B regression trees
  4. Combining the B outputs by averaging (for regression) or voting (for classification)

In python, bagging can be specified using Scikit-learn’s sklearn.ensemble.BaggingClassifier function.

from sklearn.ensemble import BaggingClassifier

clf = BaggingClassifier()
clf = clf.fit(X_train, y_train)

Predicting on the House Prices dataset

Using the same Kaggle House Prices for Advanced Regression Techniques dataset as described yesterday (Day93), the prediction accuracy is comparable.

Yesterday’s

Today’s

In the bagging methods, we also see that (up to a point) more trees results to higher accuracy.