Day10: Feature selection

Posted by csiu on March 6, 2017 | with: 100daysofcode, Machine Learning, Kaggle

Yesterday I alluded to some features being more informative than others. Today I want to take a look at feature selection. We will reuse the Digit Recognizer Kaggle MNIST data set.

The Jupyter Notebook for this little project is found here.

Feature selection

Feature selection is the process of selecting a subset of relevant features (i.e. variables, predictors) for use in model construction. According to Jason Brownlee (2014), feature selection “reduces overfitting, improves accuracy, and reduces training time”.

In today’s post, we will go through the feature selection documentation on scikit-learn.

Approach

  1. Select features
  2. Train a neural network (1 hidden layer containing
    mean(features, classes) nodes) on 80% of training samples
  3. Validate on remaining 20% of training samples

Note: the baseline model (sketched below) uses all 784 features and achieves an accuracy of 0.9655 on the validation set.
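
As a reference, here is a minimal sketch of steps 2–3, assuming X and y hold the Kaggle training pixels and labels, and using scikit-learn's MLPClassifier as a stand-in for the notebook's network:

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hold out 20% of the labelled samples for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

# One hidden layer with mean(n_features, n_classes) nodes,
# e.g. (784 + 10) // 2 = 397 when all features are used
n_hidden = (X_train.shape[1] + 10) // 2
model = MLPClassifier(hidden_layer_sizes=(n_hidden,))
model.fit(X_train, y_train)

print(model.score(X_valid, y_valid))  # validation accuracy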

Removing features with low variance

In Removing features with low variance, all features whose variance does not meet some threshold are removed.

from sklearn.feature_selection import VarianceThreshold

# Arbitrarily set the variance threshold to 0.1
sel = VarianceThreshold(threshold=0.1)

# Fit on the training features and keep only those above the threshold
X_new = sel.fit_transform(X_train)

When we arbitrarily set the threshold to 0.1, we end up with 689 features and an accuracy of 0.9663 on the validation set. This accuracy is slightly better than predicting using all features.
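
To keep the training and validation features consistent, the same fitted selector is applied to the held-out samples as well (assuming X_valid is the 20% validation split from above):

# Apply the same variance-based selection to the validation features
X_valid_new = sel.transform(X_valid)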

Univariate feature selection

In univariate feature selection, all but the highest scoring features are removed. For this, scikit-learn provides SelectKBest() and SelectPercentile(), among others.

from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2

# Arbitrarily keep the top 90% of features, scored by the chi-squared test
sel = SelectPercentile(chi2, percentile=90)
X_new = sel.fit_transform(X_train, y_train)

When we arbitrarily keep the top 90% of features, we end up with 705 features and an accuracy of 0.9633 on the validation set. This accuracy is comparable to, but slightly worse than, removing features with low variance.
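
SelectKBest, the other selector mentioned above, keeps a fixed number of features rather than a percentage. A minimal sketch (the choice of k = 700 is arbitrary and not part of the original analysis):

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Keep the 700 highest-scoring features according to the chi-squared test
sel = SelectKBest(chi2, k=700)
X_new = sel.fit_transform(X_train, y_train)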

Recursive Feature Elimination

According to Feature Selection in Python with Scikit-Learn, recursive feature elimination works by “recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.”

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression()

# create the RFE model and select 350 attributes
rfe = RFE(model, n_features_to_select=350)
rfe.fit(X_train, y_train)

# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

This was pretty slow and I stopped it before it could finish. Overall, a failed attempt at feature selection.
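
One way to make RFE more tractable (not something tried in the original notebook) is to drop many features per iteration via the step parameter, trading ranking granularity for speed:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Eliminate 50 features per iteration instead of the default 1
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=350, step=50)
X_new = rfe.fit_transform(X_train, y_train)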

Feature Importance

Feature importance-based selection keeps the features that a previously fitted classifier considers most important, for example the features ranked highest by a number of randomized decision trees.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Fits a number of randomized decision trees
model = ExtraTreesClassifier(n_estimators = 30)
model.fit(X_train, y_train)

# Select the important features of previous model
sel = SelectFromModel(model, prefit=True)

# Subset features
X_new = sel.transform(X_train)

When we arbitrarily use 30 decision trees, we end up with 272 features and an accuracy of 0.9548 on the validation set.
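
By default SelectFromModel keeps the features whose importance is above the mean importance. If fewer (or more) features are wanted, the threshold argument can be adjusted; a small sketch, reusing the fitted model from above:

# Keep only features whose importance is above the median importance,
# which selects roughly half of the features
sel = SelectFromModel(model, prefit=True, threshold="median")
X_new = sel.transform(X_train)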

Evaluation on Kaggle

From the analysis, removing features with low variance gave the best performance. When we selected 688 features by removing the features with low variance from the full data set and submitted the predictions to Kaggle, we obtained a submission score of 0.96343, which is not an improvement over the best score (i.e. the neural network without feature selection).

In conclusion, feature selection may not be necessary for neural networks, at least on this data set.