Day 20: Important features

Posted by csiu on March 16, 2017 | with: 100daysofcode, Machine Learning

The Jupyter Notebook for this little project is found here.

Yesterday we explored the Kaggle HR dataset to answer a few dashboard-type questions. Moving beyond the scope of a data analyst and into that of a data scientist, I asked the following:

Given the following features, can we predict a person's salary, and can we identify which features are most informative for making that prediction?

satisfaction_level
last_evaluation
number_project
average_montly_hours
time_spend_company
work_accident
left_workplace
promotion_last_5years
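
For context, here is a minimal sketch of how the feature matrix X and target Y might be prepared. The file name HR_comma_sep.csv matches the Kaggle download, but the column names below assume the raw columns have already been renamed to match the list above, so treat them as assumptions rather than the notebook's exact code:

import pandas as pd

# Load the Kaggle HR dataset (file name as distributed on Kaggle)
df = pd.read_csv("HR_comma_sep.csv")

# Features listed above (assumes the raw columns were renamed to these labels)
features = ["satisfaction_level", "last_evaluation", "number_project",
            "average_montly_hours", "time_spend_company", "work_accident",
            "left_workplace", "promotion_last_5years"]

X = df[features].values  # feature matrix
Y = df["salary"].values  # target: salary level (low/medium/high)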

Approach

  1. Split the dataset into train (80%) and test (20%) sets
  2. Train a random forest classifier
  3. Evaluate model prediction accuracy
  4. Identify important features

Random Forest Classifier

When we train the classifier, we get a test-set accuracy of 96.4%.

from sklearn.ensemble import RandomForestClassifier

# Split the dataset: train on the first 80%, test on the remaining 20%
n = len(X)
n_80 = int(n * 0.8)

X_train = X[:n_80]
Y_train = Y[:n_80]
X_test = X[n_80:]
Y_test = Y[n_80:]

# Train a random forest classifier
rfc = RandomForestClassifier()
rfc.fit(X_train, Y_train)

# Evaluate prediction accuracy on the held-out test set
sum(rfc.predict(X_test) == Y_test) / len(X_test)
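
(Equivalently, rfc.score(X_test, Y_test) reports the same test-set accuracy. Also worth noting: this split simply takes the first 80% of rows, so if the CSV happens to be ordered in some way, sklearn.model_selection.train_test_split, which shuffles by default, would give a cleaner split.)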

Important features

We find that the most important features for predicting salary are (1) average_montly_hours, (2) last_evaluation, and (3) satisfaction_level.
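
For reference, a short sketch of how this ranking can be read off the fitted classifier via scikit-learn's feature_importances_ attribute (assuming the features list defined earlier):

# Rank features by the classifier's impurity-based importance scores
for score, name in sorted(zip(rfc.feature_importances_, features), reverse=True):
    print("{:.3f}  {}".format(score, name))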

Prediction accuracy with vs without important features

We next asked how the prediction accuracy would change if we predicted using only the important features versus only the non-important features.

When we considered only the important features, prediction accuracy was 95.0%, comparable to the 96.4% accuracy of the model trained on all features. Without those features, accuracy dropped to 57.2%.
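
For reference, a sketch of one way to run this comparison. The column indices in top3 follow the feature order listed above and are an assumption, not the notebook's exact code:

# Hypothetical: indices of the three most important features in the feature
# list above (average_montly_hours, last_evaluation, satisfaction_level)
top3 = [3, 1, 0]
rest = [i for i in range(X.shape[1]) if i not in top3]

# Retrain and score on each column subset
for cols, label in [(top3, "important features only"),
                    (rest, "non-important features only")]:
    clf = RandomForestClassifier()
    clf.fit(X_train[:, cols], Y_train)
    print(label, clf.score(X_test[:, cols], Y_test))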