I have been preprocessing the Kaggle House Prices data set without any baseline for how well a naive model performs. Today I submit to Kaggle to get an idea of how good the sale price predictions are.
The Jupyter Notebook for this little project can be found here.
Features
As a reminder, this data set has 79 features total (43 of which are categorical).
When we apply one-hot encoding to transform the categorical variables into numerical ones, we get a total of 288 features. Some of these features contain NAs and need to be dropped, leaving us with 285 features.
# Drop every column that contains at least one NA value
X.dropna(axis=1, how='any', inplace=True)
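For context, here is a minimal, self-contained sketch of the preprocessing up to this point. The file name and the use of pd.get_dummies are my assumptions about the notebook; the actual code may differ.

```python
import pandas as pd

# Load the training data and separate the 79 explanatory features.
train = pd.read_csv('train.csv')
X = train.drop(columns=['Id', 'SalePrice'])

# One-hot encode the 43 categorical features (79 columns -> 288 columns).
X = pd.get_dummies(X)

# Drop every column that still contains an NA (288 columns -> 285 columns).
X.dropna(axis=1, how='any', inplace=True)

print(X.shape)  # expected: (1460, 285)
```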
Kaggle: Linear regression on only the quantitative variables
Of the 36 quantitative features (79 total minus 43 categorical), 25 are free of NAs in the test set. Using a linear regression model with these 25 features yields a Kaggle score of 0.43803.
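Here is a sketch of how such a baseline submission can be produced with scikit-learn. The file names and the exact column filtering are my assumptions; only the idea of training on the NA-free quantitative features comes from the text above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Quantitative features that are NA-free in both the train and test sets.
num_cols = [c for c in train.select_dtypes(include='number').columns
            if c not in ('Id', 'SalePrice')
            and train[c].notna().all()
            and test[c].notna().all()]

model = LinearRegression()
model.fit(train[num_cols], train['SalePrice'])

# Kaggle takes the log of the predictions, so clip them to positive values.
predictions = model.predict(test[num_cols]).clip(min=1)

submission = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})
submission.to_csv('submission.csv', index=False)
```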
Kaggle: Linear regression with one-hot encoded and quantitative features
Of the 285 features, 259 are free of NAs in both the train and test sets. When we use these features to train a linear model, the Kaggle submission score is 0.45646, which is worse than the original prediction with only 25 quantitative features.
The lower the score, the better the predictions: Kaggle scores this competition on the root mean squared error between the logarithms of the predicted and actual sale prices.
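One subtlety in this experiment is that one-hot encoding the train and test sets separately produces different column sets, since some category levels appear in only one of the files. Below is a sketch of restricting both frames to the shared, NA-free columns (the post reports 259 such features); the approach is my assumption, not necessarily what the notebook does.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# One-hot encode each set separately.
X_train = pd.get_dummies(train.drop(columns=['Id', 'SalePrice']))
X_test = pd.get_dummies(test.drop(columns=['Id']))

# Keep only the columns present in both frames and containing no NAs.
shared = [c for c in X_train.columns.intersection(X_test.columns)
          if X_train[c].notna().all() and X_test[c].notna().all()]

model = LinearRegression()
model.fit(X_train[shared], train['SalePrice'])
predictions = model.predict(X_test[shared]).clip(min=1)
```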
Room for improvement
- Look into feature selection to improve prediction accuracy
- Look into other types of regression models (see the sketch below for one possibility)
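As a starting point for the second item, regularized linear models are a natural next step. Here is a minimal sketch using scikit-learn's RidgeCV, evaluated with cross-validation on log prices to approximate the Kaggle metric; this is my suggestion, not something from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

train = pd.read_csv('train.csv')
X = pd.get_dummies(train.drop(columns=['Id', 'SalePrice']))
X = X.dropna(axis=1, how='any')

# Model the log of the sale price, since that is what Kaggle scores.
y = np.log(train['SalePrice'])

# Ridge regression with a cross-validated choice of regularization strength.
model = RidgeCV(alphas=np.logspace(-3, 3, 13))
scores = cross_val_score(model, X, y, cv=5,
                         scoring='neg_root_mean_squared_error')
print('CV RMSE on log prices:', -scores.mean())
```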