Day01: Kaggle competition

Posted by csiu on February 25, 2017 | with: 100daysofcode, Machine Learning, Kaggle

One idea that has been in the back of my mind for the past few weeks is the idea of doing the #100daysofcode challenge, but it never seemed like a good time to start. I had papers to write, edits to make, presentations to give, and places to be. But after chatting with a few people and listening to their advice for success, I decided that now is the right time to start.

Today is Day01.

For Day01, I decided to do a Kaggle competition …

Kaggle was founded in April 2010 by Anthony Goldbloom. It runs programming contests to crowdsource machine learning solutions and is a data science community offering forums, public datasets, and tutorials.

… titled “Titanic: Machine Learning from Disaster”. The goal of this competition is to predict whether or not each passenger in a test set survived the sinking of the ship.

One of the things that I’ve always meant to do is get back to using Jupyter Notebooks. During my Masters in Bioinformatics, I managed to do most things in R: the combination of R Markdown and RStudio has a very notebook-esque feel; creating R Markdown websites and R packages added another layer of organization and reproducibility to my analyses; and the readability of Hadley Wickham’s suite of hadleyverse packages, including dplyr, tidyr, and readr, plus ggplot2, made everything so much easier to use.

For this Kaggle competition, I used Python.

Setting up

The first step was to create my work environment and install a couple of packages (this is not the full list).

conda create --name kaggle python=3 jupyter pandas matplotlib scikit-learn ...

Similar but different

Starting the notebook, I loaded the input files and proceeded to do the analyses.

It wasn’t that simple.

Things that were trivial in R took way more time in Python. I had to Google for every other thing I wanted to do. For instance, loading a file and filtering for rows matching some conditions in R would be:

library(dplyr)

readr::read_csv("train.csv") %>%
  dplyr::filter(Survived == 1, Sex == "male")

After some Googling, the same task is achieved in Python by:

import pandas as pd

train = pd.read_csv("train.csv")
train[(train["Survived"] == 1) & (train["Sex"] == "male")]
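
For what it’s worth, pandas also has a query method that reads a little closer to the dplyr version (a minor variation, not what I used in my notebook):

import pandas as pd

train = pd.read_csv("train.csv")

# Same filter expressed as a query string
train.query('Survived == 1 and Sex == "male"')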

During my search, one site that I enjoyed was Manish Amde’s Pandas and Python: Top 10 features.

Data exploration & visualization

With any new data set, one of the first things to do is to get a look and feel for the data. Making a few plots is a good idea, which led to looking up Python packages for data visualization:

  • matplotlib: over 10 years old; big following
  • Seaborn: built on top of matplotlib; nice defaults
  • ggplot: R-esque
  • Bokeh: interactive
  • Plotly: interactive
  • pygal: interactive; can output SVGs
  • pandas: built-in plotting

Seeing that I was already using pandas, I ended up using it for plotting as well. The plots I quickly generated can be found on Kaggle at Titanic: how many factor levels?
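
To give a rough sketch of the kind of quick plots I mean (using standard Titanic training-set columns; the exact plots in my notebook differ), pandas can draw bar charts straight from a DataFrame:

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# Survival rate split by sex, using pandas' built-in matplotlib plotting
train.groupby("Sex")["Survived"].mean().plot(kind="bar", title="Survival rate by sex")
plt.show()

# How many passengers were in each class
train["Pclass"].value_counts().plot(kind="bar", title="Passengers per class")
plt.show()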

Logistic regression

Finally, we move on to the actual making of the predictions.

Because I’m unfamiliar with scikit-learn, I decided to start simple and use logistic regression to predict the binary outcome of survival.

Much to my pleasure, scikit-learn is not hard to use.

from sklearn import linear_model

# Get my input from 'train' pandas DataFrame object
train_x = train[features].values
train_y = train.Survived.values

# Create logistic regression object
regr = linear_model.LogisticRegression()

# Train the model
regr.fit(X=train_x, y=train_y)

# Make the predictions on the test set features (built the same way as train_x)
regr.predict(test_x)

The hardest and most time-consuming part was the data wrangling: cleaning the data, dealing with missing values, and formatting everything into the input format the model expects.
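
To give a flavour of that wrangling, here is a minimal sketch using standard Titanic columns; the feature list is illustrative and not necessarily the six features I ended up using:

import pandas as pd

train = pd.read_csv("train.csv")

# Fill missing ages with the median age
train["Age"] = train["Age"].fillna(train["Age"].median())

# Encode the categorical 'Sex' column as 0/1 so scikit-learn can use it
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Illustrative feature set
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
train_x = train[features].values
train_y = train["Survived"].values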

With the logistic regression, I used 6 features and achieved an accuracy/public score of 0.74641.
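
For context, the public score comes from uploading a CSV of predictions for the test set. Roughly like this, assuming the test set is wrangled the same illustrative way as above and regr is the fitted model from earlier:

import pandas as pd

test = pd.read_csv("test.csv")

# Apply the same (illustrative) wrangling to the test set
test["Age"] = test["Age"].fillna(test["Age"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
test_x = test[features].values

# Kaggle's Titanic competition expects a CSV with PassengerId and the predicted Survived value
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": regr.predict(test_x),
})
submission.to_csv("submission.csv", index=False)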