Day 24: Transforming categorical variables

Posted by csiu on March 20, 2017 | with: 100daysofcode, Machine Learning

Yesterday I skipped the categorical variables. Today I’ll deal with them.

The Jupyter Notebook for this little project is found here.

Preprocessing of categorical variables

Some machine learning algorithms (such as linear regression or SVM) can only handle numerical inputs. For categorical variables to be used as model features, they will need to be converted into numbers.

In practice, why do we convert categorical class labels to integers for classification?
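
Part of the answer is simply that scikit-learn estimators validate their input and reject non-numeric values. A minimal sketch (the toy data below is hypothetical):

# Fitting an estimator on raw string categories fails outright
from sklearn.linear_model import LinearRegression

X = [["RL"], ["RM"], ["FV"]]  # raw string categories (hypothetical)
y = [200000, 150000, 180000]  # hypothetical sale prices

try:
    LinearRegression().fit(X, y)
except ValueError as err:
    print(err)
#> e.g. could not convert string to float: 'RL'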

Example of a categorical variable

The Kaggle House Prices data set has 43 categorical variables; I will arbitrarily use “MSZoning” in my examples.

“MSZoning” can take on 5 non-numerical values: RL, RM, C (all), FV, and RH.
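
A quick way to inspect these levels (a sketch, assuming the categorical columns of the training set are loaded in a data frame called train_cat, as in the snippets below):

# List the distinct levels of "MSZoning" and how often each occurs
print(train_cat.MSZoning.unique())
print(train_cat.MSZoning.value_counts())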

Assign a number

The following 3 methods each represent the levels of a non-numerical variable as numbers.

# pd.factorize()
# --------------
# Encode input values as an enumerated type or categorical variable
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html
# Returns a pair: integer codes (one per row) and the distinct values
labels, uniques = pd.factorize(train_cat.MSZoning)

# pd.map()
# --------
# Map values using eg. a dict
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html

# Define the mapping manually ...
class_mapping = {'RL': 0, 'RM': 1, 'C (all)': 2, 'FV': 3, 'RH': 4}
# ... or build it with enumerate (order follows first appearance in the data)
class_mapping = {v: i for i, v in enumerate(train_cat.MSZoning.unique())}

df = train_cat.MSZoning.map(class_mapping)

# sklearn.preprocessing.LabelEncoder()
# ------------------------------------
# Encode labels with value between 0 and n_classes-1.
# http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(train_cat.MSZoning)

print("CLASSES:", list(le.classes_))
print("TRANSFORM:", le.transform(["FV", "RH", "RM"]))
print("INVERSE:", list(le.inverse_transform([2, 2, 0])))
#> CLASSES: ['C (all)', 'FV', 'RH', 'RL', 'RM']
#> TRANSFORM: [1 2 4]
#> INVERSE: ['RH', 'RH', 'C (all)']

Assigning a number implies an order. Categorical variables are not necessarily ordinal, so for nominal variables like “MSZoning” another encoding should be used.

One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active. –scikit-learn: Preprocessing data
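
A minimal sketch of why this matters (the color codes below are hypothetical): under an integer encoding, a distance-based model would see Blue as closer to Red than to Green, an ordering the colors do not actually have.

# Hypothetical integer encoding of an unordered variable
codes = {'Blue': 0, 'Red': 1, 'Green': 2}

# Distances implied by the codes -- a spurious ordering
print(abs(codes['Blue'] - codes['Red']))    #> 1
print(abs(codes['Blue'] - codes['Green']))  #> 2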

One-hot encoding

In one-hot encoding, the number of features expands. For example, for a feature X with 3 classes {x1, x2, x3}, one-hot encoding produces 3 binarized features: {X_x1, X_x2, X_x3}.

Example: Imagine we have a test data frame with a quantitative (“A”) and a categorical (“Color”) variable. The quantitative variable can be used as is, but the categorical one needs to be transformed.

test_df = pd.DataFrame({"Color": ['Red', 'Blue', 'Red', 'Green'], "A": [5, 6, 1, 9]})
#>     A  Color
#>  0  5  Red
#>  1  6  Blue
#>  2  1  Red
#>  3  9  Green

Applying the one-hot encoding to the “Color” column, we get additional features.

# sklearn.preprocessing.OneHotEncoder()
# -------------------------------------
# Encode categorical integer features using a one-hot aka one-of-K scheme.
# http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
from sklearn import preprocessing

# Map colors to arbitrary integers (this OneHotEncoder API expects integer input)
class_mapping = {'Blue':0, 'Red':1, 'Green':2}
test_df["Color"] = test_df["Color"].map(class_mapping)
#>     A  Color
#>  0  5      1
#>  1  6      0
#>  2  1      1
#>  3  9      2

# One-hot encode column index 1 (the "Color" column)
enc = preprocessing.OneHotEncoder(categorical_features=[1])
enc.fit_transform(test_df).toarray()

# Note: The encoded columns represent: Blue, Red, Green, A
#>  array([[ 0.,  1.,  0.,  5.],
#>         [ 1.,  0.,  0.,  6.],
#>         [ 0.,  1.,  0.,  1.],
#>         [ 0.,  0.,  1.,  9.]])
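
Note that the categorical_features argument used above was removed in scikit-learn 0.22. In modern versions, OneHotEncoder accepts string labels directly and column selection is done with ColumnTransformer. A rough equivalent sketch (not this post's original code):

# Modern scikit-learn (>= 0.22): no integer pre-mapping needed
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Re-create the toy frame with its original string labels
test_df = pd.DataFrame({"Color": ['Red', 'Blue', 'Red', 'Green'], "A": [5, 6, 1, 9]})

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["Color"])],
    remainder="passthrough",  # pass the numeric "A" column through unchanged
)
ct.fit_transform(test_df)
# Encoded columns come out as: Blue, Green, Red, then A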

Another way to apply one-hot encoding is pd.get_dummies(), which works directly on string labels and can encode multiple columns at once.

# pd.get_dummies()
# ----------------
# Convert categorical variable into dummy/indicator variables
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
# get_dummies operates on the original string labels, so re-create test_df
test_df = pd.DataFrame({"Color": ['Red', 'Blue', 'Red', 'Green'], "A": [5, 6, 1, 9]})
df = pd.get_dummies(test_df)

#>      A  Color_Blue  Color_Green  Color_Red
#>  0   5           0            0          1
#>  1   6           1            0          0
#>  2   1           0            0          1
#>  3   9           0            1          0
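
The same one-liner scales to the full training data: applied to train_cat (a sketch, assuming the data frame of categorical columns from above), it expands all 43 categorical variables in one call.

# One-hot encode every categorical column of the training data at once
train_encoded = pd.get_dummies(train_cat)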