Home » Split column values on separator and create dummies in Pandas

# Split column values on separator and create dummies in Pandas

I’ve spent hours trying to find an elegant solution for this problem, and I’m ashamed about how easy it eventually was. I had a column with comma-separated values that I wanted to create features from. Each value should represent one feature. I got started with building loops over the values in that column but somehow I was convinced there must have been a more elegant solution. And a more elegant solution I found.

Let’s assume you have a column in your Pandas DataFrame that looks like this:

instruments
violin, bass, piano
bass, drums
cello, piano, bass
violin
drums, piano
...

And assume further that you’d like to build features out of it. That means, you would want dummies and in this example, that would look like this:

The solution is surprisingly simple: there is a Pandas string method that goes by the name get_dummies. I’ve only used it before for one-hot encoding (although it’s troublesome with unseen data). But apparently, it can handle multiple categories, divided by a separator. All you need is one simple argument.

x['instruments'].str.get_dummies(', ')

This will produce the exact same table that I referred to above.

Danger: Do not use this solution on your complete dataset or you will generate data leakage. You do not want to create features out of categories that are not in your training data. What you could do is filter the columns to only contain features from the training data. Like this:

supported_columns = X_train['instruments'].str.get_dummies(', ').columns

X_train_categories = X_train['instruments'].str.get_dummies(', ')
X_val_categories = X_val['instruments'].str.get_dummies(', ')

# This is where you filter and prevent data leakage
X_val_categories = X_val_categories.filter(supported_columns)

X_train = X_train.join(X_train_categories)
X_val = X_val.join(X_val_categories)

# Drop the instruments column
X_train.drop(['instruments'],axis = 1)
X_val.drop(['instruments'],axis = 1)

Finally, if you have a lot of values in the initial column, you might want to add a suffix or prefix to the features you created from it. This can be done fairly easily with a lambda function.

x['instruments'].str.get_dummies(sep=', ').rename(lambda x: 'instrument_' + x, axis='columns')

Great success!

### Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.