Home » Split column values on separator and create dummies in Pandas

Split column values on separator and create dummies in Pandas

  • by
  • 2 min read

I’ve spent hours trying to find an elegant solution for this problem, and I’m ashamed about how easy it eventually was. I had a column with comma-separated values that I wanted to create features from. Each value should represent one feature. I got started with building loops over the values in that column but somehow I was convinced there must have been a more elegant solution. And a more elegant solution I found.

Let’s assume you have a column in your Pandas DataFrame that looks like this:

instruments
violin, bass, piano
bass, drums
cello, piano, bass
violin
drums, piano
...

And assume further that you’d like to build features out of it. That means, you would want dummies and in this example, that would look like this:

violinbasspianodrumscello
11100
01010
01101
10000
00110

The solution is surprisingly simple: there is a Pandas string method that goes by the name get_dummies. I’ve only used it before for one-hot encoding (although it’s troublesome with unseen data). But apparently, it can handle multiple categories, divided by a separator. All you need is one simple argument.

x['instruments'].str.get_dummies(', ')

This will produce the exact same table that I referred to above.

Danger: Do not use this solution on your complete dataset or you will generate data leakage. You do not want to create features out of categories that are not in your training data. What you could do is filter the columns to only contain features from the training data. Like this:

supported_columns = X_train['instruments'].str.get_dummies(', ').columns

X_train_categories = X_train['instruments'].str.get_dummies(', ')
X_val_categories = X_val['instruments'].str.get_dummies(', ')

# This is where you filter and prevent data leakage
X_val_categories = X_val_categories.filter(supported_columns)

X_train = X_train.join(X_train_categories)
X_val = X_val.join(X_val_categories)

# Drop the instruments column
X_train.drop(['instruments'],axis = 1)
X_val.drop(['instruments'],axis = 1)

Finally, if you have a lot of values in the initial column, you might want to add a suffix or prefix to the features you created from it. This can be done fairly easily with a lambda function.

x['instruments'].str.get_dummies(sep=', ').rename(lambda x: 'instrument_' + x, axis='columns')

By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.

Great success!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published. Required fields are marked *