Split column values on separator and create dummies in Pandas

I’ve spent hours trying to find an elegant solution for this problem, and I’m ashamed about how easy it eventually was. I had a column with comma-separated values that I wanted to create features from. Each value should represent one feature. I got started with building loops over the values in that column but somehow I was convinced there must have been a more elegant solution. And a more elegant solution I found.

Let’s assume you have a column in your Pandas DataFrame that looks like this:

instruments
violin, bass, piano
bass, drums
cello, piano, bass
violin
drums, piano
...

And assume further that you’d like to build features out of it. That means, you would want dummies and in this example, that would look like this:

violin	bass	piano	drums	cello
1	1	1	0	0
0	1	0	1	0
0	1	1	0	1
1	0	0	0	0
0	0	1	1	0

The solution is surprisingly simple: there is a Pandas string method that goes by the name get_dummies. I’ve only used it before for one-hot encoding (although it’s troublesome with unseen data). But apparently, it can handle multiple categories, divided by a separator. All you need is one simple argument.

x['instruments'].str.get_dummies(', ')

This will produce the exact same table that I referred to above.

Danger: Do not use this solution on your complete dataset or you will generate data leakage. You do not want to create features out of categories that are not in your training data. What you could do is filter the columns to only contain features from the training data. Like this:

supported_columns = X_train['instruments'].str.get_dummies(', ').columns

X_train_categories = X_train['instruments'].str.get_dummies(', ')
X_val_categories = X_val['instruments'].str.get_dummies(', ')

# This is where you filter and prevent data leakage
X_val_categories = X_val_categories.filter(supported_columns)

X_train = X_train.join(X_train_categories)
X_val = X_val.join(X_val_categories)

# Drop the instruments column
X_train.drop(['instruments'],axis = 1)
X_val.drop(['instruments'],axis = 1)

Finally, if you have a lot of values in the initial column, you might want to add a suffix or prefix to the features you created from it. This can be done fairly easily with a lambda function.

x['instruments'].str.get_dummies(sep=', ').rename(lambda x: 'instrument_' + x, axis='columns')

By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Great success!

Split column values on separator and create dummies in Pandas

Say thanks, ask questions or give feedback

Say thanks, ask questions or give feedback

Leave a Reply Cancel reply

Related Posts

How to do a SUMIF in PySpark

Check if Python logger already exists

Spark 3.0: Solving the “dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z” error