I’ve spent hours trying to find an elegant solution for this problem, and I’m ashamed about how easy it eventually was. I had a column with comma-separated values that I wanted to create features from. Each value should represent one feature. I got started with building loops over the values in that column but somehow I was convinced there must have been a more elegant solution. And a more elegant solution I found.
Let’s assume you have a column in your Pandas DataFrame that looks like this:
instruments
violin, bass, piano
bass, drums
cello, piano, bass
violin
drums, piano
...
And assume further that you’d like to build features out of it. That means, you would want dummies and in this example, that would look like this:
violin | bass | piano | drums | cello |
1 | 1 | 1 | 0 | 0 |
0 | 1 | 0 | 1 | 0 |
0 | 1 | 1 | 0 | 1 |
1 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 1 | 0 |
The solution is surprisingly simple: there is a Pandas string method that goes by the name get_dummies. I’ve only used it before for one-hot encoding (although it’s troublesome with unseen data). But apparently, it can handle multiple categories, divided by a separator. All you need is one simple argument.
x['instruments'].str.get_dummies(', ')
This will produce the exact same table that I referred to above.
Danger: Do not use this solution on your complete dataset or you will generate data leakage. You do not want to create features out of categories that are not in your training data. What you could do is filter the columns to only contain features from the training data. Like this:
supported_columns = X_train['instruments'].str.get_dummies(', ').columns X_train_categories = X_train['instruments'].str.get_dummies(', ') X_val_categories = X_val['instruments'].str.get_dummies(', ') # This is where you filter and prevent data leakage X_val_categories = X_val_categories.filter(supported_columns) X_train = X_train.join(X_train_categories) X_val = X_val.join(X_val_categories) # Drop the instruments column X_train.drop(['instruments'],axis = 1) X_val.drop(['instruments'],axis = 1)
Finally, if you have a lot of values in the initial column, you might want to add a suffix or prefix to the features you created from it. This can be done fairly easily with a lambda function.
x['instruments'].str.get_dummies(sep=', ').rename(lambda x: 'instrument_' + x, axis='columns')
By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.
Great success!