Today, I tried a data transformation that seemed obvious: splitting the string values of a Pandas column on a delimiter and one-hot encoding the resulting strings. However, it took me quite some time to figure out how to do it elegantly.
Here’s what I wanted to achieve. I had a DataFrame like this:
I wanted to turn it into this:
To break it down, this can be achieved by doing two transformations:
- Split string on a delimiter
- One-hot encode the resulting values
Although it seems easy, I had a hard time imagining how to get from the intermediate state (columns holding the first, second and third string after the split) to the final state.
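To make that intermediate state concrete, here is a minimal sketch. The data is made up for illustration; only the column name string_column comes from the snippet further down in this post.

```python
import pandas as pd

# Hypothetical example data, not the DataFrame from the original post.
df = pd.DataFrame({"string_column": ["a,b", "b,c", "a,b,c"]})

# expand=True yields one column per position (first, second, third string),
# not one column per distinct value, so this is not yet one-hot encoded.
intermediate = df["string_column"].str.split(",", expand=True)
print(intermediate)
```

The same value can end up in different positional columns ("b" is second in the first row and first in the second row), which is what makes the jump to one column per value awkward.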
Luckily, Pandas has an out-of-the-box method that performs both transformations at once: the get_dummies method of the Series string accessor (Series.str.get_dummies), which behaves quite differently from the general Pandas function of the same name (pandas.get_dummies).
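As a rough illustration of that difference, the sketch below runs both on the same made-up Series (the values are an assumption for the example):

```python
import pandas as pd

# Hypothetical example data.
s = pd.Series(["a,b", "b,c", "a,b,c"])

# The general function treats each whole string as a single category.
whole = pd.get_dummies(s)
print(list(whole.columns))   # ['a,b', 'a,b,c', 'b,c']

# The string-accessor method splits on the separator first,
# then creates one indicator column per distinct piece.
pieces = s.str.get_dummies(sep=",")
print(list(pieces.columns))  # ['a', 'b', 'c']
```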
By using the sep parameter (which defaults to '|'), one can apply one-hot encoding to a single Series whose values each contain multiple items separated by a delimiter:
df['string_column'].str.get_dummies(sep=',')
Simple as that!
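For completeness, here is a small end-to-end sketch on made-up data. The id column and the values are assumptions for illustration, and joining the result back onto the frame is just one way to finish the job.

```python
import pandas as pd

# Hypothetical example data; "id" is an invented extra column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "string_column": ["a,b", "b,c", "a,b,c"],
})

# Split on the comma and one-hot encode in a single step.
dummies = df["string_column"].str.get_dummies(sep=",")

# Replace the original string column with the indicator columns.
result = df.drop(columns="string_column").join(dummies)
print(result)
```

The result keeps the id column and gains one 0/1 column per distinct value (a, b and c here), aligned on the original index.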
By the way, I didn’t come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know that I rely heavily on websites like Stack Overflow and on my collection of coding books. This one by Matt Harrison (on Pandas 1.x!) was updated in 2020 and is an excellent primer on Pandas basics. If you want something broader, ranging from data wrangling to machine learning, try “Hands-On Data Analysis with Pandas” by Stefanie Molin.