This is a cool one I used for a feature engineering task I did recently. I had multiple documents in a Pandas DataFrame, in long format. These documents belonged to people and it had an n:1 relation: people could have multiple documents. I was wondering how to concatenate each person’s documents while grouping the DataFrame per person. Here’s how I solved it.
Here is some dummy data.
import pandas as pd df = pd.DataFrame(data = {'id': ['aaa', 'aaa', 'bbb', 'bbb', 'ccc'], 'val': ['hello', 'world', 'foo', 'bar', 'rabbit']})
In the next chunk of code, I group the DataFrame by id. I make sure to set as_index to False, because I need the id later on. Next I aggregate the val column by joining all the documents per id.
df.groupby(['id'], as_index = False).agg({'val': ' '.join})
Mission solved! But there’s a nice extra. Oftentimes, you’re gonna want more than just concatenate the text. It might be interesting to know other properties. By passing a list of functions, you can actually set multiple aggregations for one column. In the next line of code I count the amount of rows per ID.
df.groupby(['id'], as_index = False).agg({'val': [' '.join, 'count']})
This will return a MultiIndex DataFrame with the initial column name as the highest level, and the function names below. Read my other blog post if you want to know how to get rid of the MultiIndex object.
By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.
Great success!