Home » How to concatenate text as aggregation in a Pandas groupby

How to concatenate text as aggregation in a Pandas groupby

  • by
  • 2 min read

This is a cool one I used for a feature engineering task I did recently. I had multiple documents in a Pandas DataFrame, in long format. These documents belonged to people and it had an n:1 relation: people could have multiple documents. I was wondering how to concatenate each person’s documents while grouping the DataFrame per person. Here’s how I solved it.

Here is some dummy data.

import pandas as pd
df = pd.DataFrame(data = {'id': ['aaa', 'aaa', 'bbb', 'bbb', 'ccc'],
                          'val': ['hello', 'world', 'foo', 'bar', 'rabbit']})

In the next chunk of code, I group the DataFrame by id. I make sure to set as_index to False, because I need the id later on. Next I aggregate the val column by joining all the documents per id.

df.groupby(['id'], as_index = False).agg({'val': ' '.join})

Mission solved! But there’s a nice extra. Oftentimes, you’re gonna want more than just concatenate the text. It might be interesting to know other properties. By passing a list of functions, you can actually set multiple aggregations for one column. In the next line of code I count the amount of rows per ID.

df.groupby(['id'], as_index = False).agg({'val': [' '.join, 'count']})

This will return a MultiIndex DataFrame with the initial column name as the highest level, and the function names below. Read my other blog post if you want to know how to get rid of the MultiIndex object.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published. Required fields are marked *