Recently, I needed to calculate the relative change of a value compared to the first value within a DataFrame group, a GroupBy object. With SQL, I would use a window function (e.g. PARTITION BY in PostgreSQL). In this article, we’ll do it in Pandas (Python).
To demonstrate, let’s start with creating a DataFrame. Each row is a subscriber, for which you have to order they subscribed in, their gender and their height.
import pandas as pd df = pd.DataFrame({ 'subscription_order': [3,6,1,2,4,8,5,7], 'gender': ['M', 'M', 'M', 'F', 'F', 'F', 'X', 'X'], 'height': [180, 195, 165, 178, 164, 173, 182, 163] }) df.sort_values(['gender','subscription_order'], inplace = True)
The mission: calculate the relative height of each subscriber to the height of the first subscriber of that gender.
To achieve this, you would like to have a column that has on each row the height of the first subscriber for the gender each subscriber belongs to. Finally, you’d like to divide the height of the subscriber by the height of that first member.
You could do the aggregation, followed by a join on the initial DataFrame. But that’s really lame, no? Here’s the trick: to broadcast an aggregated value to each row, we can use this handy property of the transform method:
f must return a value that either has the same shape as the input subframe or can be broadcast to the shape of the input subframe. For example, if f returns a scalar it will be broadcast to have the same shape as the input subframe.
Documentation of the transform method
Inside it, you specify the required aggregation function. In our case: first.
df['first_height'] = df.groupby('gender')['height'].transform('first')
By the way, you can also do this with a lambda function. Here are two alternatives — the latter is syntactic sugar for the former. Keep in mind that these are ~60% slower.
df['first_height'] = df.groupby('gender')['height'].transform(lambda x: np.take(x.values, 0)) df['first_height'] = df.groupby('gender')['height'].transform(lambda x: x.values[0])
To finish our mission, we now divide the height column by the first_height column.
df['compared_to_first'] = df['height'] / df['first_height']
And that’s how it’s done.
By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.
Hey, Jack here. I’m hooked on your website’s content – it’s informative, engaging, and always up-to-date. Thanks for setting the bar high!