Pandas' pivot_table vs. pivot

When you’re an R poweruser, pivoting tables in pandas feels unnecessarily complex. Why are there two pivot functions? Why does it return an index when you wanted a column? Why does it generate multi index columns? Those are the questions I tackle in this blog post.

💥 This blog post has received an update over here.

To answer some questions about pivoting in pandas, I first generate some dummy data.

import random
import pandas as pd
import numpy as np

fruits = random.choices(['apple','pear','plum'], k = 1000)
colors = random.choices(['green','yellow','red','orange','brown'], k = 1000)
tastes = random.choices(['sour','bitter','sweet'], k = 1000)
sizes = random.choices(['small','big'], k = 1000)
calories = np.round(np.random.normal(50,10,1000), decimals = 0)
df = pd.DataFrame({'fruit': fruits, 
                   'color': colors, 
                   'taste': tastes, 
                   'size': sizes, 
                   'calories': calories})
df.head()

Let’s say, we want to turn the colors into columns by pivoting them using the pivot_table() function. For every fruit we want to know the amount per color. If we do this analogously to how we use dcast in R, we would do something like this. First of all, if we don’t want the fruit as the index, but as a column we have to use the reset_index() function. Resetting the index is not necessary. However, as an R user, it feels more natural to me. If you like stacking and unstacking DataFrames, you shouldn’t reset the index.

df.pivot_table(columns = 'color', index = 'fruit', aggfunc = len).reset_index()

But more importantly, we get this strange result. Every column we didn’t use in our pivot_table() function has been used to calculate the number of fruits per color and the result is constructed in a hierarchical DataFrame. That wasn’t supposed to happen.

There’s two ways we can solve this. First, we can select one column that we want to feed to the len() function (the aggfunc parameter). The second option is to limit the amount of columns you want to work with before applying the pivot_table() function to it.

df.pivot_table(columns = 'color', index = 'fruit', values = 'taste', aggfunc = len).reset_index() # option 1
df[['color','fruit']].pivot_table(columns = 'color', index = 'fruit', aggfunc = len).reset_index() #option 2

Both solutions will produce the same result.

Okay, but what does the pivot() function offer? This is what the documentation says:

Reshape data (produce a “pivot” table) based on column values. Uses unique values
from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns.

Basically, the pivot_table() function is a generalization of the pivot() function that allows aggregation of values — for example, through the len() function in the previous example. Pivot only works — or makes sense — if you need to pivot a table and show values without any aggregation. Here’s an example.

fruits = ['apple','apple','apple','pear','pear','pear','prune','prune','prune']
colors = ['red','green','orange','red','green','orange','red','green','orange']
calories = np.round(np.random.normal(50,10,9), decimals = 0)
df = pd.DataFrame({'fruit': fruits, 
                   'color': colors, 
                   'calories': calories})

df.pivot(index = 'fruit', columns = 'color', values = 'calories').reset_index()

Nevertheless, you can get the same result using pivot_table, but it’s a bit silly to take the mean of a single value.

df.pivot_table(index = 'fruit', columns = 'color', values = 'calories', aggfunc = np.mean).reset_index()

Pivotting in pandas offers a lot more functionalities than in R. As a pandas starter, these features felt somewhat overwhelming to me. By sharing my struggles, I hope you have learned a thing or two.

Great success!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Pandas’ pivot_table vs. pivot

Say thanks, ask questions or give feedback

Leave a Reply Cancel reply

Pandas’ pivot_table vs. pivot

Say thanks, ask questions or give feedback

Leave a Reply Cancel reply

Related Posts

How to do a SUMIF in PySpark

Check if Python logger already exists

Spark 3.0: Solving the “dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z” error