Four ways to cast a Pandas DataFrame from long to wide format

Because there are (too) many ways to cast a Pandas DataFrame from long to wide format, I decided to list four ways to achieve that goal. The four functions I describe in this article are the following.

Function	Object	aggregation	Can handle NaNs
pivot	DataFrame	no	no
pivot_table	DataFrame	yes	yes
unstack	DataFrame with MultiIndex rows	no	no
crosstab	Series/Arrays	yes	no

First, let’s create some dummy data. It’s a simple 5 row data frame that describes the color, taste and calories of fruit.

import random
import pandas as pd
import numpy as np

fruits = ['apple','apple', 'pear','pear','plum']
colors = ['green','green','red','yellow','red']
tastes = ['sour','bitter','sweet','sour', 'sweet']
calories = np.round(np.random.normal(50,10,5), decimals = 0)
df = pd.DataFrame({'fruit': fruits, 
                   'color': colors, 
                   'taste': tastes, 
                   'calories': calories})
df.head()

Pivot

The pivot function reshapes DataFrames by casting a the values of a column to a number of columns, based on the number of unique values within that column.

df.pivot(index = 'fruit', columns = 'taste', values = 'calories')

To use the pivot function, it is required that all column/index combinations are unique.

df.pivot(index = 'fruit',  columns = 'color', values = 'calories')

This example will produce the following error, because the combination of fruit/color produces a duplicate row, because when using pivot, the values aren’t aggregated.

ValueError: Index contains duplicate entries, cannot reshape

🙋‍♀️ Here’s an overview of what you should remember about pivot():

A function with limited features to reshape DataFrames from long to wide format
Requires a unique index/column combination
Does not aggregate values
Might produce NaNs, doesn’t have a parameter to handle them

Pivot_table

Just like pivot(), the pivot_table function reshapes DataFrames by casting them from long to wide.

df.pivot_table(index = 'fruit', columns = 'color', values = 'calories', \
               aggfunc = 'sum', fill_value = 0)

Unlike pivot(), it can aggregate numeric columns using the aggfunc parameter, so it does not need unique row/column combinations. Of course, it can still produce row/column combinations that are empty, but these can be filled using the fill_value parameter.

🙋‍♀️ Here’s an overview of what you should remember about pivot_table():

A function to reshape DataFrames from long to wide format
Can aggregate values
Can handle NaNs by filling it with a value of your choice

Unstack

The unstack function, is a fast and convenient way to cast a MultiIndex DataFrame from wide to long format. It will pivot the values of the index with the highest level. You’ll end up with a DataFrame with MultiIndex columns. Consequently, if your goal is to end up with a normal DataFrame, you’ll have to flatten it afterwards.

df[['fruit', 'taste', 'calories']].set_index(['fruit','taste']).unstack()

💥 Keep in mind, just like pivot(), the unstack function requires unique row/column combinations. If you don’t respect this, you’ll end up with the error “ValueError: Index contains duplicate entries, cannot reshape”.

🙋‍♀️ Here’s an overview of what you should remember about unstack():

A function to quickly pivot the index with the highest level to columns.
Can not aggregate values.
Can’t handle NaNs.

Crosstab

Another way to pivot your table from long to wide format, is using the crosstab function. It’s not necessarily designed for data transformation, but it is a great tool for presenting data in a comprehensible way. Hence the name ‘crosstab’.

pd.crosstab(index = df.fruit, columns = df.taste, values = df.calories, aggfunc = 'sum' )

Important to know is that if you set the values parameter, you also need to specify an aggfunc. Otherwise, you’ll run into the following error:

ValueError: values cannot be used without an aggfunc.

🙋‍♀️ Here’s an overview of what you should remember about crosstab():

Turn series or arrays into a DataFrame in wide format.
Requires agrgegating values.
Cannot handle NaNs.

By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Four ways to cast a Pandas DataFrame from long to wide format

Pivot

Pivot_table

Unstack

Crosstab

Say thanks, ask questions or give feedback

Leave a Reply Cancel reply

Related Posts

How to do a SUMIF in PySpark

Check if Python logger already exists

Spark 3.0: Solving the “dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z” error