Home ยป How to take a row-wise sum of columns in Pandas

How to take a row-wise sum of columns in Pandas

  • by
  • 3 min read

In this blog post, I explore the options for taking the row-wise sum of a subset of columns from a Pandas DataFrame.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_column': np.random.randint(0,100,50000),
    'second_column': np.random.randint(0,100,50000),
    'third_column': np.random.randint(0,100,50000),
    'other_column': np.random.randint(0,100,50000)
})

Let’s try to take the row-wise sum of the columns first_column, second_column, and third_column. This means we’re leaving other_column out.

But first, let’s take a step back. If you’d like to sum all the columns, you simply have to use the sum method and set the axis parameter to 1.

df.sum(axis = 1)

If we’d like to sum a subset of the columns, we have to select them, before calling the sum method. This can be done in multiple ways.

Row-wise sum of a subset of columns via integer indices

First, we’ll sum the columns by subsetting them by their integer index. For this, we use the iloc method.

If the columns are in a sequential position, you can use a slice object.

df.iloc[:,0:3].sum(axis = 1) # 500 microseconds

If the columns are not in a sequential position, you can use a list. Keep in mind that this subsetting method is twice as slow as using a slice.

df.iloc[:,[0,1,2]].sum(axis = 1) # 1 millisecond

You can also use the DataFrame as a NumPy array object, by using df.values. On that object, you can use a slice (which uses basic index) or a list indexer (which uses advanced/fancy index mechanisms, slowing it down).

df.values[:,0:3].sum(axis = 1) # 160 microseconds
df.values[:,[0,1,2]].sum(axis = 1) # 350 microseconds

You can see that is by far the fastest way of subsetting columns and summing them.

Row-wise sum of a subset of columns via labels

Next thing we try is summing a subset of columns that we selected via their labels. Once again, this is twice to 2.5 times as slow as using a slice. You can use loc, or drop it.

df.loc[:,['first_column', 'second_column', 'third_column']].sum(axis = 1)
df[['first_column', 'second_column', 'third_column']].sum(axis = 1) # 1.3 milliseconds

Pandas also supports a slice via labels. The following is more concise and once again, a lot faster.

df.loc[:,'first_column':'third_column'].sum(axis = 1) # 650 microseconds

An alternative is to turn each column into a pandas Series and sum them individually. In my opinion, this isn’t a very elegant solution. It’s faster than the list subset but not as fast as the slice subset.

df['first_column'] + df['second_column'] + df['third_column'] # 900 microseconds

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published.