In this blog post, I explore the options for taking the row-wise sum of a subset of columns from a Pandas DataFrame.
import numpy as np import pandas as pd df = pd.DataFrame({ 'first_column': np.random.randint(0,100,50000), 'second_column': np.random.randint(0,100,50000), 'third_column': np.random.randint(0,100,50000), 'other_column': np.random.randint(0,100,50000) })
Let’s try to take the row-wise sum of the columns first_column, second_column, and third_column. This means we’re leaving other_column out.
But first, let’s take a step back. If you’d like to sum all the columns, you simply have to use the sum method and set the axis parameter to 1.
df.sum(axis = 1)
If we’d like to sum a subset of the columns, we have to select them, before calling the sum method. This can be done in multiple ways.
Row-wise sum of a subset of columns via integer indices
First, we’ll sum the columns by subsetting them by their integer index. For this, we use the iloc method.
If the columns are in a sequential position, you can use a slice object.
df.iloc[:,0:3].sum(axis = 1) # 500 microseconds
If the columns are not in a sequential position, you can use a list. Keep in mind that this subsetting method is twice as slow as using a slice.
df.iloc[:,[0,1,2]].sum(axis = 1) # 1 millisecond
You can also use the DataFrame as a NumPy array object, by using df.values. On that object, you can use a slice (which uses basic index) or a list indexer (which uses advanced/fancy index mechanisms, slowing it down).
df.values[:,0:3].sum(axis = 1) # 160 microseconds df.values[:,[0,1,2]].sum(axis = 1) # 350 microseconds
You can see that is by far the fastest way of subsetting columns and summing them.
Row-wise sum of a subset of columns via labels
Next thing we try is summing a subset of columns that we selected via their labels. Once again, this is twice to 2.5 times as slow as using a slice. You can use loc, or drop it.
df.loc[:,['first_column', 'second_column', 'third_column']].sum(axis = 1) df[['first_column', 'second_column', 'third_column']].sum(axis = 1) # 1.3 milliseconds
Pandas also supports a slice via labels. The following is more concise and once again, a lot faster.
df.loc[:,'first_column':'third_column'].sum(axis = 1) # 650 microseconds
An alternative is to turn each column into a pandas Series and sum them individually. In my opinion, this isn’t a very elegant solution. It’s faster than the list subset but not as fast as the slice subset.
df['first_column'] + df['second_column'] + df['third_column'] # 900 microseconds