One of the most frequently used Excel functions is probably SUMIF, together with its SUMIFS variant. In this article, you’ll learn how to do exactly the same in PySpark.
What is the sumif function?
In Excel, the SUMIF function is an aggregation function that sums values from a column, but only those values for which a condition is true. This condition does not necessarily need to relate to the values being summed; it can also relate to values in other columns.
In the table below, a SUMIF of column Surface where Shape is “Triangle” would result in 11.
| Color | Shape    | Surface |
|-------|----------|---------|
| Red   | Triangle | 5       |
| Blue  | Square   | 10      |
| Green | Triangle | 6       |
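In Excel this would be something like `=SUMIF(B2:B4, "Triangle", C2:C4)`, assuming the table sits in columns A:C with the data starting at row 2: the condition is checked against column B (Shape), while the values being summed come from column C (Surface).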
Using sumif in PySpark
To achieve the same thing in PySpark, one needs to combine a `when` expression with the `sum` aggregation function. In the following code chunk, I’ve added two variants of this function.
- The first version takes a SQL-like conditional statement and a column name.
- The second version takes a PySpark-like conditional statement and a column reference.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Variant 1: a SQL-like condition and the column name, both as strings.
sumif = lambda sql_expr, col_name: F.sum(
    F.expr(f"CASE WHEN {sql_expr} THEN {col_name} ELSE 0 END")
)

# Variant 2: a PySpark Column condition and a Column reference.
sumif2 = lambda pyspark_expr, col: F.sum(F.when(pyspark_expr, col).otherwise(0))
```
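To make the example self-contained, here is a minimal sketch that loads the table from above into a DataFrame; the session setup is an assumption for illustration, any existing DataFrame with these columns works just as well:

```python
# Hypothetical setup: build a small DataFrame matching the example table.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [('Red', 'Triangle', 5), ('Blue', 'Square', 10), ('Green', 'Triangle', 6)],
    ['Color', 'Shape', 'Surface']
)
```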
If you want to group some data and run an aggregation with one of the two sumif variants, this is how it’s done:
```python
# sdf is a Spark DataFrame with the columns Color, Shape and Surface.
sdf \
    .groupby('Color') \
    .agg(
        sumif("Shape = 'Triangle'", 'Surface').alias('sum_Surface'),
        sumif2(F.col('Shape') == 'Triangle', F.col('Surface')).alias('sum2_Surface')
    ) \
    .show()
```
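With the example data from above, both variants should agree, and the output would look something like this (row order may vary):

```
+-----+-----------+------------+
|Color|sum_Surface|sum2_Surface|
+-----+-----------+------------+
| Blue|          0|           0|
|Green|          6|           6|
|  Red|          5|           5|
+-----+-----------+------------+
```

And since SUMIFS is just SUMIF with multiple conditions, both variants extend naturally: combine conditions with `AND` in the SQL string, or with `&` between Column expressions. A quick sketch using the same hypothetical data:

```python
# SUMIFS equivalent: sum Surface where Shape is Triangle AND Color is Red.
sdf.agg(
    sumif("Shape = 'Triangle' AND Color = 'Red'", 'Surface').alias('sumifs_sql'),
    sumif2(
        (F.col('Shape') == 'Triangle') & (F.col('Color') == 'Red'),
        F.col('Surface')
    ).alias('sumifs_col')
).show()
```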
There we go.