
How to do a SUMIF in PySpark


One of the most frequently used Excel functions is probably SUMIF, together with its SUMIFS variant. In this article, you’ll learn how to do exactly the same thing in PySpark.

What is the SUMIF function?

In Excel, the SUMIF function is an aggregation function that sums the values in a column, but only those for which a condition is true. The condition does not have to refer to the values being summed; it can also refer to values in other columns.

In the table below, a SUMIF of column Surface where Shape is “Triangle” would result in 5 + 6 = 11 (in Excel that would be, for example, =SUMIF(B2:B4, "Triangle", C2:C4), assuming Shape and Surface sit in columns B and C).

Color | Shape    | Surface
------|----------|--------
Red   | Triangle | 5
Blue  | Square   | 10
Green | Triangle | 6

Using sumif in PySpark

To achieve the same thing in PySpark, you need to combine a when conditional with the sum aggregation function. In the following code chunk, I’ve defined two variants of this function:

  • The first version takes a SQL-like conditional statement and a column name.
  • The second version takes a PySpark Column condition and a column reference.

from pyspark.sql import functions as F

# Variant 1: a SQL condition string and a column name.
sumif = lambda sql_expr, col_name: F.sum(F.expr(f"CASE WHEN {sql_expr} THEN {col_name} ELSE 0 END"))

# Variant 2: a PySpark Column condition and a column reference.
sumif2 = lambda pyspark_expr, col: F.sum(F.when(pyspark_expr, col).otherwise(0))
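
A small design note: the .otherwise(0) in the second variant is what makes a group with no matching rows come back as 0. If you drop it, as in the hypothetical sumif3 sketched below, non-matching rows become NULL, and since sum ignores NULLs you get the same totals, except that an all-non-matching group yields NULL instead of 0:

# Hypothetical variant without .otherwise(0): rows that fail the condition
# become NULL. sum() skips NULLs, so totals are unchanged, but a group with
# no matching rows returns NULL instead of 0.
sumif3 = lambda pyspark_expr, col: F.sum(F.when(pyspark_expr, col))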

If you want to group some data and aggregate it with one of the two sumif variants, this is how it’s done:

# sdf is a Spark DataFrame with the columns Color, Shape and Surface.
sdf \
    .groupby('Color') \
    .agg(
        sumif("Shape = 'Triangle'", 'Surface').alias('sum_Surface'),
        sumif2(F.col('Shape') == 'Triangle', F.col('Surface')).alias('sum2_Surface')
    ) \
    .show()
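
If you want a self-contained run, here is a minimal sketch that builds sdf from the example table above; the session setup and the appName are my own choices, not part of the original snippet:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sumif-demo").getOrCreate()

# Recreate the example table from the start of the article.
sdf = spark.createDataFrame(
    [("Red", "Triangle", 5), ("Blue", "Square", 10), ("Green", "Triangle", 6)],
    ["Color", "Shape", "Surface"],
)

# Without grouping, this reproduces the Excel example: 5 + 6 = 11.
sdf.agg(sumif("Shape = 'Triangle'", "Surface").alias("sum_Surface")).show()

With the grouped version above, you’d get 5 for Red, 0 for Blue, and 6 for Green.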

There we go.
