One of the most frequently used Excel functions is probably SUMIF, together with its SUMIFS variant. In this article, you’ll learn how to do exactly the same in PySpark.
What is the sumif function?
In Excel, the SUMIF function is an aggregation function that sums values from a column, but only those values for which a condition is true. This condition does not necessarily need to relate to the values being summed; it can also relate to values in other columns.
In the table below, a SUMIF of column Surface where Shape is “Triangle” would result in 11.
| Color | Shape    | Surface |
|-------|----------|---------|
| Red   | Triangle | 5       |
| Blue  | Square   | 10      |
| Green | Triangle | 6       |
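In Excel this would be something like `=SUMIF(B2:B4, "Triangle", C2:C4)`, assuming the table sits in columns A:C with the data starting at row 2: the condition is checked against column B (Shape), while the values being summed come from column C (Surface).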
Using sumif in PySpark
To achieve the same thing in PySpark, one needs to combine a `when` expression with the `sum` aggregation function. In the following code chunk, I’ve added two variants of this function.
- The first version takes a SQL-like conditional statement and a column name.
- The second version takes a PySpark-like conditional statement and a column reference.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Variant 1: a SQL-like condition and the column name, both as strings.
sumif = lambda sql_expr, col_name: F.sum(
    F.expr(f"CASE WHEN {sql_expr} THEN {col_name} ELSE 0 END")
)

# Variant 2: a PySpark Column condition and a Column reference.
sumif2 = lambda pyspark_expr, col: F.sum(F.when(pyspark_expr, col).otherwise(0))
```
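To make the example self-contained, here is a minimal sketch that loads the table from above into a DataFrame; the session setup is an assumption for illustration, any existing DataFrame with these columns works just as well:

```python
# Hypothetical setup: build a small DataFrame matching the example table.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [('Red', 'Triangle', 5), ('Blue', 'Square', 10), ('Green', 'Triangle', 6)],
    ['Color', 'Shape', 'Surface']
)
```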
If you want to group some data and run an aggregation with one of the two sumif variants, this is how it’s done:
```python
# sdf is a Spark DataFrame with the columns Color, Shape and Surface.
sdf \
    .groupby('Color') \
    .agg(
        sumif("Shape = 'Triangle'", 'Surface').alias('sum_Surface'),
        sumif2(F.col('Shape') == 'Triangle', F.col('Surface')).alias('sum2_Surface')
    ) \
    .show()
```
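With the example data from above, both variants should agree, and the output would look something like this (row order may vary):

```
+-----+-----------+------------+
|Color|sum_Surface|sum2_Surface|
+-----+-----------+------------+
| Blue|          0|           0|
|Green|          6|           6|
|  Red|          5|           5|
+-----+-----------+------------+
```

And since SUMIFS is just SUMIF with multiple conditions, both variants extend naturally: combine conditions with `AND` in the SQL string, or with `&` between Column expressions. A quick sketch using the same hypothetical data:

```python
# SUMIFS equivalent: sum Surface where Shape is Triangle AND Color is Red.
sdf.agg(
    sumif("Shape = 'Triangle' AND Color = 'Red'", 'Surface').alias('sumifs_sql'),
    sumif2(
        (F.col('Shape') == 'Triangle') & (F.col('Color') == 'Red'),
        F.col('Surface')
    ).alias('sumifs_col')
).show()
```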
There we go.