**One of the most frequently used Excel functions is SUMIF, together with its SUMIFS variant. In this article, you’ll learn how to achieve exactly the same in PySpark.**

## What is the *sumif* function?

In Excel, the SUMIF function is an aggregation function that sums the values in a column, but only those values for which a condition is true. The condition does not necessarily need to relate to the values being summed; it can also refer to values in other columns.

In the table below, a SUMIF of column *Surface* where *Shape* is “Triangle” would result in 11.

| Color | Shape    | Surface |
|-------|----------|---------|
| Red   | Triangle | 5       |
| Blue  | Square   | 10      |
| Green | Triangle | 6       |
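To make the semantics concrete, here is the same computation in plain Python over the table above (the list of tuples simply mirrors the table rows):

```python
# Rows of the example table: (Color, Shape, Surface).
rows = [
    ("Red", "Triangle", 5),
    ("Blue", "Square", 10),
    ("Green", "Triangle", 6),
]

# SUMIF: sum Surface only where Shape equals "Triangle".
total = sum(surface for _, shape, surface in rows if shape == "Triangle")
print(total)  # 5 + 6 = 11
```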

## Using *sumif* in PySpark

To achieve the same thing in PySpark, one needs to combine a *when* with the *sum* aggregation function. In the following code chunk, I’ve added two variants of this function.

- The first version takes a SQL-like conditional statement and a column name.
- The second version takes a PySpark-like conditional statement and a column reference.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Variant 1: a SQL-like condition (string) and a column name (string).
sumif = lambda sql_expr, col_name: F.sum(
    F.expr(f"CASE WHEN {sql_expr} THEN {col_name} ELSE 0 END")
)

# Variant 2: a PySpark Column condition and a column reference.
sumif2 = lambda pyspark_expr, col: F.sum(F.when(pyspark_expr, col).otherwise(0))
```

If you want to group some data, and do an aggregation with one of the two *sumif* variations, this is how it’s done:

```python
# sdf is a Spark DataFrame with the columns Color, Shape and Surface.
sdf \
    .groupby('Color') \
    .agg(
        sumif("Shape = 'Triangle'", 'Surface').alias('sum_Surface'),
        sumif2(F.col('Shape') == 'Triangle', F.col('Surface')).alias('sum2_Surface')
    ) \
    .show()
```

There we go.