**In this blog post, I tackle a question that comes up again and again on a lot of boards. We are going to calculate the cumulative sum, but within a group that the rows belong to.**

*Thanks to vegetableagony for pointing out that, depending on the size of the dataset, other conclusions can be drawn. That’s why I edited this post and benchmarked several solutions on a small and a large dataset.*

## Small datasets

First, let’s create the data and load the packages we’ll be using throughout this post.

```r
library(plyr)        # loaded before dplyr to avoid masking issues; needed for ddply()
library(dplyr)
library(data.table)

data("mtcars")
dt <- data.table(mtcars)
df <- copy(mtcars)
```
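The median times quoted below were presumably produced with a dedicated benchmarking tool such as the *microbenchmark* package. If you want to reproduce them without extra dependencies, a rough base-R helper could look like this (a sketch; `median_time_us` is a made-up name, and wall-clock timing at microsecond resolution is noisy):

```r
# Rough base-R micro-timer: evaluates `expr` `times` times and returns
# the median elapsed wall-clock time in microseconds.
median_time_us <- function(expr, times = 100L) {
  e <- substitute(expr)   # capture the expression unevaluated
  env <- parent.frame()   # evaluate it in the caller's environment
  secs <- vapply(seq_len(times), function(i) {
    start <- Sys.time()
    eval(e, env)
    as.numeric(difftime(Sys.time(), start, units = "secs"))
  }, numeric(1))
  median(secs) * 1e6
}

# e.g. median_time_us(with(mtcars[order(mtcars$hp), ], ave(hp, cyl, gear, FUN = cumsum)))
```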

We will try to get the cumulative sum of the column *hp*, grouped by *cyl* and *gear*. (I know, it makes no sense)

First, let’s try to do it with base functions only. The underused function *ave* returns the cumulative sum by group as a single vector, so you’ll have to attach it to your data frame as a new column yourself. The median time is **425 microseconds**.

```r
with(
  df[order(df$hp), ],
  ave(hp, cyl, gear, FUN = cumsum)
)
```
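The snippet above only returns the vector; to actually attach it, assign the result to a new column of the sorted data frame (a minimal sketch; `cumulative_sum` is just an illustrative column name):

```r
# Using mtcars directly so the snippet is self-contained
df <- mtcars

# Sort once, then attach the grouped cumulative sum as a column
df_sorted <- df[order(df$hp), ]
df_sorted$cumulative_sum <- ave(df_sorted$hp, df_sorted$cyl, df_sorted$gear,
                                FUN = cumsum)
head(df_sorted[, c("hp", "cyl", "gear", "cumulative_sum")])
```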

Next, we’ll use ddply from the plyr package. This will return the data frame with the cumulative sum by group as a new column of the data frame. When we test this, we can clearly see that this is ten times slower: **4500 microseconds**.

```r
ddply(
  df[order(df$hp), ],
  .(cyl, gear),
  .fun = transform,
  cumulative_sum = cumsum(hp)
)
```

Next up, dplyr. The following chain of functions will return a tibble with the cumulative sum by group as a new column. Not as fast as the first solution, but still a lot faster than *ddply()*: **830 microseconds**.

```r
df %>%
  arrange(hp) %>%
  group_by(cyl, gear) %>%
  mutate(cumulative_sum = cumsum(hp))
```

Finally, let’s go with data.table. I propose two solutions. The first one returns the cumulative sum by group, together with the columns it was grouped by. The second one adds the cumulative sum by group as a new column to the data.table (by reference). Both solutions are somewhat slow (**2200 microseconds**), which isn’t what we expect from data.table.

```r
dt[order(hp)][, .(cumulative_sum = cumsum(hp)), by = .(cyl, gear)]
dt[order(hp)][, cumulative_sum := cumsum(hp), by = .(cyl, gear)]
```
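Whichever tool you pick, they all compute the same quantity. If you ever want a dependency-free sanity check of the results, the same grouped cumulative sum can be expressed with `split()` and `unsplit()` (a sketch in base R only):

```r
df <- mtcars[order(mtcars$hp), ]

# Build one grouping factor from cyl and gear, cumsum within each group,
# then put the pieces back in the original row order
g <- interaction(df$cyl, df$gear, drop = TRUE)
manual <- unsplit(lapply(split(df$hp, g), cumsum), g)

# Should agree with the ave() result
all.equal(manual, ave(df$hp, df$cyl, df$gear, FUN = cumsum))
```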

## Large datasets

Let’s load the *flights* dataset.

```r
library(nycflights13)
dt <- data.table(flights)
df <- copy(flights)
```

Let’s try the *ave* solution first. Given the size of the dataset, it is a lot slower: a median time of **1160 microseconds**.

```r
with(
  df[order(df$dep_delay), ],
  ave(dep_delay, carrier, year, FUN = cumsum)
)
```

So, let’s go with the *ddply* solution now. Compared to *ave*, it’s five to six times slower: **6170 microseconds**. For small datasets it was a factor of 10; now it’s only a factor of 6, so that’s an improvement, I guess? 🙂

```r
ddply(
  df[order(df$dep_delay), ],
  .(carrier, year),
  .fun = transform,
  cumulative_sum = cumsum(dep_delay)
)
```

Next up is *dplyr*. For the small dataset, it took double the time to generate the result. Now, it’s a little faster than the *ave* solution: **980 microseconds**.

```r
df %>%
  arrange(dep_delay) %>%
  group_by(carrier, year) %>%
  mutate(cumulative_sum = cumsum(dep_delay))
```

Finally, I’m going for *data.table* again. Surprisingly, it gets the job done in **600 microseconds**. This is what we expect, given the advertised speed of data.table.

```r
dt[order(dep_delay)][, .(cumulative_sum = cumsum(dep_delay)), by = .(carrier, year)]
dt[order(dep_delay)][, cumulative_sum := cumsum(dep_delay), by = .(carrier, year)]
```
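One caveat before trusting results on *flights*: *dep_delay* contains missing values, and `cumsum()` propagates an `NA` to every later element of its group. If that is not what you want, handle the NAs first (a sketch; treating a missing delay as zero is just one possible choice):

```r
x <- c(3, NA, 5)

# cumsum() propagates the NA to everything after it
cumsum(x)                          # 3 NA NA

# One way around it: treat missing values as zero before summing
cumsum(ifelse(is.na(x), 0, x))     # 3 3 8
```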

## Conclusion

For small datasets, the base function *ave* is the fastest solution. As you can see, sometimes you have to kill your darlings. For big datasets, however, nothing beats data.table.

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

Great success!
