**In this blog post, I tackle a question that pops up on a lot of boards: how do you calculate the cumulative sum within the group that each row belongs to?**

*Thanks to vegetableagony for pointing out that, depending on the size of the dataset, other conclusions can be drawn. That’s why I edited this post and benchmarked several solutions on both a small and a large dataset.*

## Small datasets

First, let’s create the data and load the packages we’ll be using throughout this post.

```r
library(plyr)   # loaded before dplyr so dplyr's verbs take precedence
library(dplyr)
library(data.table)

data("mtcars")
dt <- data.table(mtcars)
df <- copy(mtcars)
```

We will compute the cumulative sum of the column *hp*, grouped by *cyl* and *gear*. (I know, it makes no sense.)

First, let’s try to do it with only base functions. The underused function *ave* will return the cumulative sum by group in a single vector. So you’ll have to concatenate it to your data frame. The median time is **425 microseconds.**

```r
with(
  df[order(df$hp), ],
  ave(hp, cyl, gear, FUN = cumsum)
)
```
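Since *ave* only returns a vector, attaching it to the data frame takes one extra step. A minimal sketch (the column name `cumulative_sum` is just illustrative):

```r
# df as created above from mtcars
df <- mtcars

# Order the data frame once, then attach the grouped cumulative sum
# that ave() returns as a new column.
df_sorted <- df[order(df$hp), ]
df_sorted$cumulative_sum <- ave(df_sorted$hp, df_sorted$cyl, df_sorted$gear, FUN = cumsum)
head(df_sorted[, c("cyl", "gear", "hp", "cumulative_sum")])
```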

Next, we’ll use ddply from the plyr package. This will return the data frame with the cumulative sum by group as a new column of the data frame. When we test this, we can clearly see that this is ten times slower: **4500 microseconds**.

```r
ddply(
  df[order(df$hp), ],
  .(cyl, gear),
  .fun = transform,
  cumulative_sum = cumsum(hp)
)
```

Next up, dplyr. The following chain of functions will return a tibble with the cumulative sum by group as a new column. Not as fast as the first solution, but still a lot faster than *ddply()*: **830 microseconds**.

```r
df %>%
  arrange(hp) %>%
  group_by(cyl, gear) %>%
  mutate(cumulative_sum = cumsum(hp))
```

Finally, let’s go with data.table. I propose two solutions. The first one returns the cumulative sum by group together with the columns it was grouped by. The second one adds the cumulative sum by group as a new column to the data.table. Both solutions are somewhat slow (**2200 microseconds**), which isn’t what we expect from data.table.

```r
dt[order(hp)][, .(cumulative_sum = cumsum(hp)), by = .(cyl, gear)]
dt[order(hp)][, cumulative_sum := cumsum(hp), by = .(cyl, gear)]
```
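The times quoted throughout are medians from repeated runs. A sketch of how you could reproduce them with the *microbenchmark* package (the iteration count is an assumption; the original setup isn’t shown):

```r
library(microbenchmark)
library(data.table)

df <- mtcars
dt <- data.table(mtcars)

# Median times will differ per machine; the numbers quoted in the
# text are indicative, not absolute.
microbenchmark(
  ave        = with(df[order(df$hp), ], ave(hp, cyl, gear, FUN = cumsum)),
  data.table = dt[order(hp)][, .(cumulative_sum = cumsum(hp)), by = .(cyl, gear)],
  times = 100
)
```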

## Large datasets

Let’s load the *flights* dataset from the nycflights13 package.

```r
library(nycflights13)

dt <- data.table(flights)
df <- copy(flights)
```

Let’s try the *ave* solution first. Given the size of the dataset, this is a lot slower: a median time of **1160 microseconds**.

```r
with(
  df[order(df$dep_delay), ],
  ave(dep_delay, carrier, year, FUN = cumsum)
)
```

Now, let’s go with the *ddply* solution. Compared to *ave*, it’s five to six times slower: **6170 microseconds**. For the small dataset the factor was ten; now it’s only six, so that’s an improvement, I guess? 🙂

```r
ddply(
  df[order(df$dep_delay), ],
  .(carrier, year),
  .fun = transform,
  cumulative_sum = cumsum(dep_delay)
)
```

Next up is *dplyr*. For the small dataset, it took double the time of *ave* to generate the result. Now, it’s a little faster than the *ave* solution: **980 microseconds**.

```r
df %>%
  arrange(dep_delay) %>%
  group_by(carrier, year) %>%
  mutate(cumulative_sum = cumsum(dep_delay))
```

Finally, I’m going for *data.table* again. Surprisingly, it gets the job done in **600 microseconds**. This is what we expect, given the advertised speed of data.table.

```r
dt[order(dep_delay)][, .(cumulative_sum = cumsum(dep_delay)), by = .(carrier, year)]
dt[order(dep_delay)][, cumulative_sum := cumsum(dep_delay), by = .(carrier, year)]
```
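Before comparing timings, it’s worth a sanity check that the solutions actually agree. A quick sketch (note that *dep_delay* contains NAs, which `cumsum` propagates the same way in both approaches, since both put NAs last when sorting):

```r
library(dplyr)
library(data.table)
library(nycflights13)

dt <- data.table(flights)
df <- copy(flights)

# dplyr result
res_dplyr <- df %>%
  arrange(dep_delay) %>%
  group_by(carrier, year) %>%
  mutate(cumulative_sum = cumsum(dep_delay))

# data.table result
res_dt <- dt[order(dep_delay)][, cumulative_sum := cumsum(dep_delay), by = .(carrier, year)]

# Both should produce the same set of cumulative sums
all.equal(sort(res_dplyr$cumulative_sum), sort(res_dt$cumulative_sum))
```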

## Conclusion

For small datasets, the base function *ave* is the fastest solution. As you can see, sometimes you have to kill your darlings. Yet for big datasets, nothing beats data.table.

Great success!
