In this blog post, I tackle a question that comes up time and again on a lot of message boards: calculating the cumulative sum of a column, but within the groups the rows belong to.
Thanks to vegetableagony for pointing out that, depending on the size of the dataset, other conclusions can be drawn. That’s why I edited this post and benchmarked several solutions on both a small and a large dataset.
First, let’s create the data and load the packages we’ll be using throughout this post.
library(dplyr)
library(data.table)

data("mtcars")
dt <- data.table(mtcars)
df <- copy(mtcars)
We will try to get the cumulative sum of the column hp, grouped by cyl and gear. (I know, the combination makes no sense, but it serves the example.)
First, let’s try to do it with base functions only. The underused function ave() returns the cumulative sum by group as a single vector, so you’ll have to attach it to your data frame yourself. The median time is 425 microseconds.
with(
  df[order(df$hp), ],
  ave(hp, cyl, gear, FUN = cumsum)
)
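Since ave() only returns the vector, here is one way to attach it back to the data frame, assuming you keep the rows in the sorted order. This is a sketch; the column name cumulative_sum is my choice.

```r
# Sort once, then attach the grouped cumulative sum as a new column.
df_sorted <- df[order(df$hp), ]
df_sorted$cumulative_sum <- with(df_sorted, ave(hp, cyl, gear, FUN = cumsum))
head(df_sorted[, c("hp", "cyl", "gear", "cumulative_sum")])
```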
Next, we’ll use ddply() from the plyr package. It returns the data frame with the cumulative sum by group added as a new column. When we benchmark this, we can clearly see that it is about ten times slower: 4500 microseconds.
# Note: loading plyr after dplyr masks some dplyr verbs (mutate, summarise, ...).
library(plyr)

ddply(
  df[order(df$hp), ],
  .(cyl, gear),
  .fun = transform,
  cumulative_sum = cumsum(hp)
)
Next up, dplyr. The following chain of functions returns a tibble with the cumulative sum by group as a new column. Not as fast as the ave() solution, but still a lot faster than ddply(): 830 microseconds.
df %>%
  arrange(hp) %>%
  group_by(cyl, gear) %>%
  mutate(cumulative_sum = cumsum(hp))
Finally, let’s go with data.table. I propose two solutions. The first one returns the cumulative sum by group along with the columns it was grouped by. The second one adds the cumulative sum by group as a new column to the data.table. Both solutions are somewhat slow (2200 microseconds), which isn’t what we expect from data.table.
dt[order(hp)][, .(cumulative_sum = cumsum(hp)), by = .(cyl, gear)]

dt[order(hp)][, cumulative_sum := cumsum(hp), by = .(cyl, gear)]
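For reference, timings like the ones quoted in this post can be collected with the microbenchmark package; the exact numbers will of course vary by machine. A sketch, comparing the fastest two contenders on the small dataset:

```r
library(microbenchmark)
library(data.table)

data("mtcars")
df <- mtcars
dt <- data.table(mtcars)

# Compare median times over 100 runs of each expression.
microbenchmark(
  ave        = with(df[order(df$hp), ], ave(hp, cyl, gear, FUN = cumsum)),
  data.table = dt[order(hp)][, .(cumulative_sum = cumsum(hp)), by = .(cyl, gear)],
  times = 100
)
```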
Now for the large dataset: let’s load the flights data from the nycflights13 package.
library(nycflights13)

dt <- data.table(flights)
df <- copy(flights)
Let’s try the ave() solution first. Given the size of the dataset, this is a lot slower: a median time of 1160 microseconds.
with(
  df[order(df$dep_delay), ],
  ave(dep_delay, carrier, year, FUN = cumsum)
)
So, let’s run the ddply() solution now. Compared to ave(), it’s five to six times slower: 6170 microseconds. For the small dataset the factor was ten; now it’s only six, so that’s an improvement, I guess? 🙂
ddply(
  df[order(df$dep_delay), ],
  .(carrier, year),
  .fun = transform,
  cumulative_sum = cumsum(dep_delay)
)
Next up is dplyr. For the small dataset, it took about double the time of ave(); now it’s a little faster than the ave() solution: 980 microseconds.
df %>%
  arrange(dep_delay) %>%
  group_by(carrier, year) %>%
  mutate(cumulative_sum = cumsum(dep_delay))
Finally, I’m going for data.table again. Surprisingly, it gets the job done in 600 microseconds. This is what we expect, given the advertised speed of data.table.
dt[order(dep_delay)][, .(cumulative_sum = cumsum(dep_delay)), by = .(carrier, year)]

dt[order(dep_delay)][, cumulative_sum := cumsum(dep_delay), by = .(carrier, year)]
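As an aside, idiomatic data.table would sort in place with setorder() rather than build a sorted copy via order() inside [ ]. A sketch of the same computation under that approach (untimed, so I make no claim about how it compares to the numbers above):

```r
library(data.table)
library(nycflights13)

dt <- data.table(flights)
setorder(dt, dep_delay)  # sort in place instead of creating a sorted copy
dt[, cumulative_sum := cumsum(dep_delay), by = .(carrier, year)]
```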
Conclusion: for small datasets, the base function ave() is the fastest solution. As you can see, sometimes you have to kill your darlings. For big datasets, however, nothing beats data.table.
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, rather than the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!