
# When speed matters: going from randomForest to ranger

Random Forest remains my number one go-to algorithm for quickly prototyping prediction models. Last week, I worked on speeding up a feature engineering and training workflow for a marketing project. I moved from the traditional randomForest package to ranger, a package that is already three years old. Here are my findings.

Let’s load the Adult dataset through the arules package. We keep only the rows without any NAs and then take a sample of 1000 rows. This is the small data set we will use to benchmark both packages.

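library(arules)          # ships the Adult data (AdultUCI is the data-frame version)
library(randomForest)
library(data.table)
library(microbenchmark)
library(ranger)
set.seed(19880303)
# Load the data and convert it to a data.table (these two lines were missing from the snippet)
data("AdultUCI")
dt <- as.data.table(AdultUCI)
# Keep complete rows only, then draw a random sample of 1000 rows
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]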

## Training

### Output

First, let’s take a look at the output of randomForest(). It prints the number of trees and the number of variables it tried at each split. It also shows the out-of-bag error rate.

ranger(), on the other hand, prints more information, such as the split rule.

The out-of-bag prediction error differs. The two implementations don’t work in exactly the same way, so this is to be expected.
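If you want to compare these numbers programmatically instead of reading them off the printed output, the sketch below shows one way to do it. It is a minimal example under a few assumptions: the built-in iris data stands in for dt (with Species playing the role of income), and the names rf_model, rg_model, rf_oob, rg_oob and rg_rule are mine, not from the original workflow.

```r
library(randomForest)
library(ranger)

set.seed(42)

# Assumption: iris is only a stand-in for dt so that the example runs on its own
rf_model <- randomForest(x = iris[, 1:4], y = iris$Species,
                         ntree = 500, mtry = 2)
rg_model <- ranger(dependent.variable.name = "Species", data = iris,
                   num.trees = 500, mtry = 2, num.threads = 1)

# The printed summaries discussed above
print(rf_model)
print(rg_model)

# randomForest keeps one row of error rates per tree; take the row of the last tree
rf_oob <- rf_model$err.rate[rf_model$ntree, "OOB"]

# ranger stores the overall out-of-bag prediction error and the split rule directly
rg_oob  <- rg_model$prediction.error
rg_rule <- rg_model$splitrule

c(randomForest = unname(rf_oob), ranger = rg_oob)
```

The two error rates will usually be close but not identical, which matches what the printed output shows.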

### Theory

The randomForest package uses Breiman’s Random Forest implementation, while ranger borrows its theory from a wide range of implementations. It’s quite clear from the ranger paper that a lot of methodological choices have been made with speed in mind. The authors wanted the R package to be no slower than the speedy Random Jungle implementation in C++.

Furthermore, speed is a recurring theme throughout the paper describing the package, and the conclusion briefly elaborates on the results.

Okay, let’s run some benchmarks ourselves, shall we?

### Benchmarks

First, let’s benchmark the traditional randomForest() function. The mean processing time is over a second. For a data set of just 1000 rows, that’s quite a lot actually.

microbenchmark(
  randomForest(x = dt[, 1:14], y = dt$income,
               ntree = 500, mtry = 5, type = 'prob'),
  times = 25, unit = 's')

So, let’s go with the ranger() function. We use exactly the same hyperparameters. Moreover, although ranger() has been designed with parallel processing in mind, I set the number of threads to only 1. That’s because I want to show that ranger() is not only faster because of parallel processing, but also because of a more efficient way of processing the data. As you can see, our processing time more than halves.

microbenchmark(
  ranger(dependent.variable.name = 'income', data = dt,
         num.trees = 500, mtry = 5, num.threads = 1),
  times = 25, unit = 's')

Simply by adding more threads (the num.threads parameter), the processing time improves drastically. The following benchmark is produced using three threads.

## Predicting

Both model objects are 4 MB in size, so if there’s a difference in speed, it’s not because of the object size. As you can see from the following microbenchmark, surprisingly, predicting from a randomForest model object is faster than predicting from a ranger model object. If we compare the medians, the difference is almost 30%.

## Side note

In ranger, you can quickly access multiple properties of your predictions. It was somewhat confusing at first, because I needed access to the probabilities and I ran into the following error. Apparently, I was passing the value ‘prob’ to the type parameter, which is not a valid value.

Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, :
  Error: Invalid value for ‘type’. Use ‘response’, ‘se’, ‘terminalNodes’, or ‘quantiles’.

If you need access to the prediction probabilities, you can do that as follows:

predict(ranger_model, dt_full_test, type='response')$predictions

Note that $predictions only holds class probabilities if the forest was grown with probability = TRUE; otherwise it holds the predicted classes.
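To make the probability part of the side note concrete, here is a minimal sketch of how I would get class probabilities out of ranger. It assumes the forest is grown with probability = TRUE; without that argument, $predictions contains predicted classes rather than probabilities. The built-in iris data is a stand-in for the post's data, and prob_model and probs are illustrative names.

```r
library(ranger)

set.seed(42)

# Assumption: iris replaces the Adult sample so the example is self-contained.
# probability = TRUE grows a probability forest instead of a classification forest.
prob_model <- ranger(dependent.variable.name = "Species", data = iris,
                     num.trees = 500, probability = TRUE, num.threads = 1)

# type = "response" is the default; for a probability forest, $predictions
# is a matrix with one row per observation and one column per class
probs <- predict(prob_model, data = iris, type = "response")$predictions

head(probs)
```

Each row of probs sums to 1, so you can pick the column of the class you care about and apply your own threshold.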

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
