
When speed matters: going from randomForest to ranger


Random Forest remains my number one go-to algorithm for quickly prototyping prediction models. Last week, I worked on speeding up a feature engineering and training workflow for a marketing project. I moved from the traditional randomForest package to ranger, a package that is already three years old. Here are my findings.

Let’s load the Adult dataset (AdultUCI) through the arules package. We keep only the rows without any NAs and take a sample of 1000 rows. This small data set is what we will use to benchmark both packages.

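library(arules)
library(randomForest)
library(data.table)
library(microbenchmark)
library(ranger)
data('AdultUCI')
# Fix the seed so the 1000-row sample is reproducible
set.seed(19880303)
dt <- data.table(AdultUCI)
# Keep only the rows without missing values
dt <- dt[complete.cases(dt)]
# Draw a random sample of 1000 rows
dt <- dt[sample(.N, 1000)]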

Training

Output

First, let’s take a look at the output of randomForest(). It prints the number of trees and the number of variables tried at each split. It also shows the out-of-bag error rate.

ranger(), on the other hand, prints more information, for example the split rule.

The out-of-bag prediction error differs between the two. The implementations don’t do exactly the same thing, so this is to be expected.
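To reproduce the two printouts, here is a minimal sketch that fits both models on the sample and prints them. The object names rf_model and oob values are my own naming for illustration, and the hyperparameters are simply the ones used in the benchmarks further down.

```r
library(arules)
library(data.table)
library(randomForest)
library(ranger)

# Same data preparation as above
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# Fit one forest with each package (object names are illustrative)
rf_model <- randomForest(x = dt[, 1:14], y = dt$income,
                         ntree = 500, mtry = 5)
ranger_model <- ranger(dependent.variable.name = 'income', data = dt,
                       num.trees = 500, mtry = 5, num.threads = 1)

# Printing the objects shows the summaries discussed above
print(rf_model)
print(ranger_model)

# Out-of-bag error of each model as a single number
rf_oob <- rf_model$err.rate[rf_model$ntree, 'OOB']  # OOB error after the last tree
ranger_oob <- ranger_model$prediction.error         # OOB misclassification rate
c(randomForest = unname(rf_oob), ranger = ranger_oob)
```

The two numbers will usually be close but not identical, since the trees are grown with different random draws and slightly different rules.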

Theory

The randomForest package is based on Breiman and Cutler’s original Random Forest implementation, while ranger borrows its theory from a wide range of implementations. It’s quite clear from the ranger paper that many methodological choices were made with speed in mind. The authors wanted the R package to be no slower than the speedy Random Jungle implementation in C++:

Wright, M. N. & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17.

Furthermore, speed is a recurring theme throughout the paper describing the package, and its conclusion briefly summarizes the results.


Okay, let’s run some benchmarks ourselves, shall we?

Benchmarks

First, let’s benchmark the traditional randomForest() function. The mean processing time is over a second. For a data set of just 1000 rows, that’s quite a lot actually.

microbenchmark(
  # type = 'prob' is an argument of predict(), not of the training call, so it is left out here
  randomForest(x = dt[, 1:14],
               y = dt$income,
               ntree = 500,
               mtry = 5),
  times = 25, unit = 's')

Next, let’s benchmark the ranger() function with exactly the same hyperparameters. Although ranger() has been designed with parallel processing in mind, I set the number of threads to just 1. I want to show that ranger() is faster not only because of parallel processing, but also because it processes the data more efficiently. As you can see, our processing time more than halves.

microbenchmark(
  ranger(dependent.variable.name = 'income', 
         data = dt, 
         num.trees = 500, 
         mtry = 5, 
         num.threads = 1),
  times = 25, unit = 's')

Simply by adding more threads (the num.threads parameter), the processing time improves drastically. The following benchmark is produced using three threads.
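For reference, a multi-threaded run could look like the sketch below. It times a single-threaded and a three-threaded fit side by side; the labels one_thread and three_threads and the name bench_threads are mine, and the actual gain depends on how many cores your machine has.

```r
library(arules)
library(data.table)
library(microbenchmark)
library(ranger)

# Same data preparation as above
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# Identical hyperparameters; only num.threads differs
bench_threads <- microbenchmark(
  one_thread = ranger(dependent.variable.name = 'income',
                      data = dt,
                      num.trees = 500,
                      mtry = 5,
                      num.threads = 1),
  three_threads = ranger(dependent.variable.name = 'income',
                         data = dt,
                         num.trees = 500,
                         mtry = 5,
                         num.threads = 3),
  times = 25, unit = 's')
print(bench_threads)
```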

Predicting

Both model objects are 4 MB in size, so if there’s a difference in speed, it’s not because of the object size. As you can see from the following microbenchmark, surprisingly, predicting from a randomForest model object is faster than predicting from a ranger model object. If we compare the medians, the difference is almost 30%.
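A comparison along these lines can be set up as sketched below. I am assuming here that predictions are made on the same 1000-row sample the models were trained on, which is only a stand-in for a real test set; rf_model, ranger_model, newdata and bench_predict are illustrative names.

```r
library(arules)
library(data.table)
library(microbenchmark)
library(randomForest)
library(ranger)

# Same data preparation as above
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

rf_model <- randomForest(x = dt[, 1:14], y = dt$income,
                         ntree = 500, mtry = 5)
ranger_model <- ranger(dependent.variable.name = 'income', data = dt,
                       num.trees = 500, mtry = 5, num.threads = 1)

# Size of each fitted model object in memory
print(object.size(rf_model), units = 'MB')
print(object.size(ranger_model), units = 'MB')

# Assumption: the training sample doubles as prediction data in this sketch
newdata <- as.data.frame(dt)

# Time prediction for both models, single-threaded for ranger
bench_predict <- microbenchmark(
  randomForest = predict(rf_model, newdata = newdata[, 1:14]),
  ranger = predict(ranger_model, data = newdata, num.threads = 1),
  times = 25, unit = 's')
print(bench_predict)
```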

Side note

In ranger, you can quickly access multiple properties of your predictions. It was somewhat confusing at first, because I needed access to the probabilities and I ran into the following error. Apparently, I was passing the value ‘prob’ to the type parameter, which is not a valid value.

Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, :
Error: Invalid value for ‘type’. Use ‘response’, ‘se’, ‘terminalNodes’, or ‘quantiles’.

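If you need access to the prediction probabilities, grow the forest as a probability forest by passing probability = TRUE to ranger(). Predicting with type='response' then returns the class probabilities instead of class labels: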

predict(ranger_model, dt_full_test, type='response')$predictions
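Put together, a minimal end-to-end sketch looks like this. The names ranger_prob and probs are mine, and for lack of a separate test set I predict on the training sample itself.

```r
library(arules)
library(data.table)
library(ranger)

# Same data preparation as above
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# probability = TRUE grows a probability forest instead of a classification forest
ranger_prob <- ranger(dependent.variable.name = 'income', data = dt,
                      num.trees = 500, mtry = 5, num.threads = 1,
                      probability = TRUE)

# For a probability forest, $predictions is a matrix with one column per class
probs <- predict(ranger_prob, data = dt, type = 'response')$predictions
head(probs)
```

Each row of probs holds the estimated probabilities for the income classes of one observation, so the rows sum to 1.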
