When speed matters: going from randomForest to ranger

Random forest remains my go-to algorithm for quickly prototyping predictive models. Last week, I worked on speeding up a feature engineering and training workflow for a marketing project, and I moved from the traditional randomForest package to ranger, a package that is already three years old. Here are my findings.

Let’s load the Adult dataset from the arules package. We keep only the rows without any NAs and take a sample of 1000 rows. This small data set is what we will use to benchmark both packages.

library(arules)
library(randomForest)
library(data.table)
library(microbenchmark)
library(ranger)
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

Training

Output

First, let’s take a look at the output of randomForest(). It prints the number of trees and the number of variables it tried at each split. It also shows the out-of-bag error rate.

ranger(), on the other hand, prints more information, for example the split rule.

The out-of-bag prediction error differs. The two implementations don’t do exactly the same thing, so this is to be expected.
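The printed summaries described above can be reproduced with a sketch like the one below. The object names rf_model and ranger_model are placeholders that do not appear in the original post, and the data preparation repeats the setup from the top so the snippet runs on its own.

```r
library(arules)
library(data.table)
library(randomForest)
library(ranger)

# Same 1000-row sample as in the setup at the top of the post
data("AdultUCI")
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)][sample(.N, 1000)]

# Fit both models with the same number of trees and the same mtry
rf_model <- randomForest(x = dt[, 1:14], y = dt$income,
                         ntree = 500, mtry = 5)
ranger_model <- ranger(dependent.variable.name = "income", data = dt,
                       num.trees = 500, mtry = 5)

print(rf_model)      # number of trees, variables tried per split, OOB error
print(ranger_model)  # additionally reports, among other things, the split rule

# The out-of-bag errors can also be read from the fitted objects
rf_oob     <- rf_model$err.rate[rf_model$ntree, "OOB"]
ranger_oob <- ranger_model$prediction.error
```

Reading the two error values from the objects makes it easier to compare them side by side than scanning the printed output.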

Theory

The randomForest package uses Breiman’s Random Forest implementation, while ranger borrows its theory from a wide range of implementations. It’s quite clear from the ranger paper that a lot of methodological choices have been made with speed in mind. The authors wanted the R package to be no slower than the speedy Random Jungle implementation in C++:

Wright, M. N. & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17.

Speed is a recurring theme throughout the paper, and its conclusion briefly elaborates on the results.

Okay, let’s run some benchmarks ourselves, shall we?

Benchmarks

First, let’s benchmark the traditional randomForest() function. The mean processing time is over a second. For a data set of just 1000 rows, that’s quite a lot actually.

microbenchmark(
  randomForest(x = dt[, 1:14],  # the 14 predictor columns
               y = dt$income,   # factor response, so a classification forest
               ntree = 500,
               mtry = 5),
  times = 25, unit = 's')

Now let’s turn to the ranger() function. We use exactly the same hyperparameters. Although ranger() has been designed with parallel processing in mind, I set the number of threads to 1, because I want to show that ranger() is faster not only because of parallel processing, but also because it processes the data more efficiently. As you can see, the processing time more than halves.

microbenchmark(
  ranger(dependent.variable.name = 'income', 
         data = dt, 
         num.trees = 500, 
         mtry = 5, 
         num.threads = 1),
  times = 25, unit = 's')

Simply by adding more threads (the num.threads parameter), the processing time improves drastically; I ran this benchmark with three threads.
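The multi-threaded run can be sketched as follows. This assumes the call is identical to the single-threaded one except for num.threads, and that the machine has at least three cores available; the labels one_thread and three_threads are just names chosen for this example.

```r
library(arules)
library(data.table)
library(microbenchmark)
library(ranger)

# Same 1000-row sample as in the setup at the top of the post
data("AdultUCI")
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)][sample(.N, 1000)]

# Time the same forest with one and with three threads
bench <- microbenchmark(
  one_thread = ranger(dependent.variable.name = "income",
                      data = dt,
                      num.trees = 500,
                      mtry = 5,
                      num.threads = 1),
  three_threads = ranger(dependent.variable.name = "income",
                         data = dt,
                         num.trees = 500,
                         mtry = 5,
                         num.threads = 3),
  times = 25, unit = "s")

print(bench)
```

Putting both variants in one microbenchmark() call keeps the timings in a single table, which makes the medians easy to compare.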

Predicting

Both model objects are 4 MB in size, so if there’s a difference in speed, it’s not because of the object. In my microbenchmark, surprisingly, predicting from a randomForest model object was faster than predicting from a ranger model object. Comparing the medians, the difference is almost 30%.
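A prediction benchmark along these lines could look like the sketch below. The original test set is not shown, so this version simply predicts on the training rows, which is fine for timing purposes only; rf_model, ranger_model and newdata_df are placeholder names.

```r
library(arules)
library(data.table)
library(microbenchmark)
library(randomForest)
library(ranger)

# Same 1000-row sample as in the setup at the top of the post
data("AdultUCI")
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)][sample(.N, 1000)]

rf_model <- randomForest(x = dt[, 1:14], y = dt$income,
                         ntree = 500, mtry = 5)
ranger_model <- ranger(dependent.variable.name = "income", data = dt,
                       num.trees = 500, mtry = 5, num.threads = 1)

# Compare the in-memory size of both fitted objects
format(object.size(rf_model), units = "Mb")
format(object.size(ranger_model), units = "Mb")

# Assumption: the training rows stand in for a separate test set
newdata_df <- as.data.frame(dt)

bench_pred <- microbenchmark(
  randomForest = predict(rf_model, newdata = newdata_df[, 1:14]),
  ranger       = predict(ranger_model, data = newdata_df, num.threads = 1),
  times = 25)

print(bench_pred)
```

Fixing num.threads to 1 in predict() keeps the comparison single-threaded on both sides.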

Side note

In ranger, you can quickly access multiple properties of your predictions. It was somewhat confusing at first, because I needed access to the probabilities and ran into the following error. It turned out I was passing the value ‘prob’ to the type parameter of predict(), which is not a valid value in ranger.

Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, :
Error: Invalid value for ‘type’. Use ‘response’, ‘se’, ‘terminalNodes’, or ‘quantiles’.

If you need access to the prediction probabilities, train the forest with probability = TRUE and then extract them as follows:

predict(ranger_model, dt_full_test, type='response')$predictions
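Here is a minimal end-to-end sketch of that pattern. It uses the built-in iris data as a stand-in so it runs without the Adult setup; prob_model, iris_train and iris_test are hypothetical names playing the roles of ranger_model and dt_full_test above.

```r
library(ranger)

# Simple train/test split on a built-in data set (stand-in for the Adult data)
set.seed(42)
train_idx  <- sample(nrow(iris), 100)
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

# probability = TRUE grows a probability forest instead of a
# plain classification forest
prob_model <- ranger(Species ~ ., data = iris_train,
                     num.trees = 500, probability = TRUE)

# type = "response" is the default; for a probability forest,
# $predictions is a matrix with one column per class
probs <- predict(prob_model, data = iris_test, type = "response")$predictions
head(probs)
```

Without probability = TRUE, the same predict() call returns predicted class labels rather than a probability matrix.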

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

3 thoughts on “When speed matters: going from randomForest to ranger”

Juan July 19, 2022 at 7:32 pm

How does it compare to the package party?
