Random Forest remains my go-to algorithm for quickly prototyping prediction models. Last week, I worked on speeding up a feature engineering and training workflow for a marketing project. I moved from the traditional randomForest package to ranger, a package that is already three years old. Here are my findings.
Let’s load the Adult dataset through the arules package. We keep only the rows without any NAs and take a sample of 1000 rows. This is the small data set we will use to benchmark both packages.
library(arules)
library(randomForest)
library(data.table)
library(microbenchmark)
library(ranger)
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]
Training
Output
First, let’s take a look at the output of randomForest(). It prints the number of trees and the number of variables it tried at each split. It also shows the out-of-bag error rate.

On the other hand, ranger() provides more information. For example, it reports the split rule.

The out-of-bag prediction error differs. The two implementations don’t do exactly the same thing, so this is to be expected.
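To reproduce the two printouts discussed above, something like the following should work. It is a minimal sketch that repeats the data preparation from the top of the post and reuses the hyperparameters from the benchmarks further down. The object names rf_model, ranger_model, oob_rf and oob_ranger are my own choice for illustration.

```r
library(arules)
library(data.table)
library(randomForest)
library(ranger)

# Same data preparation as at the top of the post
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# Fit one forest per package (500 trees, 5 variables tried per split)
rf_model <- randomForest(x = dt[, 1:14],
                         y = dt$income,
                         ntree = 500,
                         mtry = 5)
ranger_model <- ranger(dependent.variable.name = 'income',
                       data = dt,
                       num.trees = 500,
                       mtry = 5,
                       num.threads = 1)

# Printing the objects gives the summaries discussed above
print(rf_model)
print(ranger_model)

# Pull out the out-of-bag errors to compare them side by side:
# randomForest stores one OOB error per tree, so take the last row
oob_rf <- unname(rf_model$err.rate[rf_model$ntree, 'OOB'])
oob_ranger <- ranger_model$prediction.error
print(c(randomForest = oob_rf, ranger = oob_ranger))
```

The exact numbers depend on the seed and on the package versions, so don’t expect them to match the figures in this post to the last digit.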
Theory
The randomForest package uses Breiman’s Random Forest implementation, while ranger borrows its theory from a wide range of implementations. It’s quite clear from the ranger paper that many methodological choices were made with speed in mind. The authors wanted the R package to be no slower than the speedy Random Jungle implementation in C++:

Furthermore, speed is a recurring theme throughout the paper describing the package, and the conclusion quickly elaborates on the results.

Okay, let’s run some benchmarks ourselves, shall we?
Benchmarks
First, let’s benchmark the traditional randomForest() function. The mean processing time is over a second. For a data set of just 1000 rows, that’s quite a lot actually.
microbenchmark(
  randomForest(x = dt[, 1:14],
               y = dt$income,
               ntree = 500,
               mtry = 5,
               type = 'prob'),
  times = 25, unit = 's')

Next, let’s benchmark the ranger() function. We use exactly the same hyperparameters. Moreover, although ranger() has been designed with parallel processing in mind, I set the number of threads to only 1. That’s because I want to show that ranger() is faster not only because of parallel processing, but also because of a more efficient way of processing the data. As you can see, our processing time more than halves.
microbenchmark(
  ranger(dependent.variable.name = 'income',
         data = dt,
         num.trees = 500,
         mtry = 5,
         num.threads = 1),
  times = 25, unit = 's')

Simply by adding more threads (the num.threads parameter), the processing time improves drastically. The following benchmark is produced using three threads.
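A multi-threaded run could be benchmarked along these lines. This is a sketch rather than the exact code behind the numbers in this post: it puts the one-thread and three-thread calls in a single microbenchmark() so they are timed under the same conditions, and it uses fewer repetitions (times = 10) to keep the run short. The labels one_thread and three_threads and the name bench_threads are made up for the example.

```r
library(arules)
library(data.table)
library(microbenchmark)
library(ranger)

# Same data preparation as at the top of the post
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# Identical forests, only the number of threads differs
bench_threads <- microbenchmark(
  one_thread = ranger(dependent.variable.name = 'income',
                      data = dt,
                      num.trees = 500,
                      mtry = 5,
                      num.threads = 1),
  three_threads = ranger(dependent.variable.name = 'income',
                         data = dt,
                         num.trees = 500,
                         mtry = 5,
                         num.threads = 3),
  times = 10, unit = 's')

print(bench_threads)
```

How much the extra threads help depends on the number of cores available and on the size of the data, so it is worth rerunning this on your own machine.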

Predicting
Both model objects are 4 MB in size, so if there’s a difference in speed, it’s not because of the object size. As you can see from the following microbenchmark, surprisingly, predicting from a randomForest model object is faster than predicting from a ranger model object. If we compare the medians, the difference is almost 30%.
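For those who want to rerun the prediction comparison, here is a rough sketch. A few assumptions on my side: the 1000 training rows stand in for a proper test set, the data is converted to a plain data frame before predicting, and the names test_df, rf_model, ranger_model and bench_predict are illustrative only.

```r
library(arules)
library(data.table)
library(microbenchmark)
library(randomForest)
library(ranger)

# Same data preparation as at the top of the post
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

rf_model <- randomForest(x = dt[, 1:14], y = dt$income,
                         ntree = 500, mtry = 5)
ranger_model <- ranger(dependent.variable.name = 'income', data = dt,
                       num.trees = 500, mtry = 5, num.threads = 1)

# Compare the in-memory size of both model objects
print(format(object.size(rf_model), units = 'MB'))
print(format(object.size(ranger_model), units = 'MB'))

# Assumption: reuse the training rows as stand-in test data
test_df <- as.data.frame(dt)

# Time predictions from both models, ranger restricted to one thread
bench_predict <- microbenchmark(
  randomForest = predict(rf_model, test_df[, 1:14]),
  ranger = predict(ranger_model, test_df, num.threads = 1),
  times = 25, unit = 's')

print(bench_predict)
```

Sizes and timings will vary with package versions and hardware, so treat the output as indicative only.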

Side note
In ranger, you can quickly access multiple properties of your predictions. It was somewhat confusing at first, because I needed access to the probabilities and I ran into the following error. Apparently, I was passing the value ‘prob’ to the type parameter, which is not a valid value.
Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, :
Error: Invalid value for ‘type’. Use ‘response’, ‘se’, ‘terminalNodes’, or ‘quantiles’.
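If you need access to the prediction probabilities, grow the forest with probability = TRUE in the ranger() call. The probabilities are then returned in the predictions element, which you can access as follows: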
predict(ranger_model, dt_full_test, type='response')$predictions
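To make the probability case concrete, here is a small sketch. It assumes the forest is grown with probability = TRUE, which tells ranger() to build a probability forest. Without that argument, type = 'response' returns class labels for a classification forest. For comparison, predict() on a randomForest object does accept type = 'prob', which is where my mix-up came from. The names prob_model and probs are illustrative, and the training rows are reused as stand-in test data.

```r
library(arules)
library(data.table)
library(ranger)

# Same data preparation as at the top of the post
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# Grow a probability forest instead of a plain classification forest
prob_model <- ranger(dependent.variable.name = 'income',
                     data = dt,
                     num.trees = 500,
                     mtry = 5,
                     probability = TRUE,
                     num.threads = 1)

# One row per observation, one column per class of income
probs <- predict(prob_model, dt, type = 'response')$predictions
print(head(probs))
```

Each row of probs holds the estimated class probabilities for one observation, so the rows sum to one.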
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
How does it compare to the package party?