Random Forest remains my number one go-to algorithm for quickly prototyping predictive models. Last week, I worked on speeding up a feature engineering and training workflow for a marketing project. I moved from the traditional randomForest package to ranger, a package that is already three years old. Here are my findings.
library(arules)
library(randomForest)
library(data.table)
library(microbenchmark)
library(ranger)

data('AdultUCI')
set.seed(19880303)

dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]
First, let’s take a look at the output of randomForest(). It prints the number of trees and the number of variables it tried at each split. It also shows the out-of-bag error rate.
On the other hand, ranger() provides more information, for example the split rule.
The out-of-bag prediction error differs. The two implementations don’t do exactly the same thing, so this is to be expected.
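To make the comparison concrete, here is a minimal sketch that fits both models on the same sample and pulls out the out-of-bag error of each. The object names rf_model, ranger_model, oob_rf and oob_ranger are my own choices for illustration, and I am assuming the same hyperparameters as in the benchmarks further down.

```r
library(arules)        # provides the AdultUCI data set
library(data.table)
library(randomForest)
library(ranger)

# Same setup as at the top of the post
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# Fit both forests with the same number of trees and the same mtry
rf_model <- randomForest(x = dt[, 1:14], y = dt$income,
                         ntree = 500, mtry = 5)
ranger_model <- ranger(dependent.variable.name = 'income', data = dt,
                       num.trees = 500, mtry = 5, num.threads = 1)

# Printing the objects gives the summaries discussed above
print(rf_model)
print(ranger_model)

# Out-of-bag error after the last tree (randomForest) and overall (ranger)
oob_rf <- rf_model$err.rate[rf_model$ntree, 'OOB']
oob_ranger <- ranger_model$prediction.error
c(randomForest = unname(oob_rf), ranger = oob_ranger)
```

The two numbers will usually be close but not identical, since the implementations draw their bootstrap samples and candidate variables differently.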
The randomForest package uses Breiman’s Random Forest implementation, while ranger borrows its theory from a wide range of implementations. It’s quite clear from the ranger paper that a lot of methodological choices have been made with speed in mind. They wanted the R package to be no slower than the speedy Random Jungle implementation in C++.
Furthermore, speed is a recurring theme throughout the paper describing the package, and the conclusion quickly elaborates on the results.
Okay, let’s run some benchmarks ourselves, shall we?
First, let’s benchmark the traditional randomForest() function. The mean processing time is over a second. For a data set of just 1000 rows, that’s quite a lot actually.
microbenchmark(
  randomForest(x = dt[,1:14], y = dt$income, ntree = 500, mtry = 5, type='prob'),
  times = 25,
  unit = 's')
So, let’s go with the ranger() function. We use exactly the same hyperparameters. Moreover, although ranger() has been designed with parallel processing in mind, I set the number of threads to only 1. That’s because I want to prove that ranger() is not only faster because of parallel processing, but also because of a more efficient way of processing the data. As you can see, our processing time more than halves.
microbenchmark(
  ranger(dependent.variable.name = 'income', data = dt, num.trees = 500, mtry = 5, num.threads = 1),
  times = 25,
  unit = 's')
Simply by adding more threads (the num.threads parameter), the processing time improves drastically. The following benchmark is produced using three threads.
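A sketch of how such a comparison could be set up: the same ranger() call, once with one thread and once with three. The labels one_thread and three_threads and the name bm_threads are mine, and the actual timings will depend on your machine, so I'm not quoting any numbers here.

```r
library(arules)          # provides the AdultUCI data set
library(data.table)
library(microbenchmark)
library(ranger)

# Same setup as at the top of the post
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# Identical hyperparameters; only num.threads changes between the two runs
bm_threads <- microbenchmark(
  one_thread = ranger(dependent.variable.name = 'income', data = dt,
                      num.trees = 500, mtry = 5, num.threads = 1),
  three_threads = ranger(dependent.variable.name = 'income', data = dt,
                         num.trees = 500, mtry = 5, num.threads = 3),
  times = 25,
  unit = 's')

print(bm_threads)
```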
Both model objects are 4 MB in size, so if there’s a difference in speed, it’s not because of the object size. As you can see from the following microbenchmark, predicting from a randomForest model object is, surprisingly, faster than predicting from a ranger model object. If we compare the medians, the difference is almost 30%.
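Here is a rough sketch of how the object sizes and the prediction times could be measured. Note the assumptions: I'm predicting on the training sample itself, purely for timing, and I convert it to a plain data frame first so both predict() methods receive the same input. The names test_df, x_test and bm_predict are hypothetical.

```r
library(arules)          # provides the AdultUCI data set
library(data.table)
library(microbenchmark)
library(randomForest)
library(ranger)

# Same setup as at the top of the post
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

rf_model <- randomForest(x = dt[, 1:14], y = dt$income,
                         ntree = 500, mtry = 5)
ranger_model <- ranger(dependent.variable.name = 'income', data = dt,
                       num.trees = 500, mtry = 5, num.threads = 1)

# In-memory size of each fitted model
format(object.size(rf_model), units = 'Mb')
format(object.size(ranger_model), units = 'Mb')

# Assumption: reuse the training rows as "new" data, only to time predict()
test_df <- as.data.frame(dt)
x_test <- test_df[, 1:14]

bm_predict <- microbenchmark(
  randomForest = predict(rf_model, x_test),
  ranger = predict(ranger_model, data = test_df, num.threads = 1),
  times = 25)

print(bm_predict)
```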
In ranger, you can quickly access multiple properties of your predictions. It was somewhat confusing at first, because I needed access to the probabilities and I ran into the following error. Apparently, I was passing the value ‘prob’ to the type parameter, which is not a valid value.
Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, :
Error: Invalid value for ‘type’. Use ‘response’, ‘se’, ‘terminalNodes’, or ‘quantiles’.
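If you need access to the prediction probabilities, grow the forest with probability = TRUE. Calling predict() with type='response' then returns the class probabilities in the predictions element: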
predict(ranger_model, dt_full_test, type='response')$predictions
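For completeness, a minimal end-to-end sketch of getting class probabilities out of ranger. It assumes the forest is grown with probability = TRUE, and it predicts on the training sample only because the test set from my project isn't part of this post. The names ranger_prob and probs are made up for this example.

```r
library(arules)        # provides the AdultUCI data set
library(data.table)
library(ranger)

# Same setup as at the top of the post
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]

# probability = TRUE grows a probability forest instead of a classification forest
ranger_prob <- ranger(dependent.variable.name = 'income', data = dt,
                      num.trees = 500, mtry = 5,
                      probability = TRUE, num.threads = 1)

# $predictions is now a matrix with one column per class of income
probs <- predict(ranger_prob, data = dt, type = 'response')$predictions
head(probs)
```

Without probability = TRUE, the same predict() call returns the predicted class labels instead of a probability matrix.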
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!