**Random Forest remains my go-to algorithm for quickly prototyping predictive models. Last week, I worked on speeding up a feature engineering and training workflow for a marketing project. I moved from the traditional randomForest package to ranger, a package that is already three years old. Here are my findings.**

Let’s load the Adult dataset from the arules package. We keep only the rows without any NAs and draw a sample of 1000 rows. This small data set is what we will use to benchmark both packages.

```
library(arules)
library(randomForest)
library(data.table)
library(microbenchmark)
library(ranger)
data('AdultUCI')
set.seed(19880303)
dt <- data.table(AdultUCI)
dt <- dt[complete.cases(dt)]
dt <- dt[sample(.N, 1000)]
```

## Training

### Output

First, let’s take a look at the output of *randomForest()*. It prints the number of trees and the number of variables it tried at each split. It also shows the out-of-bag error rate.

The output of *ranger()*, on the other hand, provides more information, for example the split rule.

The out-of-bag prediction error differs between the two. The implementations don’t do exactly the same thing, so this is to be expected.
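The printed summaries are not reproduced here, so here is a minimal sketch of how to generate them yourself. It rebuilds the 1000-row sample as a plain data frame (the names `adult`, `rf_model` and `ranger_model` are mine, chosen for illustration) and uses the same hyperparameters as the benchmarks further down. Your exact error rates will depend on package versions and random seeds.

```r
library(arules)
library(randomForest)
library(ranger)

# Rebuild the 1000-row sample of complete cases as a plain data frame
data("AdultUCI")
set.seed(19880303)
adult <- AdultUCI[complete.cases(AdultUCI), ]
adult <- adult[sample(nrow(adult), 1000), ]

# Fit one forest with each package, using identical hyperparameters
rf_model <- randomForest(x = adult[, 1:14],
                         y = adult$income,
                         ntree = 500,
                         mtry = 5)
ranger_model <- ranger(dependent.variable.name = "income",
                       data = adult,
                       num.trees = 500,
                       mtry = 5,
                       num.threads = 1)

# Printing the model objects shows the summaries discussed above
print(rf_model)
print(ranger_model)

# The out-of-bag error can also be pulled out directly for comparison
rf_oob <- rf_model$err.rate[rf_model$ntree, "OOB"]
ranger_oob <- ranger_model$prediction.error
c(randomForest = unname(rf_oob), ranger = ranger_oob)
```

Both values are misclassification rates between 0 and 1, so they can be compared side by side.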

### Theory

The *randomForest* package uses Breiman’s Random Forest implementation, while *ranger* borrows its theory from a wide range of implementations. It’s quite clear from the *ranger* paper that a lot of methodological choices have been made with speed in mind. The authors wanted the R package to be no slower than the speedy Random Jungle implementation in C++.

Furthermore, speed is a recurring theme throughout the paper describing the package, and the conclusion quickly elaborates on the results.

Okay, let’s run some benchmarks ourselves, shall we?

### Benchmarks

First, let’s benchmark the traditional *randomForest()* function. The mean processing time is over a second. For a data set of just 1000 rows, that’s actually quite a lot.

```
microbenchmark(
  randomForest(x = dt[, 1:14],
               y = dt$income,
               ntree = 500,
               mtry = 5,
               type = 'prob'),
  times = 25, unit = 's')
```

Now, let’s move on to the *ranger()* function. We use exactly the same hyperparameters. Moreover, although *ranger()* has been designed with parallel processing in mind, I set the number of threads to only 1. That’s because I want to show that *ranger()* is faster not only because of parallel processing, but also because of a more efficient way of processing the data. As you can see, our processing time more than halves.

```
microbenchmark(
  ranger(dependent.variable.name = 'income',
         data = dt,
         num.trees = 500,
         mtry = 5,
         num.threads = 1),
  times = 25, unit = 's')
```

Simply by adding more threads (the *num.threads* parameter), the processing time improves drastically. The following benchmark is produced using three threads.
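A sketch of what such a multi-threaded benchmark could look like is shown here. It compares one thread against three in a single *microbenchmark()* call. The data frame `adult` and the labels `one_thread` and `three_threads` are my own names for illustration, and the speed-up you see will depend on how many cores your machine has.

```r
library(arules)
library(ranger)
library(microbenchmark)

# Rebuild the 1000-row sample of complete cases as a plain data frame
data("AdultUCI")
set.seed(19880303)
adult <- AdultUCI[complete.cases(AdultUCI), ]
adult <- adult[sample(nrow(adult), 1000), ]

# Same hyperparameters in both runs; only num.threads differs
bench_threads <- microbenchmark(
  one_thread = ranger(dependent.variable.name = "income",
                      data = adult,
                      num.trees = 500,
                      mtry = 5,
                      num.threads = 1),
  three_threads = ranger(dependent.variable.name = "income",
                         data = adult,
                         num.trees = 500,
                         mtry = 5,
                         num.threads = 3),
  times = 25, unit = "s")

print(bench_threads)
```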

## Predicting

Both model objects are 4 MB in size, so if there’s a difference in speed, it’s not because of the object size. As you can see from the following microbenchmark, surprisingly, predicting from a *randomForest* model object is faster than predicting from a *ranger* model object. If we compare the medians, the difference is almost 30%.
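The comparison can be sketched roughly as follows. This is an illustration under my own assumptions: I predict on the same 1000-row sample the models were trained on, and `adult`, `rf_model` and `ranger_model` are names I picked for this sketch. Object sizes and timings on your machine may differ from the figures quoted above.

```r
library(arules)
library(randomForest)
library(ranger)
library(microbenchmark)

# Rebuild the 1000-row sample of complete cases as a plain data frame
data("AdultUCI")
set.seed(19880303)
adult <- AdultUCI[complete.cases(AdultUCI), ]
adult <- adult[sample(nrow(adult), 1000), ]

rf_model <- randomForest(x = adult[, 1:14], y = adult$income,
                         ntree = 500, mtry = 5)
ranger_model <- ranger(dependent.variable.name = "income", data = adult,
                       num.trees = 500, mtry = 5, num.threads = 1)

# Compare the in-memory size of both model objects
format(object.size(rf_model), units = "MB")
format(object.size(ranger_model), units = "MB")

# Benchmark prediction, keeping ranger on a single thread for a fair comparison
bench_predict <- microbenchmark(
  randomForest = predict(rf_model, newdata = adult[, 1:14]),
  ranger = predict(ranger_model, data = adult, num.threads = 1),
  times = 25)

print(bench_predict)
```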

## Side note

In ranger, you can quickly access multiple properties of your predictions. It was somewhat confusing at first, because I needed access to the probabilities and I ran into the following error. Apparently, I was passing the value ‘prob’ to the *type* parameter, which is not a valid value.

Error in predict.ranger.forest(forest, data, predict.all, num.trees, type, :

Error: Invalid value for ‘type’. Use ‘response’, ‘se’, ‘terminalNodes’, or ‘quantiles’.

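If you need access to the prediction probabilities, train the forest with `probability = TRUE`. With such a model, `type='response'` returns class probabilities instead of class labels: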

`predict(ranger_model, dt_full_test, type='response')$predictions`
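To make this concrete, here is a small end-to-end sketch. It assumes the forest is grown with `probability = TRUE`, which is what makes *ranger* return probabilities rather than class labels. The names `adult`, `prob_model` and `probs` are mine, and I predict on the training sample only to keep the example short.

```r
library(arules)
library(ranger)

# Rebuild the 1000-row sample of complete cases as a plain data frame
data("AdultUCI")
set.seed(19880303)
adult <- AdultUCI[complete.cases(AdultUCI), ]
adult <- adult[sample(nrow(adult), 1000), ]

# Grow a probability forest instead of a plain classification forest
prob_model <- ranger(dependent.variable.name = "income",
                     data = adult,
                     num.trees = 500,
                     mtry = 5,
                     probability = TRUE,
                     num.threads = 1)

# $predictions is now a matrix with one column per income class
probs <- predict(prob_model, data = adult, type = "response")$predictions
head(probs)
```

Each row of `probs` sums to 1, with one column per level of `income`.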