Dealing with right-censored data in machine learning: Random Survival Forests

A couple of weeks ago, I started working with survival analysis. It was fairly new to me, so I had to dig into some new methods. There was one method that captured my attention: random survival forests (RSFs). It’s one of many statistical learning techniques designed to work with right-censored survival data. In this blog post I present a condensed primer on RSFs and how you can use them in R.

Although I explain all concepts or link to adequate documentation, this blog post will be more meaningful if you have prior knowledge of survival analysis, probability and tree-based machine learning methods.

The theory behind Random Survival Forests

Random forests introduce two sources of randomness into decision tree methods (CART) to counter the greediness inherent in single trees. First, each decision tree is built on its own bootstrapped sample of the training data. Second, at each split only a random sample of predictors is considered as split candidates, instead of the full set of predictors. Because only a subset of predictors is available at each split, predictors other than the strongest one get a chance to enter the model, and the trees become decorrelated.
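To make the two kinds of randomness concrete, here is a minimal base R sketch for a single tree. The toy data size, the predictor names and the value of mtry are assumptions chosen purely for illustration; a real forest repeats this for every tree and every split.

```r
set.seed(1)

# Toy setup: 10 observations and 5 predictors (names are illustrative only)
n <- 10
predictors <- c("x1", "x2", "x3", "x4", "x5")

# 1) Bootstrap: draw n row indices with replacement for one tree
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)

# Rows that were never drawn are "out-of-bag" for this tree
oob_idx <- setdiff(seq_len(n), boot_idx)

# 2) Predictor subsampling: at a given split, only mtry predictors are candidates
mtry <- 2
split_candidates <- sample(predictors, size = mtry)

print(boot_idx)
print(oob_idx)
print(split_candidates)
```

The out-of-bag rows are the ones a forest can later use to estimate prediction error without a separate validation set.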

Although an off-the-shelf survival analysis is possible within a CART paradigm, Ishwaran et al. developed Random Survival Forests, which take into account both survival time and censoring status.

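These five steps are at the core of random survival forests (RSFs), following Ishwaran et al. (2008):

1. Draw B bootstrap samples from the original data. Each bootstrap sample leaves out about 37% of the cases on average; these are the out-of-bag (OOB) data.
2. Grow a survival tree for each bootstrap sample. At each node, randomly select a subset of candidate predictors and split on the predictor that maximizes the survival difference between the daughter nodes.
3. Grow each tree to full size, under the constraint that a terminal node contains at least a minimum number of unique deaths.
4. Calculate a cumulative hazard function (CHF) for each tree and average over the trees to obtain the ensemble CHF.
5. Using the OOB data, calculate the prediction error of the ensemble CHF.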

An essential element of RSFs is the cumulative hazard function (CHF). The hazard function is the instantaneous rate of failure at time t given survival until time t, and the CHF is its integral up to time t. In RSFs, the CHF of each terminal node h is estimated with the Nelson-Aalen estimator. Here’s how it works: one CHF is constructed for each tree, grown on its own bootstrapped sample. The ensemble CHF is the average of all these CHFs; its out-of-bag version averages, for each case, only over the trees whose bootstrap sample did not contain that case. Given the (ensemble) CHF, the (ensemble) mortality is estimated as the value of the CHF summed over the unique event times.
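As a rough illustration of the Nelson-Aalen estimator, the sketch below computes it by hand for the cases in one hypothetical terminal node. The toy times and censoring indicators are made up, and the helper name nelson_aalen is my own, not part of any package.

```r
# Toy right-censored sample in one terminal node (values are made up)
time   <- c(2, 3, 3, 5, 8, 9)
status <- c(1, 1, 0, 1, 0, 1)   # 1 = event observed, 0 = censored

# Hypothetical helper: Nelson-Aalen estimate of the cumulative hazard
nelson_aalen <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  # d: number of events at each event time
  d <- sapply(event_times, function(t) sum(time == t & status == 1))
  # y: number of cases still at risk just before each event time
  y <- sapply(event_times, function(t) sum(time >= t))
  data.frame(time = event_times, chf = cumsum(d / y))
}

na_est <- nelson_aalen(time, status)
print(na_est)

# Mortality in the sense used above: the CHF summed over the event times
mortality <- sum(na_est$chf)
print(mortality)
```

In a forest, this calculation happens in every terminal node of every tree, and the per-tree CHFs of a case are then averaged.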

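To estimate the prediction error of a model, Ishwaran et al. use Harrell’s concordance index (C-index): the proportion of comparable pairs of cases in which the case with the worse predicted outcome is also the one that fails first. The prediction error is 1 minus the C-index. The C-index does not depend on a single fixed evaluation time and it accounts for censoring.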
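The idea behind the C-index can be sketched in a few lines of base R. This is a simplified version that assumes there are no tied survival times; the function name c_index and the toy data are illustrative assumptions, and it is not the exact procedure from the paper.

```r
# Simplified Harrell's C-index, assuming no tied survival times.
# risk: higher values mean a worse predicted outcome (e.g. ensemble mortality)
c_index <- function(time, status, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable when case i has an observed event before time j
      if (status[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5   # tied predictions count half
        }
      }
    }
  }
  concordant / comparable
}

# Toy data (made up): the second case is censored at time 4
time   <- c(2, 4, 6, 8)
status <- c(1, 0, 1, 1)
risk   <- c(0.9, 0.2, 0.1, 0.7)

c_val <- c_index(time, status, risk)
print(c_val)        # concordance
print(1 - c_val)    # prediction error
```

Censored cases still contribute: the censored case at time 4 can be compared with the case that failed at time 2, because we know it survived longer.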

In the following example, we train a model using Random Survival Forests from the ranger package, which is an implementation of Ishwaran et al.’s paper from 2008.

Putting Random Survival Forests to work in R

First, we load all the required packages. We get our veteran dataset from the survival package. We use data.table as our general data framework. The ranger package will be used to train the RSF model. Finally, caret is used to make a confusion matrix.

rm(list=ls())

library(survival)
library(data.table)
library(ranger)
library(caret)

set.seed(19880303)

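# veteran ships with the survival package and is lazy-loaded,
# so it can be referenced directly without a data() call
dt <- data.table(survival::veteran)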

Next, we split the data into a training set and a test set.

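# Sample 70% of the row indices for training
ind <- sample(seq_len(nrow(dt)), round(nrow(dt) * 0.7))

dt_train <- dt[ind, ]
# In data.table, !ind (like -ind) selects all rows except those in ind
dt_test <- dt[!ind, ]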

Next, we use the ranger package to train the model. I also plot the predicted survival curves for two training cases (row 20 and row 21).

# Ranger
r_fit <- ranger(Surv(time, status) ~ .,
                data = dt_train,
                mtry = 3,
                verbose = TRUE,
                write.forest=TRUE,
                num.trees= 1000,
                importance = 'permutation')

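# Predicted survival curves for training rows 20 (red) and 21 (blue)
plot(r_fit$unique.death.times, r_fit$survival[20, ], type = 'l', col = 'red',
     ylim = c(0, 1), xlab = 'Time (days)', ylab = 'Survival probability')
lines(r_fit$unique.death.times, r_fit$survival[21, ], col = 'blue')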
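The ranger call above requests permutation importance, so the fitted object also carries a variable importance vector and an out-of-bag prediction error. The sketch below shows how these could be inspected. To keep it self-contained it refits a smaller forest on the full veteran data, which is my own simplification and not the model trained above.

```r
library(survival)
library(ranger)

set.seed(19880303)

# Smaller forest on the full veteran data, for illustration only
fit <- ranger(Surv(time, status) ~ .,
              data = survival::veteran,
              num.trees = 200,
              importance = "permutation")

# Out-of-bag prediction error; for survival forests, ranger reports
# 1 minus Harrell's C-index here
oob_error <- fit$prediction.error
print(oob_error)

# Permutation importance of each predictor, largest first
var_imp <- sort(fit$variable.importance, decreasing = TRUE)
print(var_imp)
```

Predictors with an importance close to zero (or below zero) contribute little to the out-of-bag concordance of the forest.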

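In the following chunk of code, I calculate the accuracy of the model when it has to predict survival beyond 61 days (the cutoff is purely arbitrary and for demonstration purposes). Keep in mind that this is a simplification: it requires 61 to be one of the unique death times in the training data (they become the column names of preds), and it counts cases that were censored before day 61 as not surviving.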

preds <- predict(r_fit, dt_test, type = 'response')$survival
preds <- data.table(preds)
colnames(preds) <- as.character(r_fit$unique.death.times)

prediction <- preds$`61` > 0.5
real <- dt_test$time >= 61

caret::confusionMatrix(as.factor(prediction), as.factor(real), positive = 'TRUE')
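Looking up the column named `61` only works when 61 is one of the unique death times. A slightly more robust option is to treat each predicted curve as a step function and take its value at the last death time at or before the time of interest. The sketch below uses made-up stand-ins for r_fit$unique.death.times and the predicted survival matrix, and surv_at is a hypothetical helper of my own.

```r
# Made-up stand-ins for r_fit$unique.death.times and a predicted survival matrix
death_times <- c(10, 30, 59, 63, 90)
surv_mat <- rbind(
  c(0.95, 0.80, 0.60, 0.45, 0.20),
  c(0.99, 0.90, 0.85, 0.80, 0.55)
)

# Hypothetical helper: predicted survival probability at an arbitrary time t
surv_at <- function(surv_mat, death_times, t) {
  # Index of the last death time <= t; 0 means t precedes every death time
  idx <- findInterval(t, death_times)
  if (idx == 0) {
    return(rep(1, nrow(surv_mat)))   # nobody has failed yet
  }
  surv_mat[, idx]
}

s61 <- surv_at(surv_mat, death_times, 61)
print(s61)
print(s61 > 0.5)
```

With the real model, the same helper could be called with the matrix returned by predict() and r_fit$unique.death.times.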

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

I hope you learned something!


3 thoughts on “Dealing with right-censored data in machine learning: Random Survival Forests”

Getahun Mulugeta October 4, 2022 at 3:19 pm

To predict graft survival, I want to compare random survival forest with cox models (univariate- and lasso-based) for rare event survival analysis. Which is better for my case, ranger or RandomForestSRC? Possibly in terms of relaxing the rule to calculate all metrics for right-censored data.

