Classification in R: Confusion Matrix

One of the most fundamental tools for evaluating classifiers is the confusion matrix. In this blog post, I show how to generate a confusion matrix in R using the tidymodels ecosystem.

First, I create a classifier to predict the income class for the very popular adult data set. From loading to preprocessing to modeling, I use data.table, tidymodels and ranger.

If you’re not interested in how I got to the point of evaluating a model, and you’re only interested in producing a confusion matrix, don’t hesitate to skip the model-building code by scrolling down.

library(data.table)  # fast loading and preprocessing
library(tidymodels)  # rsample, parsnip, yardstick, dplyr, ...
library(ranger)      # random forest engine

# The raw file has no header row, so the column names are set manually
df <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')

setnames(df, c('age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
               'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
               'hours-per-week', 'native-country','y'))

# Encode the target as a factor: 0 for "<=50K", 1 otherwise
df[, target := factor(as.integer(y != "<=50K"))]
df[, y := NULL]

# Fix the seed (any value works) so the random split is reproducible
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)

# Random forest with 100 trees, fitted on a subset of the features
model <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("ranger") %>%
  fit(target ~ age + fnlwgt + `education-num` + race + sex +
        `capital-gain` + `capital-loss` + `hours-per-week`,
      data = df_train)

What’s next? We’d like to use our model to predict the target on the test data and compare those predictions to the ground truth value, wouldn’t we?

If we want to continue using the tidymodels ecosystem, we can rely on the conf_mat function from yardstick, which is loaded by the tidymodels package. The function takes three arguments, in this order:

the data frame;
the ground truth column;
the prediction column
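
To make those three arguments concrete, here is a minimal sketch on a tiny hand-made data frame. The object name toy and its columns truth and pred are made up for illustration and are not part of the adult data set. Note that conf_mat expects both columns to be factors.

```r
library(yardstick)

# Hypothetical toy data: both columns are factors with the same levels
toy <- data.frame(
  truth = factor(c(0, 0, 0, 1, 1, 1, 0, 1), levels = c(0, 1)),
  pred  = factor(c(0, 0, 1, 1, 1, 0, 0, 1), levels = c(0, 1))
)

# Arguments in order: data frame, ground truth column, prediction column
toy_cm <- conf_mat(toy, truth = truth, estimate = pred)
toy_cm

# The counts are stored in $table: rows are predictions, columns are truth
toy_cm$table
```

In this toy example, three observations of each class are predicted correctly and one of each class is misclassified.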

Getting the predictions and the ground truth in the same table is easy if you follow tidymodels’ guidelines. By piping the predictions into dplyr’s bind_cols, you’re making life easier for yourself, because you’ll end up with a data frame that is in the right format to feed to the conf_mat function.

In the code chunk below, we create the “pred_test” object, which is a data frame that contains the predictions (in the .pred_class column) next to all columns of the test set, including the target. Finally, we feed that object to conf_mat().
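
If it helps to see what bind_cols does here, this is a small sketch with made-up data. The toy_pred tibble is a hypothetical stand-in for what predict() returns for a classification model (a single .pred_class column), and toy_test stands in for the test set.

```r
library(dplyr)

# Hypothetical stand-in for the output of predict(model, df_test)
toy_pred <- tibble(
  .pred_class = factor(c("0", "1", "1"), levels = c("0", "1"))
)

# Hypothetical stand-in for the test set
toy_test <- tibble(
  age    = c(25, 47, 52),
  target = factor(c("0", "1", "0"), levels = c("0", "1"))
)

# bind_cols glues the columns together side by side, row by row
toy_joined <- bind_cols(toy_pred, toy_test)
names(toy_joined)
```

Keep in mind that bind_cols matches rows by position, not by a key, so the predictions must be in the same row order as the data they were made on.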

pred_test <- model %>% predict(df_test) %>% bind_cols(df_test)
conf_mat(pred_test, target, .pred_class)

And this is the result: a 2×2 table of counts, with the predicted classes as rows and the true classes as columns.

FYI: you can modify the labels (Prediction, Truth) using the dnn argument of conf_mat().
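
Here is a quick sketch of the dnn argument on a hand-made data frame. The names toy, truth and pred, as well as the labels "Predicted" and "Actual", are just examples I picked for illustration.

```r
library(yardstick)

# Hypothetical toy data: both columns are factors with the same levels
toy <- data.frame(
  truth = factor(c(0, 0, 0, 1, 1, 1, 0, 1), levels = c(0, 1)),
  pred  = factor(c(0, 0, 1, 1, 1, 0, 0, 1), levels = c(0, 1))
)

# dnn takes two labels: first the one for the rows (predictions),
# then the one for the columns (ground truth)
toy_cm <- conf_mat(toy, truth, pred, dnn = c("Predicted", "Actual"))
toy_cm

# The labels end up as the dimension names of the underlying table
names(dimnames(toy_cm$table))
```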

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!


