Creating and managing a list of dataframes in R

Why do people put data in a list in the first place? Because it can be really darn handy. In this blog post I elaborate on some good use cases for putting data frames in a list.

Loading a lot of files

In many situations you will be confronted with a lot of flat files that contain the same kind of data, but for a different period or a different department. In the following two examples, we load multiple CSV files. Both examples use data.table. And so should you.

In the first example, I use a simple for loop to go over all the files. R has no built-in equivalent of Python’s enumerate(), so I use i as the iteration counter. In every iteration I load the contents of one CSV file using fread() and, at the same time, add an extra column that contains the filename. The result is inserted as a list item into df_list. Finally, I use rbindlist() to put all the data frames into one big data frame.

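library(data.table)

# List the CSV files once, instead of calling list.files() in every iteration
files <- list.files(pattern = "\\.csv$")
df_list <- list()

# seq_along() also behaves correctly when no files are found
for (i in seq_along(files)) {
  # Read the file and add its name as an extra column
  df_list[[i]] <- fread(files[i])[, FILE := files[i]]
}

df <- rbindlist(df_list)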
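As a variation on the loop above, here is a minimal sketch that reads the files with lapply() and lets rbindlist() add the filename column through its idcol argument. The temporary directory, the file names jan.csv and feb.csv and the columns ID and VALUE are made up for the demonstration; they are not part of the original example.

```r
library(data.table)

# Hypothetical demo data: two small CSV files in a temporary directory
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(ID = 1:2, VALUE = c(10, 20)), file.path(demo_dir, "jan.csv"))
fwrite(data.table(ID = 3:4, VALUE = c(30, 40)), file.path(demo_dir, "feb.csv"))

# List only the CSV files, keeping the full path so fread() can find them
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list and name each element after its file
df_list <- lapply(files, fread)
names(df_list) <- basename(files)

# idcol = "FILE" turns the list names into a FILE column while binding
df <- rbindlist(df_list, idcol = "FILE")
```

Naming the list elements has a side benefit: you can still pick out a single file later with df_list[["jan.csv"]].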

You can achieve exactly the same by using a recursive function. It’s somewhat longer, but it’s not a boring for-loop. In the following example I create a function that keeps calling itself until all files have been loaded.

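library(data.table)

# List the CSV files once and pass them along
files <- list.files(pattern = "\\.csv$")

load_csvs <- function(files, dfl = list(), i = 1) {
  if (i <= length(files)) {
    # Read file i, add its name as a column, then recurse to the next file
    dfl[[i]] <- fread(files[i])[, FILE := files[i]]
    load_csvs(files, dfl, i + 1)
  } else {
    # All files loaded: return the accumulated list
    dfl
  }
}

# Keep the list around, so it can be reused after binding
df_list <- load_csvs(files)
df <- rbindlist(df_list)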

Operations on multiple data frames

While having all your data frames together in one big data frame is handy, you might want to keep them separate. Even in that situation, you can apply an operation to all of them in one go. Using the lapply function and data.table syntax, I create a new column in all the data frames in my list variable df_list.

lapply(df_list, function(x) { x[,NEW_COLUMN := FIRST_COLUMN + 1]})

You can achieve exactly the same using purrr’s map function.

library(purrr)
map(df_list, function(x) { x[, NEW_COLUMN := FIRST_COLUMN + 1] })
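To try the lapply() one-liner without any files on disk, here is a small sketch on a made-up list of two data.tables. The list contents and the element names a and b are assumptions for illustration; only NEW_COLUMN and FIRST_COLUMN come from the example above.

```r
library(data.table)

# Hypothetical list of two small data.tables
df_list <- list(
  a = data.table(FIRST_COLUMN = 1:3),
  b = data.table(FIRST_COLUMN = c(10L, 20L))
)

# := adds the column inside each table; assigning the result back to
# df_list keeps the list up to date and avoids printing every table
df_list <- lapply(df_list, function(x) x[, NEW_COLUMN := FIRST_COLUMN + 1])

# Every table in the list now has both columns
lapply(df_list, names)
```

Because the list is assigned back, the element names a and b are preserved, which makes it easy to inspect a single table afterwards with df_list$a.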

Great success!
