Home » Another way of merging tables with data.table in R

Another way of merging tables with data.table in R

  • by
  • 3 min read

Merging a data.table in R is more or less the same as merging a regular data.frame. However, there is one difference. There’s also is a shorthand way to merge two tables that feels more data.table-esque. In this blog post I elaborate on merging tables using R’s data.table library.

If you’re merging data frames, R will evoke merge.data.frame(). If you’re merging data.tables, the function merge.data.table() is invoked. On a side note, if you’re merging a combination, it depends on what you put first.

merge(dt,df) # will use merge.data.table and return a data.table
merge(df,dt) # will use merge.data.frame and return a data.frame

To proceed with this blog post, I would like to create some example data. In this example we have users and functions. Each user has a wage and a function_id, which refers to its function title and bonus at the end of the years.

library(data.table)

user_function_id <- c('abc123','abc124','abc124','abc126','abc127','abc128','abc129')
user_name <- c('John Doe','Jane Doe','Jim Doe','Jill Doe','Jack Doe','Jessy Doe','Joe Doe')
user_wage <- c(1000,1500,1800,1750,2000,2000, 2200)

function_id <- c('abc121','abc122','abc123','abc124','abc125','abc126','abc127')
function_name <- c('Internal Sales Manager',
                   'Account Manager','Sales Manager',
                   'Marketing Manager','Business Development Manager',
                   'Key Account Manager', 'IT Manager')
function_bonus <- c(5000, 8000, 10000, 2500, 150000, 10000, 2500)

users <- data.table(FUNCTIONID = user_function_id, 
                    USERNAME = user_name,  
                    WAGE = user_wage)

functions <- data.table(FUNCTIONID = function_id, 
                        FUNCTION = function_name, 
                        BONUS = function_bonus)

setkey(users, FUNCTIONID)
setkey(functions, FUNCTIONID)

What’s specific for data.table is that, by default, tables get merged on their keys. In the documentation we can read that merge.data.table() attempts to merge…

  • at first based on the shared key columns, and if there are none,
  • then based on key columns of the first argument x and if there are none,
  • then based on the common columns between the two data.tables

Since we set the keys, the following will do exactly the same:

dt1 <- merge(x = users, y = functions, all = T, by = 'FUNCTIONID')
dt1 <- merge(x = users, y = functions, all = T, by.x = 'FUNCTIONID', by.y = 'FUNCTIONID')
dt1 <- merge(x = users, y = functions, all = T)

I set the all argument to TRUE, because I just want a complete overview of all the data. This is the equivalent of a FULL OUTER JOIN. Here’s an overview of more join types. An explanation of the cross join is for another blog post.

INNERmerge(x, y, all = F)
LEFT OUTERmerge(x, y, all.x = T)
RIGHT OUTERmerge(x, y, all.y = T)
FULL OUTERmerge(x, y, all = T)

Using the shorthand merge, we can do an inner, left outer, a right outer join too:

INNERx[y, nomatch=0]
LEFT OUTERy[x]
RIGHT OUTERx[y]

Often, to answer specific business questions, you only need one or two specific columns, and filling your RAM with a joined table is not necessary and simply inefficient. That’s why you can aggregate and create new columns in the same expression, and it feels a lot like true data.table syntax.

In the following code snippet we calculate the average yearly wage across all users. As you can see, the shorthand merge provides a lot more readability.

mean(merge(functions, users)[,AVG_YEARLY_WAGE := WAGE*12+BONUS]$AVG_YEARLY_WAGE, na.rm=T)
functions[users, mean(WAGE*12+BONUS, na.rm=T)]

The shorthand merge also allows for row-wise calculations. The following snippet will produce a table with the yearly wage per user. To add more columns than the one you use in a calculation, wrap your columns in a list.

users[functions, list(USERNAME, YEARLY_WAGE = WAGE*12+BONUS), nomatch = 0, by = .EACHI]

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction into R, and less into the Data Science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

Congratulations on learning shorthand merge with data.table.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published. Required fields are marked *