Home » Another way of merging tables with data.table in R

Another way of merging tables with data.table in R

  • by
  • 3 min read

Merging a data.table in R is more or less the same as merging a regular data.frame. However, there is one difference. There’s also is a shorthand way to merge two tables that feels more data.table-esque. In this blog post I elaborate on merging tables using R’s data.table library.

If you’re merging data frames, R will evoke merge.data.frame(). If you’re merging data.tables, the function merge.data.table() is invoked. On a side note, if you’re merging a combination, it depends on what you put first.

merge(dt,df) # will use merge.data.table and return a data.table
merge(df,dt) # will use merge.data.frame and return a data.frame

To proceed with this blog post, I would like to create some example data. In this example we have users and functions. Each user has a wage and a function_id, which refers to its function title and bonus at the end of the years.

library(data.table)

user_function_id <- c('abc123','abc124','abc124','abc126','abc127','abc128','abc129')
user_name <- c('John Doe','Jane Doe','Jim Doe','Jill Doe','Jack Doe','Jessy Doe','Joe Doe')
user_wage <- c(1000,1500,1800,1750,2000,2000, 2200)

function_id <- c('abc121','abc122','abc123','abc124','abc125','abc126','abc127')
function_name <- c('Internal Sales Manager',
                   'Account Manager','Sales Manager',
                   'Marketing Manager','Business Development Manager',
                   'Key Account Manager', 'IT Manager')
function_bonus <- c(5000, 8000, 10000, 2500, 150000, 10000, 2500)

users <- data.table(FUNCTIONID = user_function_id, 
                    USERNAME = user_name,  
                    WAGE = user_wage)

functions <- data.table(FUNCTIONID = function_id, 
                        FUNCTION = function_name, 
                        BONUS = function_bonus)

setkey(users, FUNCTIONID)
setkey(functions, FUNCTIONID)

What’s specific for data.table is that, by default, tables get merged on their keys. In the documentation we can read that merge.data.table() attempts to merge…

  • at first based on the shared key columns, and if there are none,
  • then based on key columns of the first argument x and if there are none,
  • then based on the common columns between the two data.tables

Since we set the keys, the following will do exactly the same:

dt1 <- merge(x = users, y = functions, all = T, by = 'FUNCTIONID')
dt1 <- merge(x = users, y = functions, all = T, by.x = 'FUNCTIONID', by.y = 'FUNCTIONID')
dt1 <- merge(x = users, y = functions, all = T)

I set the all argument to TRUE, because I just want a complete overview of all the data. This is the equivalent of a FULL OUTER JOIN. Here’s an overview of more join types. An explanation of the cross join is for another blog post.

INNERmerge(x, y, all = F)
LEFT OUTERmerge(x, y, all.x = T)
RIGHT OUTERmerge(x, y, all.y = T)
FULL OUTERmerge(x, y, all = T)

Using the shorthand merge, we can do an inner, left outer, a right outer join too:

INNERx[y, nomatch=0]
LEFT OUTERy[x]
RIGHT OUTERx[y]

Often, to answer specific business questions, you only need one or two specific columns, and filling your RAM with a joined table is not necessary and simply inefficient. That’s why you can aggregate and create new columns in the same expression, and it feels a lot like true data.table syntax.

In the following code snippet we calculate the average yearly wage across all users. As you can see, the shorthand merge provides a lot more readability.

mean(merge(functions, users)[,AVG_YEARLY_WAGE := WAGE*12+BONUS]$AVG_YEARLY_WAGE, na.rm=T)
functions[users, mean(WAGE*12+BONUS, na.rm=T)]

The shorthand merge also allows for row-wise calculations. The following snippet will produce a table with the yearly wage per user. To add more columns than the one you use in a calculation, wrap your columns in a list.

users[functions, list(USERNAME, YEARLY_WAGE = WAGE*12+BONUS), nomatch = 0, by = .EACHI]

Congratulations on learning shorthand merge with data.table.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published. Required fields are marked *