Merging a data.table in R is more or less the same as merging a regular data.frame. However, there is one difference. There’s also is a shorthand way to merge two tables that feels more data.table-esque. In this blog post I elaborate on merging tables using R’s data.table library.
If you’re merging data frames, R will evoke merge.data.frame(). If you’re merging data.tables, the function merge.data.table() is invoked. On a side note, if you’re merging a combination, it depends on what you put first.
merge(dt,df) # will use merge.data.table and return a data.table merge(df,dt) # will use merge.data.frame and return a data.frame
To proceed with this blog post, I would like to create some example data. In this example we have users and functions. Each user has a wage and a function_id, which refers to its function title and bonus at the end of the years.
library(data.table) user_function_id <- c('abc123','abc124','abc124','abc126','abc127','abc128','abc129') user_name <- c('John Doe','Jane Doe','Jim Doe','Jill Doe','Jack Doe','Jessy Doe','Joe Doe') user_wage <- c(1000,1500,1800,1750,2000,2000, 2200) function_id <- c('abc121','abc122','abc123','abc124','abc125','abc126','abc127') function_name <- c('Internal Sales Manager', 'Account Manager','Sales Manager', 'Marketing Manager','Business Development Manager', 'Key Account Manager', 'IT Manager') function_bonus <- c(5000, 8000, 10000, 2500, 150000, 10000, 2500) users <- data.table(FUNCTIONID = user_function_id, USERNAME = user_name, WAGE = user_wage) functions <- data.table(FUNCTIONID = function_id, FUNCTION = function_name, BONUS = function_bonus) setkey(users, FUNCTIONID) setkey(functions, FUNCTIONID)
What’s specific for data.table is that, by default, tables get merged on their keys. In the documentation we can read that merge.data.table() attempts to merge…
- at first based on the shared key columns, and if there are none,
- then based on key columns of the first argument x and if there are none,
- then based on the common columns between the two data.tables
Since we set the keys, the following will do exactly the same:
dt1 <- merge(x = users, y = functions, all = T, by = 'FUNCTIONID') dt1 <- merge(x = users, y = functions, all = T, by.x = 'FUNCTIONID', by.y = 'FUNCTIONID') dt1 <- merge(x = users, y = functions, all = T)
I set the all argument to TRUE, because I just want a complete overview of all the data. This is the equivalent of a FULL OUTER JOIN. Here’s an overview of more join types. An explanation of the cross join is for another blog post.
|INNER||merge(x, y, all = F)|
|LEFT OUTER||merge(x, y, all.x = T)|
|RIGHT OUTER||merge(x, y, all.y = T)|
|FULL OUTER||merge(x, y, all = T)|
Using the shorthand merge, we can do an inner, left outer, a right outer join too:
Often, to answer specific business questions, you only need one or two specific columns, and filling your RAM with a joined table is not necessary and simply inefficient. That’s why you can aggregate and create new columns in the same expression, and it feels a lot like true data.table syntax.
In the following code snippet we calculate the average yearly wage across all users. As you can see, the shorthand merge provides a lot more readability.
mean(merge(functions, users)[,AVG_YEARLY_WAGE := WAGE*12+BONUS]$AVG_YEARLY_WAGE, na.rm=T) functions[users, mean(WAGE*12+BONUS, na.rm=T)]
The shorthand merge also allows for row-wise calculations. The following snippet will produce a table with the yearly wage per user. To add more columns than the one you use in a calculation, wrap your columns in a list.
users[functions, list(USERNAME, YEARLY_WAGE = WAGE*12+BONUS), nomatch = 0, by = .EACHI]
Congratulations on learning shorthand merge with data.table.