Merging a data.table in R is more or less the same as merging a regular data.frame. However, there is one difference. There’s also is a shorthand way to merge two tables that feels more data.table-esque. In this blog post I elaborate on merging tables using R’s data.table library.
If you’re merging data frames, R will evoke merge.data.frame(). If you’re merging data.tables, the function merge.data.table() is invoked. On a side note, if you’re merging a combination, it depends on what you put first.
merge(dt,df) # will use merge.data.table and return a data.table
merge(df,dt) # will use merge.data.frame and return a data.frame
To proceed with this blog post, I would like to create some example data. In this example we have users and functions. Each user has a wage and a function_id, which refers to its function title and bonus at the end of the years.
library(data.table)
user_function_id <- c('abc123','abc124','abc124','abc126','abc127','abc128','abc129')
user_name <- c('John Doe','Jane Doe','Jim Doe','Jill Doe','Jack Doe','Jessy Doe','Joe Doe')
user_wage <- c(1000,1500,1800,1750,2000,2000, 2200)
function_id <- c('abc121','abc122','abc123','abc124','abc125','abc126','abc127')
function_name <- c('Internal Sales Manager',
'Account Manager','Sales Manager',
'Marketing Manager','Business Development Manager',
'Key Account Manager', 'IT Manager')
function_bonus <- c(5000, 8000, 10000, 2500, 150000, 10000, 2500)
users <- data.table(FUNCTIONID = user_function_id,
USERNAME = user_name,
WAGE = user_wage)
functions <- data.table(FUNCTIONID = function_id,
FUNCTION = function_name,
BONUS = function_bonus)
setkey(users, FUNCTIONID)
setkey(functions, FUNCTIONID)
What’s specific for data.table is that, by default, tables get merged on their keys. In the documentation we can read that merge.data.table() attempts to merge…
- at first based on the shared key columns, and if there are none,
- then based on key columns of the first argument x and if there are none,
- then based on the common columns between the two data.tables
Since we set the keys, the following will do exactly the same:
dt1 <- merge(x = users, y = functions, all = T, by = 'FUNCTIONID')
dt1 <- merge(x = users, y = functions, all = T, by.x = 'FUNCTIONID', by.y = 'FUNCTIONID')
dt1 <- merge(x = users, y = functions, all = T)
I set the all argument to TRUE, because I just want a complete overview of all the data. This is the equivalent of a FULL OUTER JOIN. Here’s an overview of more join types. An explanation of the cross join is for another blog post.
INNER | merge(x, y, all = F) |
LEFT OUTER | merge(x, y, all.x = T) |
RIGHT OUTER | merge(x, y, all.y = T) |
FULL OUTER | merge(x, y, all = T) |
Using the shorthand merge, we can do an inner, left outer, a right outer join too:
INNER | x[y, nomatch=0] |
LEFT OUTER | y[x] |
RIGHT OUTER | x[y] |
Often, to answer specific business questions, you only need one or two specific columns, and filling your RAM with a joined table is not necessary and simply inefficient. That’s why you can aggregate and create new columns in the same expression, and it feels a lot like true data.table syntax.
In the following code snippet we calculate the average yearly wage across all users. As you can see, the shorthand merge provides a lot more readability.
mean(merge(functions, users)[,AVG_YEARLY_WAGE := WAGE*12+BONUS]$AVG_YEARLY_WAGE, na.rm=T)
functions[users, mean(WAGE*12+BONUS, na.rm=T)]
The shorthand merge also allows for row-wise calculations. The following snippet will produce a table with the yearly wage per user. To add more columns than the one you use in a calculation, wrap your columns in a list.
users[functions, list(USERNAME, YEARLY_WAGE = WAGE*12+BONUS), nomatch = 0, by = .EACHI]
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction into R, and less into the Data Science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
Congratulations on learning shorthand merge with data.table.