Split a data frame column on a delimiter in R

This is something I bumped into early in my pursuit to become a data scientist, during my Coursera course, sometime in 2017. I don’t know how I fixed it back then, but it is an issue I will always remember because it took me hours to solve. In this blog post I elaborate on splitting a data frame column on a delimiter and assigning the pieces to new columns.

First, let’s create some dummy data. The object df_nomissing has the same number of delimiters in every value. The object df_missing has one delimiter fewer in the last value, so that value splits into two pieces instead of three.

df_missing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'), stringsAsFactors = FALSE)
df_nomissing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx,yz'), stringsAsFactors = FALSE)
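
Before splitting anything, it can help to look at what strsplit() returns for these two objects. The snippet below is only an illustration; the names pieces_nomissing and pieces_missing are mine and do not appear elsewhere in this post.

```r
# Recreate the dummy data so the snippet runs on its own
df_missing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'), stringsAsFactors = FALSE)
df_nomissing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx,yz'), stringsAsFactors = FALSE)

# strsplit() returns a list with one character vector per input value
pieces_nomissing <- strsplit(df_nomissing$b, ',', fixed = TRUE)
pieces_missing <- strsplit(df_missing$b, ',', fixed = TRUE)

# lengths() counts the pieces in each list element
lengths(pieces_nomissing)  # 3 3 3
lengths(pieces_missing)    # 3 3 2
```

The uneven lengths in the second result are what make df_missing harder to handle.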

Splitting column b on a comma takes a little work, but it’s possible to do it with only base functions. Here’s one way to do it. The strsplit function returns a list with one character vector per row. Passing that list to data.frame() turns each vector into a column, so we transpose the result with t() to put the pieces of each value back into a single row. This leaves weird column names, so we set them with colnames(). There are probably easier ways using base functions, but this is how I did it.

df1 <- with(df_nomissing, data.frame(a = a, t(data.frame(strsplit(b, ',', fixed = TRUE))), row.names = NULL, stringsAsFactors = FALSE))
colnames(df1) <- c('a', 'first', 'second', 'third')

Keep in mind that this only works if every value contains the same number of delimiters. Otherwise, you’ll run into the following error:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  :
  arguments imply differing number of rows: 3, 2
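
If you want to stay in base R with ragged values, one possible workaround is to pad every split vector to the same length before building the data frame. This is a minimal sketch, not the approach used above; the names pieces, n_pieces, padded and df_padded are my own.

```r
# Recreate the ragged dummy data so the snippet runs on its own
df_missing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'), stringsAsFactors = FALSE)

# One character vector per row; the last one has only two pieces
pieces <- strsplit(df_missing$b, ',', fixed = TRUE)

# Pad every vector to the longest length; `length<-` fills the gap with NA
n_pieces <- max(lengths(pieces))
padded <- lapply(pieces, function(p) {
  length(p) <- n_pieces
  p
})

# Stack the padded vectors as rows of a character matrix, then add column a
df_padded <- data.frame(a = df_missing$a, do.call(rbind, padded), stringsAsFactors = FALSE)
colnames(df_padded) <- c('a', 'first', 'second', 'third')
```

The third row of df_padded ends up with NA in the column third instead of triggering the error.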

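Using separate() from tidyr, one can easily split column values and assign them to new columns. It also copes with df_missing: the missing third piece is filled with NA, and tidyr issues a warning about it.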

library(tidyr)
df2 <- separate(data = df_missing, col = b, into = c('first', 'second', 'third'))
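
By default, separate() splits on any run of non-alphanumeric characters and warns when a value has too few pieces. The sketch below spells out both choices with the sep and fill arguments; df2_quiet is a name I made up for the illustration.

```r
library(tidyr)

# Recreate the ragged dummy data so the snippet runs on its own
df_missing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'), stringsAsFactors = FALSE)

# sep = ',' splits on the comma only;
# fill = 'right' pads missing pieces on the right with NA without a warning
df2_quiet <- separate(data = df_missing, col = b, into = c('first', 'second', 'third'), sep = ',', fill = 'right')
```

In tidyr 1.3.0 and later, separate() is marked as superseded in favour of separate_wider_delim(), so that function may be worth a look for new code.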

There’s also a solution I recently discovered: the cSplit function from the splitstackshape package, which can do this in a tidy way as well. By setting type.convert to FALSE, you won’t end up with factors.

library(splitstackshape)
df3 <- cSplit(indt = df_missing, splitCols = 'b', sep = ',', type.convert = FALSE)

cSplit also has the nice extra feature that you can split multiple columns with different delimiters. You can simply pass a vector of column names to splitCols and a vector of delimiters to sep.
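
As a minimal sketch of that feature, the example below splits two columns with two different delimiters in one call. The data frame df_multi and its column d are hypothetical and exist only for this illustration.

```r
library(splitstackshape)

# Hypothetical data: column b is comma-delimited, column d is semicolon-delimited
df_multi <- data.frame(a = c(1, 2), b = c('abc,def', 'ghi,jkl'), d = c('x;y', 'z;w'), stringsAsFactors = FALSE)

# The delimiters in sep are matched to the columns in splitCols by position
df4 <- cSplit(indt = df_multi, splitCols = c('b', 'd'), sep = c(',', ';'), type.convert = FALSE)
```

The new columns are named after the original column plus a running number, for example b_1 and d_2.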

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

Great success!
