Split a data frame column on a delimiter in R

This is something I bumped into early in my journey to become a data scientist, during my Coursera course, sometime in 2017. I don’t know how I fixed it back then, but it is an issue I will always remember because it took me hours to solve. In this blog post I show how to split a data frame column on a delimiter and assign the pieces to new columns.

First, let’s create some dummy data. In df_nomissing, every value of column b has the same number of delimiters. In df_missing, the last value has one delimiter fewer than the others, so it splits into two pieces instead of three.

df_missing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'), stringsAsFactors = FALSE)
df_nomissing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx,yz'), stringsAsFactors = FALSE)
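Before splitting, it can help to check whether every value really yields the same number of pieces. The snippet below is a minimal sketch in base R; the names pieces_missing, pieces_nomissing, same_missing and same_nomissing are only illustrative and not part of the original post.

```r
# Recreate the dummy data from above
df_missing <- data.frame(a = c(1, 2, 3),
                         b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'),
                         stringsAsFactors = FALSE)
df_nomissing <- data.frame(a = c(1, 2, 3),
                           b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx,yz'),
                           stringsAsFactors = FALSE)

# Count how many pieces each value of b splits into
pieces_missing <- lengths(strsplit(df_missing$b, ',', fixed = TRUE))
pieces_nomissing <- lengths(strsplit(df_nomissing$b, ',', fixed = TRUE))

# TRUE only when every value yields the same number of pieces
same_missing <- length(unique(pieces_missing)) == 1
same_nomissing <- length(unique(pieces_nomissing)) == 1
```

If the check comes back FALSE, the split pieces cannot be combined into columns without some form of padding.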

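Splitting column b on a comma takes a bit of work in base R, but it’s possible with only base functions. Here’s one way to do it. The strsplit function returns a list of three character vectors, one per value of b, and we pass that list to data.frame(). Be aware that data.frame() turns each list element into its own column, so the pieces of each original value run down a column rather than across a row; with three values of three pieces each the result is square, which makes this easy to overlook. The generated column names are also awkward, so we can change them using colnames(). There are probably easier ways using base functions, but this is how I did it.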

df1 <- with(df_nomissing, data.frame(a = a, b = strsplit(b, ',', fixed = TRUE), stringsAsFactors = FALSE, check.rows = FALSE))
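Because data.frame() makes one column out of each list element, the line above places the pieces of the first value in the first new column, not in the first row. If you want one row per original value, one option is to stack the pieces with rbind() first and then rename the columns with colnames(). This is only a sketch; the object names parts, pieces and df1_rows and the column names first, second and third are my own choices for illustration.

```r
df_nomissing <- data.frame(a = c(1, 2, 3),
                           b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx,yz'),
                           stringsAsFactors = FALSE)

# A list with one character vector per value of b
parts <- strsplit(df_nomissing$b, ',', fixed = TRUE)

# Stack the vectors row-wise: one row per original value
pieces <- do.call(rbind, parts)

# Combine with column a and give the new columns readable names
df1_rows <- data.frame(a = df_nomissing$a, pieces, stringsAsFactors = FALSE)
colnames(df1_rows) <- c('a', 'first', 'second', 'third')
```

Like the original line, this assumes every value splits into the same number of pieces.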

Keep in mind that this only works if every value has the same number of delimiters. Otherwise, as with df_missing, you’ll run into the following error:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 3, 2
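If you want to stay in base R with uneven values, one possible workaround is to pad the shorter vectors with NA before combining them. The following is a sketch under that assumption, not the approach from the original post; parts, n_max, padded, pieces and df1_padded are illustrative names, and the b_1, b_2, b_3 column names are simply generated with paste0().

```r
df_missing <- data.frame(a = c(1, 2, 3),
                         b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'),
                         stringsAsFactors = FALSE)

parts <- strsplit(df_missing$b, ',', fixed = TRUE)

# Largest number of pieces found in any value
n_max <- max(lengths(parts))

# Extend every vector to n_max; the added positions become NA
padded <- lapply(parts, function(p) {
  length(p) <- n_max
  p
})

# Stack row-wise and name the new columns b_1, b_2, ...
pieces <- do.call(rbind, padded)
colnames(pieces) <- paste0('b_', seq_len(n_max))

df1_padded <- data.frame(a = df_missing$a, pieces, stringsAsFactors = FALSE)
```

The last row then holds NA in its third piece instead of triggering the error.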

Using separate() from tidyr, one can easily split column values and assign them to new columns.

library(tidyr)
df2 <- separate(data = df_missing, col = b, into = c('first', 'second', 'third'))
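By default, separate() splits on any run of non-alphanumeric characters and warns when a value yields fewer pieces than there are names in into, filling the gap with NA. A slightly more explicit variant, sketched below, sets sep to a comma and uses the fill argument to pad on the right without the warning. The name df2_quiet is just for illustration.

```r
library(tidyr)

df_missing <- data.frame(a = c(1, 2, 3),
                         b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'),
                         stringsAsFactors = FALSE)

# Split on a comma; values with too few pieces get NA on the right
df2_quiet <- separate(data = df_missing, col = b,
                      into = c('first', 'second', 'third'),
                      sep = ',', fill = 'right')
```

Use fill = 'left' instead if the missing piece should end up in the first new column.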

There’s also a solution I recently discovered: the cSplit function from the splitstackshape package, which does the job in a tidy way and returns a data.table. By setting type.convert to FALSE, the split values are left as character strings, so you won’t end up with factors.

library(splitstackshape)
df3 <- cSplit(indt = df_missing, splitCols = 'b', sep = ',', type.convert = FALSE)

cSplit also has the nice extra feature that you can split multiple columns with different delimiters. You can simply pass a vector of column names to splitCols and a vector of delimiters to sep.
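As a sketch of that feature, the example below splits two columns that use different delimiters in a single call. The data frame df_multi, its column d and the result name df4 are made up for this illustration.

```r
library(splitstackshape)

# Hypothetical data: b uses commas, d uses semicolons
df_multi <- data.frame(a = c(1, 2),
                       b = c('abc,def', 'ghi,jkl'),
                       d = c('x;y', 'z;w'),
                       stringsAsFactors = FALSE)

# One delimiter per column, given in the same order as splitCols
df4 <- cSplit(indt = df_multi, splitCols = c('b', 'd'),
              sep = c(',', ';'), type.convert = FALSE)
```

Each split column is replaced by numbered columns, so here you get two new columns for b and two for d next to a.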

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

Great success!
