**This is something I bumped into early in my pursuit to become a data scientist, during my Coursera course, sometime in 2017. I don’t know how I fixed it back then, but it is an issue I will always remember because it took me hours to solve. In this blog post I elaborate on splitting a data frame column on a delimiter and assigning the pieces to new columns.**

First, let’s create some dummy data. The object *df_nomissing* has the same number of delimiters in every value. The object *df_missing* is missing a delimiter in the last value.

df_missing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'), stringsAsFactors = FALSE)
df_nomissing <- data.frame(a = c(1, 2, 3), b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx,yz'), stringsAsFactors = FALSE)

Splitting column b on a comma takes a bit more work with base functions alone, but it’s possible. Here’s one way to do it. The *strsplit* function returns a list with one character vector per row. Converting that list to a data frame turns each vector into a column, so we transpose it with *t()* to get one row per original value before binding it to column a. The new columns get default names (X1, X2, X3), so we change them using *colnames()*. There are probably easier ways using base functions, but this one gets the job done.

parts <- strsplit(df_nomissing$b, ',', fixed = TRUE)
df1 <- data.frame(a = df_nomissing$a, t(as.data.frame(parts)), stringsAsFactors = FALSE, row.names = NULL)
colnames(df1) <- c('a', 'first', 'second', 'third')

Keep in mind that this only works if there is the same number of delimiters in every value. Otherwise, you’ll run into the following error:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
  arguments imply differing number of rows: 3, 2
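If you want to stay in base R even when some values have fewer pieces, one option is to pad every split vector with NA to the same length before stacking them. The snippet below is a minimal sketch of that idea, not the only way to do it; the names *parts*, *n_max*, *padded*, *pieces* and *df_base* are made up for this example, and the three column names assume that the longest value has three pieces.

```r
# Same dummy data as df_missing above: the last value has only two pieces.
df_missing <- data.frame(a = c(1, 2, 3),
                         b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'),
                         stringsAsFactors = FALSE)

# One character vector per row of the data frame.
parts <- strsplit(df_missing$b, ',', fixed = TRUE)

# Length of the longest split vector.
n_max <- max(lengths(parts))

# `length<-` extends shorter vectors to n_max, padding with NA.
padded <- lapply(parts, `length<-`, n_max)

# Stack the padded vectors as rows of a character matrix.
pieces <- do.call(rbind, padded)
colnames(pieces) <- c('first', 'second', 'third')

# Bind the new columns to column a.
df_base <- data.frame(a = df_missing$a, pieces, stringsAsFactors = FALSE)
```

The row with the missing delimiter ends up with NA in the last column instead of triggering the error.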

Using *separate()* from *tidyr*, one can easily split column values and assign them to new columns, even when a value is missing a delimiter.

library(tidyr)
df2 <- separate(data = df_missing, col = b, into = c('first', 'second', 'third'))
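For the value with only two pieces, *separate()* fills the missing column with NA and emits a warning. By default it also splits on any run of non-alphanumeric characters, not just commas. As a small variation, the sketch below passes the delimiter explicitly through *sep* and uses *fill = 'right'* to pad short rows quietly. The name *df2_filled* is made up for this example.

```r
library(tidyr)

# Same dummy data as df_missing above.
df_missing <- data.frame(a = c(1, 2, 3),
                         b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'),
                         stringsAsFactors = FALSE)

# sep = ',' splits on commas only.
# fill = 'right' pads rows with too few pieces with NA on the right,
# without the warning that the default (fill = 'warn') produces.
df2_filled <- separate(data = df_missing, col = b,
                       into = c('first', 'second', 'third'),
                       sep = ',', fill = 'right')
```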

There’s also a solution I recently discovered: the *cSplit* function from the *splitstackshape* package can also do it in a tidy way. By setting *type.convert* to FALSE, you won’t end up with factors.

library(splitstackshape)
df3 <- cSplit(indt = df_missing, splitCols = 'b', sep = ',', type.convert = FALSE)

cSplit also has the nice extra feature that you can split multiple columns with different delimiters. You can simply pass a vector of column names to *splitCols* and a vector of delimiters to *sep*.
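As a rough sketch of that multi-column case, assuming a made-up data frame *df_multi* where column b is comma-separated and column c is semicolon-separated (both the data and the name *df4* are invented for this example):

```r
library(splitstackshape)

# Hypothetical data: two columns, each with its own delimiter.
df_multi <- data.frame(a = c(1, 2),
                       b = c('abc,def', 'ghi,jkl'),
                       c = c('x;y', 'z;w'),
                       stringsAsFactors = FALSE)

# The delimiters in sep are matched to splitCols by position:
# 'b' is split on ',' and 'c' is split on ';'.
df4 <- cSplit(indt = df_multi, splitCols = c('b', 'c'),
              sep = c(',', ';'), type.convert = FALSE)
```

The original b and c columns are replaced by the split columns, and the result comes back as a data.table.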

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!

Great success!