This is something I bumped into early in my pursuit of becoming a data scientist, during a Coursera course, sometime in 2017. I don’t remember how I fixed it back then, but it is an issue I will always remember because it took me hours to solve. In this blog post I elaborate on splitting a data frame column on a delimiter and assigning the pieces to new columns.
First, let’s create some dummy data. The object df_nomissing has the same number of delimiters in every value. The object df_missing is missing a delimiter in the last value.
df_missing <- data.frame(a = c(1,2,3), b = c('abc,def,ghi','jkl,mno,pqr','stu,vwx'), stringsAsFactors = FALSE)
df_nomissing <- data.frame(a = c(1,2,3), b = c('abc,def,ghi','jkl,mno,pqr','stu,vwx,yz'), stringsAsFactors = FALSE)
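Splitting column b on a comma is not as easy as it sounds, but it’s possible with only base functions. Here’s one way to do it. The strsplit function returns a list with one character vector per row. Converting that list to a data frame puts each row’s pieces in a column of their own (with weird auto-generated names), so we transpose it with t() to get one row per original value and drop those names with unname(). After binding the result to column a, the new columns still have generic names, so we change them using colnames(). There are probably easier ways using base functions, but this one works.
parts <- strsplit(df_nomissing$b, ',', fixed = TRUE)
pieces <- unname(t(as.data.frame(parts)))
df1 <- data.frame(a = df_nomissing$a, pieces, stringsAsFactors = FALSE)
colnames(df1) <- c('a', 'first', 'second', 'third')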
Keep in mind that this only works if there are the same number of delimiters in every value. Otherwise, you’ll run into the following error:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 3, 2
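If the values are ragged, one base-only workaround is to pad each split result with NA before binding the pieces together. The following is a minimal sketch of that idea, not the approach used above; the names parts, width, padded and df_padded are made up for illustration, and the three column names assume that the longest value has three pieces.

```r
df_missing <- data.frame(a = c(1, 2, 3),
                         b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'),
                         stringsAsFactors = FALSE)

# Split every value; the last element only has two pieces
parts <- strsplit(df_missing$b, ',', fixed = TRUE)

# Pad each vector with NA up to the length of the longest one
width <- max(lengths(parts))
padded <- lapply(parts, `length<-`, width)

# Stack the padded vectors as rows and attach column a
pieces <- do.call(rbind, padded)
colnames(pieces) <- c('first', 'second', 'third')  # assumes width == 3
df_padded <- data.frame(a = df_missing$a, pieces, stringsAsFactors = FALSE)
```

Because every vector has the same length after padding, rbind() no longer has to recycle anything, and the missing piece simply shows up as NA in the last row.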
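Using separate() from tidyr, one can easily split column values and assign them to new columns. It also copes with a missing delimiter: the value with too few pieces is padded with NA on the right, and tidyr emits a warning to tell you so.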
library(tidyr)
df2 <- separate(data = df_missing, col = b, into = c('first', 'second', 'third'))
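The call above relies on the default separator of separate(), which matches any run of non-alphanumeric characters. As a minimal sketch, assuming you only want to split on a comma and already expect the short row to be padded, the sep and fill arguments make both choices explicit; df2_quiet is just an illustrative name.

```r
library(tidyr)

df_missing <- data.frame(a = c(1, 2, 3),
                         b = c('abc,def,ghi', 'jkl,mno,pqr', 'stu,vwx'),
                         stringsAsFactors = FALSE)

# sep = ',' splits on commas only;
# fill = 'right' pads short rows with NA on the right without a warning
df2_quiet <- separate(data = df_missing, col = b,
                      into = c('first', 'second', 'third'),
                      sep = ',', fill = 'right')
```

Setting fill = 'left' would put the NA in the first new column instead, which can be handy when the trailing pieces are the ones that are always present.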
There’s also a solution I recently discovered: the cSplit function from the splitstackshape package, which can also do it in a tidy way. By setting type.convert to FALSE, you won’t end up with factors. Note that cSplit returns a data.table rather than a plain data frame.
library(splitstackshape)
df3 <- cSplit(indt = df_missing, splitCols = 'b', sep = ',', type.convert = FALSE)
cSplit also has the nice extra feature that you can split multiple columns with different delimiters. You can simply pass a vector of column names to splitCols and a vector of delimiters to sep.
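To illustrate, here is a small sketch with a second delimited column. The data frame df_multi and its column c are invented for this example, and the snippet assumes the splitstackshape package is installed.

```r
library(splitstackshape)

# Hypothetical data: column b is comma-separated, column c is semicolon-separated
df_multi <- data.frame(a = c(1, 2),
                       b = c('abc,def', 'ghi,jkl'),
                       c = c('x;y', 'z;w'),
                       stringsAsFactors = FALSE)

# One delimiter per column, in the same order as splitCols
df4 <- cSplit(indt = df_multi, splitCols = c('b', 'c'),
              sep = c(',', ';'), type.convert = FALSE)
```

The order matters here: the first delimiter is applied to the first column in splitCols, the second delimiter to the second column, and so on.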
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!