A problem you run into fairly early in a data scientist’s career is replacing many patterns in a string at once. Of course, you can write a long series of gsub calls, but that gets tiring really fast. In this blog post I walk through three functions from three separate packages that do the same thing in a more concise way.
First, let’s create a dummy sentence.
s <- 'The quick brown fox jumps over the lazy dog'
The gsubfn function (from the package of the same name) accepts a pattern to look for and a named list that maps each match to its replacement. It’s not really fast: measured with microbenchmark, this call took 250 microseconds to run.
library(gsubfn)
# Each match is looked up by name in the list and swapped for its value
s <- gsubfn('fox|over|dog', list('fox' = 'horse', 'over' = 'on', 'dog' = 'wolf'), s)
We can also use the popular magrittr package to achieve the same goal. Chaining gsub calls with the pipe operator is fairly concise, and it’s also twice as fast as gsubfn(). However, it’s still a lot to type, and the compound assignment pipe %<>% only saves keystrokes when the variable name is long.
library(magrittr)
# %<>% pipes s through the chain and assigns the result back to s
s %<>% gsub('fox', 'horse', .) %>% gsub('over', 'on', .) %>% gsub('dog', 'wolf', .)
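To make clear what the pipe chain does, here is a minimal sketch comparing it with the equivalent nested gsub calls. The variable names nested and piped are only for illustration, and the sketch uses %>% instead of %<>% so that s itself is left untouched.

```r
library(magrittr)

s <- 'The quick brown fox jumps over the lazy dog'

# The same three replacements written as nested calls (read inside out)
nested <- gsub('dog', 'wolf', gsub('over', 'on', gsub('fox', 'horse', s)))

# The pipe version reads left to right; the dot marks where the
# previous result is inserted
piped <- s %>%
  gsub('fox', 'horse', .) %>%
  gsub('over', 'on', .) %>%
  gsub('dog', 'wolf', .)

identical(nested, piped)
```

Both forms do exactly the same work, which is why the pipe buys readability rather than speed over plain gsub.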
Finally, there’s stringi. In my opinion, it’s the Swiss Army knife of string manipulation, and it’s known to be blazingly fast. In a microbenchmark run, it was up to 8 times faster than the gsubfn version we started with.
library(stringi)
# vectorize_all = FALSE applies every pattern in turn to the same string
s <- stri_replace_all_regex(s, c('fox', 'over', 'dog'), c('horse', 'on', 'wolf'), vectorize_all = FALSE)
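The last argument in that call, whose full name is vectorize_all, is what makes the replacements apply one after another to the same string. The sketch below contrasts the two settings; the names sequential and parallel are just illustrative.

```r
library(stringi)

s <- 'The quick brown fox jumps over the lazy dog'
patterns     <- c('fox', 'over', 'dog')
replacements <- c('horse', 'on', 'wolf')

# vectorize_all = FALSE: every pattern is applied in turn to each input
# string, so a single input yields a single, fully replaced output
sequential <- stri_replace_all_regex(s, patterns, replacements,
                                     vectorize_all = FALSE)

# vectorize_all = TRUE (the default): s is recycled against the patterns,
# giving one output per pattern with only that pattern replaced
parallel <- stri_replace_all_regex(s, patterns, replacements,
                                   vectorize_all = TRUE)

length(sequential)  # 1
length(parallel)    # 3
```

Leaving the argument at its default is an easy mistake to make: the call still runs, but returns three partially replaced sentences instead of one.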
Conclusion: of the three approaches, the fastest way to replace multiple patterns in a string is stringi. Another problem solved!
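If you want to check the timings on your own machine, a comparison along these lines should work. This is only a sketch: the labels, the number of repetitions (times = 1000L) and the use of %>% instead of %<>% (so that s is not modified between runs) are my choices here, and your numbers will differ from the ones quoted above.

```r
library(gsubfn)
library(magrittr)
library(stringi)
library(microbenchmark)

s <- 'The quick brown fox jumps over the lazy dog'

# Time the three approaches on the same, unmodified input string
bench <- microbenchmark(
  gsubfn = gsubfn('fox|over|dog',
                  list('fox' = 'horse', 'over' = 'on', 'dog' = 'wolf'), s),
  pipe = s %>%
    gsub('fox', 'horse', .) %>%
    gsub('over', 'on', .) %>%
    gsub('dog', 'wolf', .),
  stringi = stri_replace_all_regex(s, c('fox', 'over', 'dog'),
                                   c('horse', 'on', 'wolf'),
                                   vectorize_all = FALSE),
  times = 1000L
)

print(bench)
```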
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less of the data science side, I can absolutely recommend this book by Richard Cotton. Hope it helps!