R: Filter a data frame on multiple partial strings

This is a blog post about a very specific topic. I wanted to filter a data frame on a set of strings that I wanted to match partially. Let’s dive right in.

Matching a single string partially is fairly easy, and there are several functions to choose from, with base R’s grepl and stringr’s str_detect being the most popular. Partially matching multiple strings at once is harder if you don’t want to fall back on a traditional for loop. I built a solution from a combination of functions that I wanted to share.
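As a minimal sketch of the single-string case, both functions can be called like this. The three-element vector is made up for illustration, and fixed matching is one possible choice here, not a requirement.

```r
library(stringr)

# Illustrative vector, not the data frame used later in this post
fruit <- c('zestar apple', 'kiku apple', 'fuji apple')

# Base R: fixed = TRUE treats the pattern as a literal string, not a regex
base_hits <- grepl('uji', fruit, fixed = TRUE)

# stringr: fixed() likewise requests a literal match
stringr_hits <- str_detect(fruit, fixed('uji'))

base_hits
# [1] FALSE FALSE  TRUE
```

Both calls return one logical value per element, which is the building block for everything that follows.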

Let’s say we have a data frame that contains some varieties of pears and apples, and I want to select the varieties that contain either “esta” or “uji”.

library(stringr)
library(data.table)

df <- data.table(fruit = c('zestar apple',
                           'redlove apple',
                           'kiku apple',
                           'bonne de beugny pear',
                           'chinese white pear',
                           'fuji apple',
                           'envy apple',
                           'asian pear'),
                 count = c(5,8,9,3,2,1,5,7))

to_match_partially <- c('esta','uji') # strings to match          

To explain the code below, I really outdid myself with indentation. Let’s go from the inside to the outside.

  1. I match the column fruit of the data frame df against x, which is supplied further up by the lapply function. The result will be a vector of TRUEs and FALSEs.
  2. I convert these to numeric, because in #6 I’ll want to sum over these.
  3. This is where I used lapply to loop over the strings I want to match partially.
  4. The lapply function returns a list with one 0/1 vector per string to match. However, I need them unlisted (one long, one-dimensional vector) to create a matrix out of them in step #5.
  5. I convert the vector to a matrix that contains the same number of columns as there are strings to match.
  6. I can now sum each row of that matrix using rowSums. If this number is higher than zero, there was a match.
  7. I convert the row sums to a logical vector.
  8. I use that logical vector to subset the data frame.

selection <- as.logical( # 7
  rowSums( # 6
    matrix( # 5
      unlist( # 4
        lapply(to_match_partially, function(x) { #3
          as.numeric( # 2
            str_detect(df$fruit,x) # 1
          )
        })
      ),
      ncol = length(to_match_partially), byrow = FALSE)
  )
)

df[selection,] # 8
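
To make the intermediate values visible, the nested call can be unpacked into separate steps. This is only a sketch on a shortened, illustrative fruit vector; the names hits and m are introduced here and do not appear in the code above.

```r
library(stringr)

# Shortened, illustrative input
fruit <- c('zestar apple', 'redlove apple', 'fuji apple', 'asian pear')
to_match_partially <- c('esta', 'uji')

# Steps 1-3: one 0/1 vector per string to match
hits <- lapply(to_match_partially, function(x) as.numeric(str_detect(fruit, x)))

# Steps 4-5: the unlisted vector fills the matrix column by column,
# so column j holds the matches for string j
m <- matrix(unlist(hits), ncol = length(to_match_partially), byrow = FALSE)
m
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    0
# [3,]    0    1
# [4,]    0    0

# Steps 6-7: a row sum above zero means at least one string matched
selection <- as.logical(rowSums(m))
selection
# [1]  TRUE FALSE  TRUE FALSE
```

Each row of the matrix corresponds to one row of the data, which is why the row sums line up with the rows to keep.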

I can’t make the claim that this is the fastest solution. But if you’ve found another solution, based on a completely different way of thinking, I’d be glad to benchmark it. Let me know in the comments of this post!
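
For comparison, two more compact variants are sketched below; no speed claim is made for either. Variant A assumes the strings contain no regex metacharacters, since they are pasted into a single pattern. Variant B matches each string literally. The shortened data and the names pattern, selection_a and selection_b are illustrative.

```r
library(stringr)
library(data.table)

# Shortened, illustrative version of the data frame
df <- data.table(fruit = c('zestar apple', 'kiku apple', 'fuji apple', 'asian pear'),
                 count = c(5, 9, 1, 7))
to_match_partially <- c('esta', 'uji')

# Variant A: join the strings with the regex "or" operator
# (assumes none of the strings contains regex metacharacters)
pattern <- paste(to_match_partially, collapse = '|')  # "esta|uji"
selection_a <- str_detect(df$fruit, pattern)

# Variant B: one logical vector per string, combined element-wise with |
selection_b <- Reduce(`|`, lapply(to_match_partially, function(x) {
  str_detect(df$fruit, fixed(x))
}))

df[selection_a, ]
```

Both variants skip the numeric conversion and the matrix, because logical vectors can be combined directly.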

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
