
R: Filter a data frame on multiple partial strings


This is a blog post about a very specific topic: filtering a data frame on a set of strings, each of which should match partially. Let’s dive right in.

Matching a single string partially is fairly easy, and there are several functions to choose from, with base R’s grepl and stringr’s str_detect being the most popular ones. Partially matching multiple strings at once is less obvious if you don’t want to fall back on a traditional for loop. I built a solution using a combination of functions that I wanted to share.
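As a minimal sketch of the single-string case, assuming a small made-up vector called fruits (it is not part of the example that follows), both functions return a logical vector with one element per input string:

```r
library(stringr)

# Hypothetical vector, for illustration only
fruits <- c("fuji apple", "asian pear", "envy apple")

# Base R: pattern first, then the vector; fixed = TRUE matches the literal substring
base_match <- grepl("uji", fruits, fixed = TRUE)

# stringr: vector first, then the pattern; fixed() also requests a literal match
stringr_match <- str_detect(fruits, fixed("uji"))
```

Both calls flag only the first element here. Note the reversed argument order between the two functions.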

Let’s say we have a data frame that contains some varieties of pears and apples, and I want to select the varieties whose name contains “esta” or “uji”.

library(stringr)
library(data.table)

df <- data.table(fruit = c('zestar apple',
                           'redlove apple',
                           'kiku apple',
                           'bonne de beugny pear',
                           'chinese white pear',
                           'fuji apple',
                           'envy apple',
                           'asian pear'),
                 count = c(5, 8, 9, 3, 2, 1, 5, 7))

to_match_partially <- c('esta','uji') # strings to match


To explain the code below, I really outdid myself with indentation. Let’s go from the inside to the outside.

1. I match the column fruit of the data frame df against x, the string that lapply passes in (see #3). The result will be a vector of TRUE and FALSE values, one per row.
2. I convert these to numeric (1 for a match, 0 otherwise), because in #6 I’ll want to sum over them.
3. This is where I use lapply to loop over the strings I want to match partially.
4. The lapply function returns a list with one numeric vector per string to match. However, I need them unlisted into one long, one-dimensional vector, to create a matrix out of them in #5.
5. I convert the vector to a matrix that has as many columns as there are strings to match. Because the matrix is filled column by column, each column holds the results for one string and each row corresponds to one row of df.
6. I can now sum each row of that matrix using rowSums. If this number is higher than zero, at least one of the strings matched.
7. I convert the row sums to a logical: zero becomes FALSE, anything else becomes TRUE.
8. I use that logical vector to select the matching rows of df.

selection <- as.logical( # 7
  rowSums( # 6
    matrix( # 5
      unlist( # 4
        lapply(to_match_partially, function(x) { # 3
          as.numeric( # 2
            str_detect(df$fruit, x) # 1
          )
        })
      ),
      ncol = length(to_match_partially), byrow = FALSE)
  )
)

df[selection,] # 8
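For comparison, here is a sketch of two shorter variants that should give the same selection on this example. They are not the approach described above and have not been benchmarked; the names pattern, selection_regex, selection_reduce and result are made up for this sketch. Option A assumes the strings to match contain no regular expression metacharacters (such as . or +), because it glues them into a single pattern.

```r
library(stringr)
library(data.table)

# Same example data as above, repeated so the block runs on its own
df <- data.table(fruit = c('zestar apple', 'redlove apple', 'kiku apple',
                           'bonne de beugny pear', 'chinese white pear',
                           'fuji apple', 'envy apple', 'asian pear'),
                 count = c(5, 8, 9, 3, 2, 1, 5, 7))
to_match_partially <- c('esta', 'uji')

# Option A: build one regular expression with alternation, here "esta|uji".
# Assumption: none of the strings contains regex metacharacters.
pattern <- paste(to_match_partially, collapse = "|")
selection_regex <- str_detect(df$fruit, pattern)

# Option B: one literal match per string, combined with an element-wise OR.
# fixed = TRUE treats each string as plain text, so metacharacters are safe.
selection_reduce <- Reduce(`|`, lapply(to_match_partially, function(x) {
  grepl(x, df$fruit, fixed = TRUE)
}))

result <- df[selection_regex, ]
```

Option A is the most compact. Option B stays closer to the lapply idea above, but it skips the numeric conversion and the matrix by combining the logical vectors directly.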


I can’t make the claim that this is the fastest solution. But if you’ve found another solution, based on a completely different way of thinking, I’d be glad to benchmark it. Let me know in the comments of this post!

