Home ยป Partial match of pandas DataFrame column name

Partial match of pandas DataFrame column name

  • by
partial match pandas dataframe columns
Want to do a random act of kindness? Share this post.

Here’s another post about seemingly convenient things in pandas. Let’s produce a DataFrame from partially matched columns from another DataFrame.

Let’s create some dummy data.

import pandas as pd
from timeit import timeit
import re

taste = ['sweet','sour','sweet','bitter']
color = ['red','green','yellow','red']
secondary_color = ['green','red','brown','orange']
fruit = ['apple','pear','banana','cherry']

data = {'taste': taste, 'color': color, 'fruit': fruit, 'secondary_color': secondary_color}

df = pd.DataFrame(data)

Let’s say we are trying to select the columns that contain the world ‘color’. Here’s a solution I found on the web. By converting the column names to a pandas series and using its vectorized string operations we can filter the columns names using the contains() functions. This solution is not particularly fast: 1.12 milliseconds.

%%timeit
df[df.columns[df.columns.to_series().str.contains('color')]] # Vectorized string operations

We can do better. And when I think about speed, I think about list comprehension. Here, we use re’s search function to find a particular substring while looping over each and every column name: 730 nanoseconds.

%%timeit
df[df.columns[[bool(re.compile('color').search(x)) for x in df.columns.values]]]

But we can do even better. Apparently, pandas has a built-in solution. It’s the filter() function. By using the like parameter, we set a filter that looks for partial matches. The result: 540 nanoseconds. That’s 26% faster than the list comprehension, and 52% faster than the vectorized string operations.

%%timeit
df_colors = df.filter(like='color')

Great success!

Want to do a random act of kindness? Share this post.