Partial match of pandas DataFrame column name

Here’s another post about seemingly convenient things in pandas. Let’s produce a DataFrame from partially matched columns from another DataFrame.

Let’s create some dummy data.

import pandas as pd
from timeit import timeit
import re

taste = ['sweet','sour','sweet','bitter']
color = ['red','green','yellow','red']
secondary_color = ['green','red','brown','orange']
fruit = ['apple','pear','banana','cherry']

data = {'taste': taste, 'color': color, 'fruit': fruit, 'secondary_color': secondary_color}

df = pd.DataFrame(data)

Let’s say we are trying to select the columns that contain the world ‘color’. Here’s a solution I found on the web. By converting the column names to a pandas series and using its vectorized string operations we can filter the columns names using the contains() functions. This solution is not particularly fast: 1.12 milliseconds.

%%timeit
df[df.columns[df.columns.to_series().str.contains('color')]] # Vectorized string operations

We can do better. And when I think about speed, I think about list comprehension. Here, we use re’s search function to find a particular substring while looping over each and every column name: 730 nanoseconds.

%%timeit
df[df.columns[[bool(re.compile('color').search(x)) for x in df.columns.values]]]

But we can do even better. Apparently, pandas has a built-in solution. It’s the filter() function. By using the like parameter, we set a filter that looks for partial matches. The result: 540 nanoseconds. That’s 26% faster than the list comprehension, and 52% faster than the vectorized string operations.

%%timeit
df_colors = df.filter(like='color')

Great success!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Partial match of pandas DataFrame column name

Say thanks, ask questions or give feedback

2 thoughts on “Partial match of pandas DataFrame column name”

Leave a Reply Cancel reply

Partial match of pandas DataFrame column name

Say thanks, ask questions or give feedback

2 thoughts on “Partial match of pandas DataFrame column name”

Leave a Reply Cancel reply

Related Posts

How to do a SUMIF in PySpark

Check if Python logger already exists

Spark 3.0: Solving the “dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z” error