Here’s another post about seemingly convenient things in pandas. Let’s produce a DataFrame from partially matched columns from another DataFrame.
Let’s create some dummy data.
import pandas as pd
from timeit import timeit
import re
taste = ['sweet','sour','sweet','bitter']
color = ['red','green','yellow','red']
secondary_color = ['green','red','brown','orange']
fruit = ['apple','pear','banana','cherry']
data = {'taste': taste, 'color': color, 'fruit': fruit, 'secondary_color': secondary_color}
df = pd.DataFrame(data)
Let’s say we are trying to select the columns that contain the world ‘color’. Here’s a solution I found on the web. By converting the column names to a pandas series and using its vectorized string operations we can filter the columns names using the contains() functions. This solution is not particularly fast: 1.12 milliseconds.
%%timeit
df[df.columns[df.columns.to_series().str.contains('color')]] # Vectorized string operations
We can do better. And when I think about speed, I think about list comprehension. Here, we use re’s search function to find a particular substring while looping over each and every column name: 730 nanoseconds.
%%timeit
df[df.columns[[bool(re.compile('color').search(x)) for x in df.columns.values]]]
But we can do even better. Apparently, pandas has a built-in solution. It’s the filter() function. By using the like parameter, we set a filter that looks for partial matches. The result: 540 nanoseconds. That’s 26% faster than the list comprehension, and 52% faster than the vectorized string operations.
%%timeit
df_colors = df.filter(like='color')
Great success!
MERTYHR955465MAVNGHJTH
Your article helped me a lot, is there any more related content? Thanks!
Thank you for your sharing. I am worried that I lack creative ideas. It is your article that makes me full of hope. Thank you. But, I have a question, can you help me?