Without going into detail, here’s something I truly hate in R: replacing multiple values in a column. In Python’s pandas, it’s really easy. In this blog post I try four methods on a 4,000-row column and time each one: a list comprehension, apply(), replace() and map().
First, let’s create some dummy data.
import pandas as pd
from timeit import timeit

# Three columns of 4,000 rows each (a 4-item pattern repeated 1,000 times)
taste = ['sweet','sour','sweet','bitter'] * 1000
color = ['red','green','yellow','red'] * 1000
fruit = ['apple','pear','banana','cherry'] * 1000
data = {'taste': taste, 'color': color, 'fruit': fruit}
df = pd.DataFrame(data)

# Replacement mapping; 'yellow' has no entry, so it should stay unchanged
val = {'red':'vermillion','green':'emerald'}
Let’s start with a list comprehension. The dictionary’s get() method looks up each color (the first x) and returns the corresponding value. The second x is the default that get() returns when the key cannot be found, so colors that are not in the dictionary stay as they are. This takes 6.8 milliseconds.
pd.Series([val.get(x,x) for x in df['color']])
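For anyone who wants to reproduce the numbers, here is a minimal sketch of how such a timing could be taken with timeit. The helper name with_list_comprehension and the choice of 100 runs are my assumptions for illustration, and the result will differ by machine and pandas version.

```python
from timeit import timeit

import pandas as pd

# Same column as in the post: 4,000 rows, two colors to replace
color = ['red', 'green', 'yellow', 'red'] * 1000
df = pd.DataFrame({'color': color})
val = {'red': 'vermillion', 'green': 'emerald'}

def with_list_comprehension():
    # Hypothetical wrapper so timeit has a callable to run
    return pd.Series([val.get(x, x) for x in df['color']])

# runs=100 is an arbitrary choice; dividing gives the average seconds per call
runs = 100
seconds_per_call = timeit(with_list_comprehension, number=runs) / runs
print(f"list comprehension: {seconds_per_call * 1000:.2f} ms per call")
```

Wrapping each of the other methods in its own small function and timing it the same way keeps the comparison fair.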
Of course, we can also use apply() with a lambda that does the same lookup. It turns out to be rather slow: 10.8 milliseconds.
df['color'].apply(lambda x: val.get(x,x))
But wait, pandas has native functions for this purpose. replace() definitely seems to be the most elegant way, but it’s not very fast either: 10.1 milliseconds. If you go through its source code, you’ll see that this function involves a lot of conversions.
df['color'].replace(val)
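A quick sanity check, sketched on a four-row Series rather than the full data frame, suggests that replace() with a dictionary gives the same values as the list comprehension: matched colors are swapped and everything else is left alone.

```python
import pandas as pd

color = pd.Series(['red', 'green', 'yellow', 'red'])
val = {'red': 'vermillion', 'green': 'emerald'}

# Dictionary keys are the values to find, dictionary values are the replacements
replaced = color.replace(val)

# Same lookup done by hand, for comparison
by_hand = pd.Series([val.get(x, x) for x in color])

print(replaced.tolist())  # ['vermillion', 'emerald', 'yellow', 'vermillion']
print(replaced.tolist() == by_hand.tolist())  # True
```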
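map() is faster than replace(), and its source code is a lot easier to follow. To my surprise, it’s also faster than apply(). It ties with the list comprehension at 6.8 milliseconds, so we have a winner. Keep in mind that the result is not identical to the other methods: colors that are missing from the dictionary (here, yellow) come back as NaN instead of unchanged, and na_action='ignore' only skips values that are already NaN.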
df['color'].map(val, na_action = 'ignore')
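Here is a small sketch of how map() treats values that have no key in the dictionary, plus one possible way to keep the original values by chaining fillna(). The fillna() step is my addition and is not included in the timing above, so expect it to cost a little extra.

```python
import pandas as pd

color = pd.Series(['red', 'green', 'yellow', 'red'])
val = {'red': 'vermillion', 'green': 'emerald'}

# 'yellow' is not a key in val, so map() returns a missing value for it
mapped = color.map(val)

# Fall back to the original color wherever the mapping produced a missing value
filled = color.map(val).fillna(color)

print(mapped.isna().tolist())  # [False, False, True, False]
print(filled.tolist())         # ['vermillion', 'emerald', 'yellow', 'vermillion']
```

If every value in the column is guaranteed to have a key in the dictionary, the plain map() call is enough.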
Great success!