Replacing multiple values in a pandas DataFrame column

Without going into detail, here’s something I truly hate in R: replacing multiple values. In Python’s pandas, it’s really easy. In this blog post I try several methods: list comprehension, apply(), replace() and map().

First, let’s create some dummy data.

import pandas as pd
from timeit import timeit
import re

taste = ['sweet','sour','sweet','bitter'] * 1000
color = ['red','green','yellow','red'] * 1000
fruit = ['apple','pear','banana','cherry'] * 1000

data = {'taste': taste, 'color': color, 'fruit': fruit}

df = pd.DataFrame(data)

val = {'red':'vermillion','green':'emerald'}

Let’s start with a list comprehension. For each color x, val.get(x, x) looks x up in my dictionary and returns the corresponding value. The second x is the default: if the key cannot be found, the original value is kept. Timing: 6.8 milliseconds.

pd.Series([val.get(x,x) for x in df['color']])
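
To see the fallback in isolation, here is a minimal sketch on three values. The list name colors is only there for illustration and is not part of the data above.

```python
val = {'red': 'vermillion', 'green': 'emerald'}
colors = ['red', 'green', 'yellow']  # hypothetical sample values

# dict.get(key, default): 'yellow' is not a key in val,
# so the default (the value itself) is returned unchanged.
replaced = [val.get(x, x) for x in colors]
print(replaced)  # ['vermillion', 'emerald', 'yellow']
```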

Of course, we can also use apply() with the same lookup. It turns out to be rather slow: 10.8 milliseconds.

df['color'].apply(lambda x: val.get(x,x))

But wait, pandas has native functions for this purpose. replace() is definitely the most elegant way, but it’s not very fast either. If you go through its source code, you’ll see that the function involves a lot of conversions. Timing: 10.1 milliseconds.

df['color'].replace(val)
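
A quick sanity check that the three approaches so far produce the same values. This is a sketch on a small hypothetical Series s rather than the full DataFrame, and the variable names are my own.

```python
import pandas as pd

val = {'red': 'vermillion', 'green': 'emerald'}
s = pd.Series(['red', 'green', 'yellow', 'red'])  # hypothetical sample

by_comprehension = [val.get(x, x) for x in s]
by_apply = s.apply(lambda x: val.get(x, x)).tolist()
by_replace = s.replace(val).tolist()

# All three leave 'yellow' untouched and swap the other colors.
same = by_comprehension == by_apply == by_replace
print(same)
```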

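map() is faster than replace(), and its source code is a lot easier to follow. To my surprise, it’s also faster than apply() and matches the list comprehension at 6.8 milliseconds. On speed, we have a winner. One caveat: with a dictionary, map() returns NaN for values that have no key (here 'yellow'), and na_action='ignore' does not change that, so fill those gaps with the original values if you need the same output as the other methods.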

df['color'].map(val, na_action = 'ignore')
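
A caveat worth checking before swapping map() in: when given a dictionary, it returns NaN for every value that has no matching key, and na_action='ignore' only skips values that are already NaN. The sketch below uses a small hypothetical Series s and shows one way to restore the unmapped values with fillna(). The extra step adds some overhead that is not included in the timing above.

```python
import pandas as pd

val = {'red': 'vermillion', 'green': 'emerald'}
s = pd.Series(['red', 'green', 'yellow', 'red'])  # hypothetical sample

mapped = s.map(val)          # 'yellow' has no key in val, so it becomes NaN
restored = mapped.fillna(s)  # fall back to the original value where map() found nothing

print(mapped.isna().tolist())  # [False, False, True, False]
print(restored.tolist())       # ['vermillion', 'emerald', 'yellow', 'vermillion']
```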

Great success!
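
The measuring code is not shown in this post, so here is a sketch of how timings like these could be collected with timeit. The repeat count runs and the methods dictionary are my own assumptions, and the numbers will differ per machine and pandas version.

```python
import pandas as pd
from timeit import timeit

df = pd.DataFrame({'color': ['red', 'green', 'yellow', 'red'] * 1000})
val = {'red': 'vermillion', 'green': 'emerald'}

# The four candidates from this post, keyed by a label for printing.
methods = {
    'list comprehension': lambda: pd.Series([val.get(x, x) for x in df['color']]),
    'apply': lambda: df['color'].apply(lambda x: val.get(x, x)),
    'replace': lambda: df['color'].replace(val),
    'map': lambda: df['color'].map(val, na_action='ignore'),
}

runs = 100  # arbitrary repeat count (assumption)
timings = {}
for name, fn in methods.items():
    # timeit returns the total seconds for `runs` calls; convert to ms per call.
    timings[name] = timeit(fn, number=runs) / runs * 1000
    print(f'{name}: {timings[name]:.2f} ms')
```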

