Concatenating column values in pandas

Having R as my data science lingua franca, some things in pandas feel less intuitive. As I’m slowly transitioning to pandas, I’ll be posting a lot of examples over the next couple of weeks. Here’s the first one: concatenating columns values of a pandas DataFrame into another (new) column.

First let’s create some dummy data. It contains tastes, colors and fruits. The goal of this example is to make a column that has a concatenation of a taste, a color and a fruit on each row. For example: ‘sweet red apple’.

import pandas as pd
from timeit import timeit

taste = ['sweet','sour','sweet','bitter']
color = ['red','green','yellow','red']
fruit = ['apple','pear','banana','cherry']

data = {'taste': taste, 'color': color, 'fruit': fruit}
df = pd.DataFrame(data)

First, let’s go with map(), which works element-wise. We apply it to the first pandas series we’d like to concatenate, and then use the + operator to concatenate each of the values to corresponding values in the other series. Since it’s element-wise, it is rather slow. — 1.5 microseconds

%%timeit
df['option1'] = df['taste'].map(str) + ' ' + df['color'] + ' ' + df['fruit']

A faster solution is by working column-wise, using the apply function (yay, something that feels like R). Although it’s faster, it feels like clunky coding. But that’s rather subjective. — 1.2 microseconds

%%timeit
df['option2'] = df.apply(lambda x: '%s %s %s' % (x['taste'], x['color'], x['fruit']), axis = 1)

When all else fails, there’s always list comprehension. It works on pandas series and it’s also superfast. — 320 nanoseconds

%%timeit
df['option3'] = ['%s %s %s' % (x, y, z) for x, y, z in zip(df['taste'], df['color'], df['fruit'])]

If you’re a fan of R’s data.table, the next solution feels the most elegant. By accessing the numpy representation of each column, we can use numpy’s string operations, which supports the + operator. — 250 nanoseconds, we have a winner!

%%timeit
df['option4'] = df['taste'].values + ' ' + df['color'].values + ' ' + df['fruit'].values

There are some variants to this last solution, but they are somewhat slower. First variant is switching .values for .to_numpy(). Which is only slightly slower, but is the recommended (and consistent) way according to the pandas documentation (0.24+).

df['option4a'] = df['taste'].to_numpy() + ' ' + df['color'].to_numpy() + ' ' + df['fruit'].to_numpy()

I’ve been using C-style string formatting a lot, so we can do that here to. To my surprise, it’s three times slower.

df['option4b'] = ('%s %s %s' % (df['taste'].values, df['color'].values, df['fruit']))

Great success!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

4 thoughts on “Concatenating column values in pandas”

binance registrering March 27, 2024 at 10:08 am

Your point of view caught my eye and was very interesting. Thanks. I have a question for you.

LeslieMULGE June 15, 2024 at 2:34 am

Очень трендовые новинки моды.
Исчерпывающие новости всемирных подуимов.
Модные дома, торговые марки, haute couture.
Новое место для стильныех людей.
https://egomoda.ru/

ZofiaPaymn June 16, 2024 at 10:50 pm

Полностью трендовые новости индустрии.
Абсолютно все эвенты всемирных подуимов.
Модные дома, торговые марки, гедонизм.
Самое приятное место для трендовых хайпбистов.
https://whitesneaker.ru/

DavidDuend June 20, 2024 at 11:21 pm

Полностью трендовые события модного мира.
Абсолютно все события известнейших подуимов.
Модные дома, торговые марки, высокая мода.
Свежее место для модных хайпбистов.
https://urban-moda.ru/

Concatenating column values in pandas

Say thanks, ask questions or give feedback

4 thoughts on “Concatenating column values in pandas”

Leave a Reply Cancel reply

Related Posts

How to do a SUMIF in PySpark

Check if Python logger already exists

Spark 3.0: Solving the “dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z” error