Having R as my data science lingua franca, some things in pandas feel less intuitive. As I’m slowly transitioning to pandas, I’ll be posting a lot of examples over the next couple of weeks. Here’s the first one: concatenating columns values of a pandas DataFrame into another (new) column.
First let’s create some dummy data. It contains tastes, colors and fruits. The goal of this example is to make a column that has a concatenation of a taste, a color and a fruit on each row. For example: ‘sweet red apple’.
import pandas as pd
from timeit import timeit
taste = ['sweet','sour','sweet','bitter']
color = ['red','green','yellow','red']
fruit = ['apple','pear','banana','cherry']
data = {'taste': taste, 'color': color, 'fruit': fruit}
df = pd.DataFrame(data)
First, let’s go with map(), which works element-wise. We apply it to the first pandas series we’d like to concatenate, and then use the + operator to concatenate each of the values to corresponding values in the other series. Since it’s element-wise, it is rather slow. — 1.5 microseconds
%%timeit
df['option1'] = df['taste'].map(str) + ' ' + df['color'] + ' ' + df['fruit']
A faster solution is by working column-wise, using the apply function (yay, something that feels like R). Although it’s faster, it feels like clunky coding. But that’s rather subjective. — 1.2 microseconds
%%timeit
df['option2'] = df.apply(lambda x: '%s %s %s' % (x['taste'], x['color'], x['fruit']), axis = 1)
When all else fails, there’s always list comprehension. It works on pandas series and it’s also superfast. — 320 nanoseconds
%%timeit
df['option3'] = ['%s %s %s' % (x, y, z) for x, y, z in zip(df['taste'], df['color'], df['fruit'])]
If you’re a fan of R’s data.table, the next solution feels the most elegant. By accessing the numpy representation of each column, we can use numpy’s string operations, which supports the + operator. — 250 nanoseconds, we have a winner!
%%timeit
df['option4'] = df['taste'].values + ' ' + df['color'].values + ' ' + df['fruit'].values
There are some variants to this last solution, but they are somewhat slower. First variant is switching .values for .to_numpy(). Which is only slightly slower, but is the recommended (and consistent) way according to the pandas documentation (0.24+).
df['option4a'] = df['taste'].to_numpy() + ' ' + df['color'].to_numpy() + ' ' + df['fruit'].to_numpy()
I’ve been using C-style string formatting a lot, so we can do that here to. To my surprise, it’s three times slower.
df['option4b'] = ('%s %s %s' % (df['taste'].values, df['color'].values, df['fruit']))
Great success!
Your point of view caught my eye and was very interesting. Thanks. I have a question for you.
Очень трендовые новинки моды.
Исчерпывающие новости всемирных подуимов.
Модные дома, торговые марки, haute couture.
Новое место для стильныех людей.
https://egomoda.ru/
Полностью трендовые новости индустрии.
Абсолютно все эвенты всемирных подуимов.
Модные дома, торговые марки, гедонизм.
Самое приятное место для трендовых хайпбистов.
https://whitesneaker.ru/
Полностью трендовые события модного мира.
Абсолютно все события известнейших подуимов.
Модные дома, торговые марки, высокая мода.
Свежее место для модных хайпбистов.
https://urban-moda.ru/