Home » Concatenating column values in pandas

Concatenating column values in pandas

  • by
  • 3 min read

Having R as my data science lingua franca, some things in pandas feel less intuitive. As I’m slowly transitioning to pandas, I’ll be posting a lot of examples over the next couple of weeks. Here’s the first one: concatenating columns values of a pandas DataFrame into another (new) column.

First let’s create some dummy data. It contains tastes, colors and fruits. The goal of this example is to make a column that has a concatenation of a taste, a color and a fruit on each row. For example: ‘sweet red apple’.

import pandas as pd
from timeit import timeit

taste = ['sweet','sour','sweet','bitter']
color = ['red','green','yellow','red']
fruit = ['apple','pear','banana','cherry']

data = {'taste': taste, 'color': color, 'fruit': fruit}
df = pd.DataFrame(data)

First, let’s go with map(), which works element-wise. We apply it to the first pandas series we’d like to concatenate, and then use the + operator to concatenate each of the values to corresponding values in the other series. Since it’s element-wise, it is rather slow. — 1.5 microseconds

df['option1'] = df['taste'].map(str) + ' ' + df['color'] + ' ' + df['fruit']

A faster solution is by working column-wise, using the apply function (yay, something that feels like R). Although it’s faster, it feels like clunky coding. But that’s rather subjective. — 1.2 microseconds

df['option2'] = df.apply(lambda x: '%s %s %s' % (x['taste'], x['color'], x['fruit']), axis = 1)

When all else fails, there’s always list comprehension. It works on pandas series and it’s also superfast. — 320 nanoseconds

df['option3'] = ['%s %s %s' % (x, y, z) for x, y, z in zip(df['taste'], df['color'], df['fruit'])]

If you’re a fan of R’s data.table, the next solution feels the most elegant. By accessing the numpy representation of each column, we can use numpy’s string operations, which supports the + operator. — 250 nanoseconds, we have a winner!

df['option4'] = df['taste'].values + ' ' + df['color'].values + ' ' + df['fruit'].values

There are some variants to this last solution, but they are somewhat slower. First variant is switching .values for .to_numpy(). Which is only slightly slower, but is the recommended (and consistent) way according to the pandas documentation (0.24+).

df['option4a'] = df['taste'].to_numpy() + ' ' + df['color'].to_numpy() + ' ' + df['fruit'].to_numpy()

I’ve been using C-style string formatting a lot, so we can do that here to. To my surprise, it’s three times slower.

df['option4b'] = ('%s %s %s' % (df['taste'].values, df['color'].values, df['fruit']))

Great success!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published. Required fields are marked *