Sample two Pandas data frames in the same way

In this blog post, I show two quick ways to sample two pandas data frames in the same way. Why? Because sometimes you load your predictors (X) and your target (y) from two different files and you don’t want to put them together in one table.

Sampling two data frames with a different index

In the first chunk of code, I present a simple helper function that samples two data frames in exactly the same way. It draws a random sample of row positions (from 0 up to, but not including, the number of rows) and uses those positions to select rows from both data frames. This only works if both data frames have the same number of rows, in the same order.

import random

import pandas as pd

def sample_together(n, X, y):
    # Draw n distinct row positions and use them for both data frames
    rows = random.sample(range(len(X)), n)
    return X.iloc[rows], y.iloc[rows]

df_sample, target_sample = sample_together(1000, df, target)
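To check that the two samples really line up, here is a minimal sketch with toy data. The frame contents, the column names and the fixed seed are my own assumptions for illustration; the helper is the same idea as above.

```python
import random

import pandas as pd

def sample_together(n, X, y):
    # Draw n distinct row positions, then apply them to both frames
    rows = random.sample(range(len(X)), n)
    return X.iloc[rows], y.iloc[rows]

# Toy data (made up): the two frames deliberately have different indexes
df = pd.DataFrame({"feature": range(10)}, index=list("abcdefghij"))
target = pd.DataFrame({"label": [v * 2 for v in range(10)]})

random.seed(42)  # assumption: fixed seed so the draw is repeatable
df_sample, target_sample = sample_together(4, df, target)

# Every sampled label should still be twice its feature value
aligned = bool(
    (target_sample["label"].to_numpy() == 2 * df_sample["feature"].to_numpy()).all()
)
print(aligned)
```

Because the rows are picked by position with iloc, the differing indexes don't matter here.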

By the way, this is a good case for a function that accepts any number of data frames. You can change the function so that it takes a list of one or more data frames and returns the same number of data frames, all sampled in the same way.

def sample_together(n, args):
    # Row positions are based on the first data frame; all of them need the same number of rows
    rows = random.sample(range(len(args[0])), n)
    return tuple(arg.iloc[rows] for arg in args)

df_sample, target_sample, target2_sample = sample_together(2, [df, target, target2])
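A possible variation, assuming you would rather pass the data frames as separate arguments than as a list: Python's *args syntax collects them for you. The row-count check and the toy data below are my additions, not part of the original helper.

```python
import random

import pandas as pd

def sample_together(n, *frames):
    # Positional sampling only makes sense if every frame has the same length
    lengths = {len(frame) for frame in frames}
    if len(lengths) != 1:
        raise ValueError("all data frames must have the same number of rows")
    rows = random.sample(range(lengths.pop()), n)
    return tuple(frame.iloc[rows] for frame in frames)

# Toy data (made up for illustration); a Series works as well as a DataFrame
df = pd.DataFrame({"a": range(5)})
target = pd.DataFrame({"b": range(5)})
target2 = pd.Series(range(5), name="c")

df_sample, target_sample, target2_sample = sample_together(2, df, target, target2)
```

The check is cheap and catches the case where one file was loaded with a missing or extra row.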

Sampling two data frames with the same index

However, if the data frames have the same index, it’s a lot easier. In this case, you can simply use Pandas’ native sample function. Use the returned data frame’s index to slice the second data frame.

df = df.sample(1000)
target = target.loc[df.index]
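If you want the draw to be repeatable, sample accepts a random_state argument. A small sketch with made-up data (the column names and values are assumptions), using .loc to pick the matching labels from the second object:

```python
import pandas as pd

# Toy data (made up): both objects share the same index
df = pd.DataFrame({"feature": [1, 2, 3, 4, 5]}, index=list("abcde"))
target = pd.Series([10, 20, 30, 40, 50], index=list("abcde"), name="label")

# random_state fixes the draw so the same rows come back on every run
df_sample = df.sample(n=3, random_state=0)

# Select the same index labels, in the same order, from the target
target_sample = target.loc[df_sample.index]
```

Note that .loc looks rows up by label, so this relies on the index values being unique in both objects.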

Great success!
