Sample two Pandas data frames in the same way

This blog post presents two quick solutions for sampling two pandas data frames in the same way. Why? Because sometimes you load your predictors (X) and your target (y) from two different files and you don’t want to put them together in one table.

Sampling two data frames with a different index

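The first chunk of code is a simple helper function that samples two data frames in exactly the same way. It draws a random sample of row positions, without replacement, from the range 0 up to (but not including) the number of rows, and uses those positions to select rows from both data frames. Because the selection is positional (iloc), the two data frames need the same number of rows in the same order, but their indexes may differ.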

import numpy as np
import pandas as pd
import random

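def sample_together(n, X, y):
    # Draw n distinct row positions (not index labels) without replacement
    rows = random.sample(range(len(X.index)), n)
    # Select the same positions from both data frames
    return X.iloc[rows], y.iloc[rows]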
  
df_sample, target_sample = sample_together(1000, df, target)
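If you want the sample to be reproducible, one possible variation is to use a seeded NumPy random generator instead of the random module. The sketch below is an illustration, not part of the original helper: the name sample_together_seeded, the seed parameter, the length check and the toy data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

def sample_together_seeded(n, X, y, seed=None):
    """Hypothetical variant: draw the same n row positions from X and y."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    rng = np.random.default_rng(seed)
    # Positions, not labels, so differing indexes do not matter
    rows = rng.choice(len(X), size=n, replace=False)
    return X.iloc[rows], y.iloc[rows]

# Toy data with deliberately different indexes
X_demo = pd.DataFrame({"a": range(10), "b": range(10, 20)}, index=list("abcdefghij"))
y_demo = pd.Series(range(100, 110), index=range(50, 60), name="target")

X_s, y_s = sample_together_seeded(4, X_demo, y_demo, seed=42)
```

Calling the function twice with the same seed should return the same rows, which helps when you want to rerun an experiment on an identical sample.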

By the way, this is a good case for a function that accepts an iterable of data frames. You can change the function so that it takes a list of one or more data frames. It returns a tuple with as many sampled data frames as you passed in, all sampled in the same way.

def sample_together(n, args):
    # Use the first data frame to get the number of rows;
    # all data frames are assumed to have the same length
    rows = random.sample(range(len(args[0].index)), n)
    return tuple(arg.iloc[rows] for arg in args)

df_sample, target_sample, target2_sample = sample_together(2, [df, target, target2])
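As a rough sketch of the same idea with Python's variadic arguments, the data frames can also be passed directly rather than wrapped in a list. The function name sample_many, the length check and the toy data below are assumptions for illustration only.

```python
import random
import pandas as pd

def sample_many(n, *frames):
    """Hypothetical variadic variant: sample the same row positions from every frame."""
    if not frames:
        raise ValueError("pass at least one data frame")
    n_rows = len(frames[0])
    if any(len(f) != n_rows for f in frames):
        raise ValueError("all data frames must have the same number of rows")
    # One shared draw of positions, reused for every frame
    rows = random.sample(range(n_rows), n)
    return tuple(f.iloc[rows] for f in frames)

# Toy data: three objects with the same length but different indexes
a = pd.DataFrame({"x": range(6)})
b = pd.Series(range(10, 16), name="y")
c = pd.DataFrame({"z": range(20, 26)}, index=list("uvwxyz"))

a_s, b_s, c_s = sample_many(3, a, b, c)
```

The length check is a design choice: positional sampling only lines up if every data frame has the same number of rows in the same order.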

Sampling two data frames with the same index

However, if the data frames have the same index, it’s a lot easier. In this case, you can simply use Pandas’ native sample function. Use the returned data frame’s index to slice the second data frame.

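# Sample 1000 rows; the sampled data frame keeps its original index labels
df = df.sample(1000)
# Select the matching rows from the second data frame by label
target = target.loc[df.index]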
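Here is a small self-contained sketch of the same-index approach, using label-based selection with .loc and the random_state argument of sample for reproducibility. The variable names and the toy data are made up for this example.

```python
import pandas as pd

# Toy data: both objects share the same index labels
features = pd.DataFrame({"a": range(8)}, index=[f"id{i}" for i in range(8)])
labels = pd.Series(range(100, 108), index=features.index, name="target")

# Sample the first object, then look up the same labels in the second
features_sample = features.sample(n=3, random_state=0)
labels_sample = labels.loc[features_sample.index]
```

Note that this relies on the index labels being unique; with duplicate labels, .loc would return every matching row.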

By the way, I didn’t come up with this solution entirely by myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) was updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Great success!
