In this blog post, I show two quick ways to sample two pandas data frames in the same way. Why? Because sometimes you load your predictors (X) and your target (y) from two different files and you don’t want to put them together in one table.
Sampling two data frames with a different index
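In the first chunk of code, I present a simple helper function that samples two data frames in exactly the same way. It draws a random sample of row positions (integers from 0 up to, but not including, the number of rows) and uses those same positions to select rows from both data frames. This only requires that both data frames have the same number of rows in the same order; their indexes can differ.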
import numpy as np
import pandas as pd
import random

def sample_together(n, X, y):
    # Draw n distinct row positions and apply them to both data frames.
    rows = random.sample(range(len(X)), n)
    return X.iloc[rows], y.iloc[rows]

# df and target are the two data frames you loaded earlier.
df_sample, target_sample = sample_together(1000, df, target)
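To see that the rows really stay paired, here is a minimal sketch on toy data. The frames, their values, and the name `sample_by_position` are made up for illustration; the only assumption is that both frames have the same number of rows in the same order.

```python
import random

import pandas as pd

# Hypothetical toy data: X and y describe the same rows but carry different indexes.
X = pd.DataFrame({"feature": [10, 20, 30, 40, 50]}, index=list("abcde"))
y = pd.DataFrame({"label": [1, 2, 3, 4, 5]}, index=[100, 101, 102, 103, 104])

def sample_by_position(n, X, y):
    # Draw n distinct row positions, then apply the same positions to both frames.
    rows = random.sample(range(len(X)), n)
    return X.iloc[rows], y.iloc[rows]

X_sample, y_sample = sample_by_position(3, X, y)

# In this toy data feature == 10 * label, so matching rows are easy to verify.
rows_match = (X_sample["feature"].to_numpy() == 10 * y_sample["label"].to_numpy()).all()
print(rows_match)  # True
```

Because the selection is positional (`iloc`), the differing index labels never come into play.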
By the way, this is a good case for a function that accepts a collection of data frames. You can change the function so that it takes a list of one or more data frames, and it will return a sampled version of each of them, all drawn from the same rows.
def sample_together(n, args):
    # Use the first data frame to determine the number of rows.
    rows = random.sample(range(len(args[0])), n)
    return tuple(arg.iloc[rows] for arg in args)

df_sample, target_sample, target2_sample = sample_together(2, [df, target, target2])
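One possible variation, sketched here under my own assumptions, is to pass the data frames as separate arguments and to check that they all have the same number of rows before sampling. The name `sample_frames` and the toy frames are hypothetical and only serve the example.

```python
import random

import pandas as pd

def sample_frames(n, *frames):
    """Sample the same n row positions from every frame (hypothetical helper)."""
    if not frames:
        raise ValueError("pass at least one data frame")
    n_rows = len(frames[0])
    # Positional sampling only makes sense if every frame has the same row count.
    if any(len(frame) != n_rows for frame in frames):
        raise ValueError("all data frames must have the same number of rows")
    rows = random.sample(range(n_rows), n)
    return tuple(frame.iloc[rows] for frame in frames)

# Toy frames, made up for illustration; each has a different index.
a = pd.DataFrame({"a": range(6)})
b = pd.DataFrame({"b": range(6)}, index=list("uvwxyz"))
c = pd.DataFrame({"c": range(6)}, index=range(10, 16))

a_s, b_s, c_s = sample_frames(2, a, b, c)
```

The length check is a design choice, not a requirement: without it, a shorter frame would raise an `IndexError` from `iloc` only when a sampled position happens to fall outside it, which is harder to debug.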
Sampling two data frames with the same index
However, if the data frames have the same index, it’s a lot easier. In this case, you can simply use Pandas’ native sample function. Use the returned data frame’s index to slice the second data frame.
df = df.sample(1000)
# Select the matching rows by index label.
target = target.loc[df.index]
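Here is a small self-contained sketch of the same-index approach on toy data. The values are made up, the index labels are assumed to be unique, and `random_state` is only there to make the draw repeatable.

```python
import pandas as pd

# Toy data sharing one index (values are made up for illustration).
idx = pd.Index(["r1", "r2", "r3", "r4", "r5"])
df = pd.DataFrame({"feature": [10, 20, 30, 40, 50]}, index=idx)
target = pd.DataFrame({"label": [1, 2, 3, 4, 5]}, index=idx)

# random_state is optional; it only makes the sample repeatable.
df_sample = df.sample(3, random_state=0)
# .loc selects rows by label, in the order given by df_sample.index.
target_sample = target.loc[df_sample.index]

same_rows = df_sample.index.equals(target_sample.index)
print(same_rows)  # True
```

If the index contained duplicate labels, `.loc` would return every row carrying a sampled label, so the two samples could end up with different lengths.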
By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.