In a previous post, I explained how you can sample two Pandas DataFrames in exactly the same way. In this blog post, I want to use that helper function to undersample your predictors and target variable.
When you are working with an imbalanced data set, it’s often good practice to under- or oversample your data for training your model. While there are some great Python packages to under- and oversample your datasets, none are fully built with DataFrames in mind. That’s why I wrote a simple undersample function that returns an undersampled version of your DataFrames.
For the coding, I assume two things (but feel free to tailor the code to your specific needs — and share it in the comments).
- you have your predictors and target variables in separate data frames.
- you are working on a binary classification problem.
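To make those two assumptions concrete, here is a minimal sketch of the kind of data the rest of the post expects. The values and the feature_a and feature_b columns are made up for illustration; only the project_is_approved column name comes from the code further down.

```python
import pandas as pd

# Toy data, invented for this example: 8 rows, 2 of them in the minority class (0).
X_train = pd.DataFrame({"feature_a": range(8), "feature_b": list("abcdefgh")})
y_train = pd.DataFrame({"project_is_approved": [1, 1, 0, 1, 1, 1, 0, 1]})

# Predictors and target live in separate frames but share the same index,
# so rows can be matched up by index label later on.
assert X_train.index.equals(y_train.index)

# The class counts show how imbalanced the binary target is.
counts = y_train["project_is_approved"].value_counts()
print(counts)
```

If the two frames do not share an index, matching rows by label will not work, so it is worth checking this before undersampling.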
First, let’s load the helper function from the previous blog post.
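import pandas as pd
import random

def sample_together(n, X, y):
    # Draw n distinct row positions and apply them to both DataFrames,
    # so X and y are sampled in exactly the same way.
    rows = random.sample(range(len(X.index)), n)
    return X.iloc[rows], y.iloc[rows]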
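A quick way to convince yourself the helper keeps both DataFrames aligned is to run it on toy data. The block below repeats the helper (written with range so it needs no extra imports) so that it runs on its own; the data and the seed are made up for the demo.

```python
import random

import pandas as pd


def sample_together(n, X, y):
    # Copy of the helper: draw n row positions and apply them to both frames.
    rows = random.sample(range(len(X.index)), n)
    return X.iloc[rows], y.iloc[rows]


# Toy data with a non-numeric index, to show that the index labels travel along.
labels = list("abcdefghij")
X = pd.DataFrame({"feature_a": range(10)}, index=labels)
y = pd.DataFrame({"project_is_approved": [1, 0] * 5}, index=labels)

random.seed(0)  # only here to make the demo repeatable
X_s, y_s = sample_together(3, X, y)

# Both samples contain the same rows, in the same order.
assert X_s.index.equals(y_s.index)
print(X_s.index.tolist())
```

Because random.sample draws without replacement, n cannot be larger than the number of rows in X.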
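Next, we get to the undersample function. It takes three arguments: a predictor DataFrame, a target DataFrame and the label of the minority class (0 by default). Note that the name of the target column, project_is_approved, is hardcoded, so change it to match your own data.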
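def undersample(X, y, under=0):
    # Split the target by class; `under` is the label of the minority class.
    y_min = y[y.project_is_approved == under]
    y_max = y[y.project_is_approved != under]

    # Select the matching predictor rows by index label.
    X_min = X.filter(y_min.index, axis=0)
    X_max = X.filter(y_max.index, axis=0)

    # Sample as many majority rows as there are minority rows.
    X_under, y_under = sample_together(len(y_min.index), X_max, y_max)

    # Combine the sampled majority rows with all minority rows.
    X = pd.concat([X_under, X_min])
    y = pd.concat([y_under, y_min])
    return X, y

X_train, y_train = undersample(X_train, y_train)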
- Both DataFrames are split in two: one part for the majority class and one for the minority class.
- The sample_together function then samples the majority class, with the sample size set to the number of rows in the minority class. It returns the resampled DataFrames for the majority class.
- I concatenate the DataFrames of the minority and the majority class and return them.
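The three steps above can also be sketched as a compact, self-contained variant that you can run on toy data to check the result is balanced. This is not the function from this post: undersample_sketch and its target_col parameter are hypothetical names introduced here, it selects rows with .loc instead of filter, and it assumes both DataFrames share a unique index.

```python
import random

import pandas as pd


def undersample_sketch(X, y, target_col, under=0):
    # Step 1: split the index labels by class.
    min_idx = y.index[y[target_col] == under]
    maj_idx = y.index[y[target_col] != under]

    # Step 2: sample as many majority rows as there are minority rows.
    keep_maj = random.sample(list(maj_idx), len(min_idx))

    # Step 3: combine the sampled majority rows with all minority rows.
    keep = keep_maj + list(min_idx)
    return X.loc[keep], y.loc[keep]


# Toy data, invented for this example: 7 approved (1) versus 3 rejected (0).
X_train = pd.DataFrame({"feature_a": range(10)})
y_train = pd.DataFrame({"project_is_approved": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]})

X_bal, y_bal = undersample_sketch(X_train, y_train, "project_is_approved")

# After undersampling, both classes have the same number of rows.
counts = y_bal["project_is_approved"].value_counts()
print(counts)
```

Passing the target column as an argument avoids the hardcoded column name, at the cost of one extra parameter. The rows come back grouped by class, so you may want to shuffle them before training.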
There you have it: a function to easily undersample a Pandas DataFrame for a binary classification problem.