Home ยป Undersampling a Pandas DataFrame

Undersampling a Pandas DataFrame

  • by
  • 2 min read

In a previous post, I explained how you can sample two Pandas DataFrame exactly the same way. In this blog post, I want to use that helper function to undersample your predictors and target variable.

When you are working with an imbalanced data set, it’s often good practice to under- or oversample your data for training your model. While there are some great Python packages to under- and oversample your datasets, none are fully built with DataFrames in mind. That’s why I wrote a simple undersample function that returns an undersampled version of your DataFrames.

For the coding, I assume two things (but feel free to tailor the code to your specific needs — and share it in the comments).

  • you have your predictors and target variables in separate data frames.
  • you are working on a binary classification problem.

First, let’s load the helper function from the previous blog post.

import pandas as pd
import random

def sample_together(n, X, y):
    rows = random.sample(np.arange(0,len(X.index)).tolist(),n)
    return X.iloc[rows,], y.iloc[rows,]

Next, we get to the undersample function. It takes three arguments: a predictor DataFrame, a target DataFrame and the label of the minority class.

def undersample(X, y, under = 0):
    y_min = y[y.project_is_approved == under]
    y_max = y[y.project_is_approved != under]
    X_min = X.filter(y_min.index,axis = 0)
    X_max = X.filter(y_max.index,axis = 0)

    X_under, y_under = sample_together(len(y_min.index), X_max, y_max)
    
    X = pd.concat([X_under, X_min])
    y = pd.concat([y_under, y_min])
    return X, y

X_train, y_train = undersample(X_train, y_train)

What happens:

  • Both DataFrame get split in two: one for the majority and one for the minority class.
  • The sample_together function is used and the sample size of the majority class is set to the minority class sample size. The resampled DataFrames for the majority class are returned.
  • I union the DataFrames of the minority and the majority class and return them.

There you have it: a function to easily undersample a Pandas DataFrame for a binary classification problem.

Great success!

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Leave a Reply

Your email address will not be published. Required fields are marked *