Undersampling a Pandas DataFrame

In a previous post, I explained how you can sample two Pandas DataFrame exactly the same way. In this blog post, I want to use that helper function to undersample your predictors and target variable.

When you are working with an imbalanced data set, it’s often good practice to under- or oversample your data for training your model. While there are some great Python packages to under- and oversample your datasets, none are fully built with DataFrames in mind. That’s why I wrote a simple undersample function that returns an undersampled version of your DataFrames.

For the coding, I assume two things (but feel free to tailor the code to your specific needs — and share it in the comments).

you have your predictors and target variables in separate data frames.
you are working on a binary classification problem.

First, let’s load the helper function from the previous blog post.

import pandas as pd
import random

def sample_together(n, X, y):
    rows = random.sample(np.arange(0,len(X.index)).tolist(),n)
    return X.iloc[rows,], y.iloc[rows,]

Next, we get to the undersample function. It takes three arguments: a predictor DataFrame, a target DataFrame and the label of the minority class.

def undersample(X, y, under = 0):
    y_min = y[y.project_is_approved == under]
    y_max = y[y.project_is_approved != under]
    X_min = X.filter(y_min.index,axis = 0)
    X_max = X.filter(y_max.index,axis = 0)

    X_under, y_under = sample_together(len(y_min.index), X_max, y_max)
    
    X = pd.concat([X_under, X_min])
    y = pd.concat([y_under, y_min])
    return X, y

X_train, y_train = undersample(X_train, y_train)

What happens:

Both DataFrame get split in two: one for the majority and one for the minority class.
The sample_together function is used and the sample size of the majority class is set to the minority class sample size. The resampled DataFrames for the majority class are returned.
I union the DataFrames of the minority and the majority class and return them.

There you have it: a function to easily undersample a Pandas DataFrame for a binary classification problem.

By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Great success!

Undersampling a Pandas DataFrame

Say thanks, ask questions or give feedback

Say thanks, ask questions or give feedback

Leave a Reply Cancel reply

Related Posts

How to do a SUMIF in PySpark

Check if Python logger already exists

Spark 3.0: Solving the “dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z” error