Roel Peters

Undersampling a Pandas DataFrame

In a previous post, I explained how you can sample two Pandas DataFrames in exactly the same way. In this blog post, I want to use that helper function to undersample your predictors and target variable.

When you are working with an imbalanced data set, it’s often good practice to under- or oversample your data for training your model. While there are some great Python packages to under- and oversample your datasets, none are fully built with DataFrames in mind. That’s why I wrote a simple undersample function that returns an undersampled version of your DataFrames.

For the coding, I assume two things (but feel free to tailor the code to your specific needs — and share it in the comments).

First, let’s load the helper function from the previous blog post.

import random

import numpy as np
import pandas as pd

def sample_together(n, X, y):
    # Draw n distinct row positions and take those rows from both X and y.
    rows = random.sample(np.arange(0, len(X.index)).tolist(), n)
    return X.iloc[rows], y.iloc[rows]
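
To check that the helper really keeps predictors and target aligned, here is a minimal sketch on made-up data. The DataFrames `X_demo` and `y_demo` and their column names are hypothetical and only serve as an illustration. The helper is repeated so the snippet runs on its own, written here with `range()` so it does not need NumPy.

```python
import random

import pandas as pd


def sample_together(n, X, y):
    # The helper from above, using range() for the row positions.
    rows = random.sample(range(len(X.index)), n)
    return X.iloc[rows], y.iloc[rows]


# Hypothetical toy data: five rows that share the same default index.
X_demo = pd.DataFrame({"feature": [10, 20, 30, 40, 50]})
y_demo = pd.DataFrame({"label": [0, 0, 1, 0, 1]})

random.seed(42)  # fix the seed so the draw is repeatable
X_s, y_s = sample_together(3, X_demo, y_demo)

# Both samples hold the same rows, in the same order.
print(X_s.index.tolist() == y_s.index.tolist())
```

Setting the seed is optional. It only makes the random draw reproducible between runs.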

Next, we get to the undersample function. It takes three arguments: a predictor DataFrame, a target DataFrame and the label of the minority class.

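def undersample(X, y, under=0):
    # Split the target into the minority class and everything else.
    # Note: the target column name (project_is_approved) is hard-coded.
    y_min = y[y.project_is_approved == under]
    y_max = y[y.project_is_approved != under]
    # Select the matching predictor rows by index label.
    X_min = X.filter(y_min.index, axis=0)
    X_max = X.filter(y_max.index, axis=0)

    # Draw as many majority rows as there are minority rows.
    X_under, y_under = sample_together(len(y_min.index), X_max, y_max)

    # Stack the sampled majority rows on top of all minority rows.
    X = pd.concat([X_under, X_min])
    y = pd.concat([y_under, y_min])
    return X, y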

X_train, y_train = undersample(X_train, y_train)

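What happens: the function splits the target DataFrame into the rows of the minority class (y_min) and all other rows (y_max), then selects the matching predictor rows by index. Next, sample_together draws as many rows from the majority class as there are rows in the minority class. Finally, the sampled majority rows and all minority rows are concatenated and returned.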

There you have it: a function to easily undersample a Pandas DataFrame for a binary classification problem.
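
If you want to try this on a small example first, the sketch below runs the same idea end to end on made-up data. It is not the exact function from this post: `undersample_by` is a hypothetical variant that takes the target column name as a parameter (`target_col`) instead of hard-coding it, and it selects rows with `.loc` rather than `filter`. The toy DataFrames and the `amount` column are assumptions for illustration only.

```python
import random

import pandas as pd


def sample_together(n, X, y):
    # Draw n distinct row positions and take them from both DataFrames.
    rows = random.sample(range(len(X.index)), n)
    return X.iloc[rows], y.iloc[rows]


def undersample_by(X, y, target_col, under=0):
    # Hypothetical variant of undersample(): the target column name is a
    # parameter instead of being hard-coded.
    y_min = y[y[target_col] == under]  # minority class rows
    y_max = y[y[target_col] != under]  # all remaining rows
    X_min = X.loc[y_min.index]
    X_max = X.loc[y_max.index]
    # Keep as many majority rows as there are minority rows.
    X_under, y_under = sample_together(len(y_min.index), X_max, y_max)
    return pd.concat([X_under, X_min]), pd.concat([y_under, y_min])


# Made-up imbalanced data: 2 rows with label 0 versus 8 rows with label 1.
X_toy = pd.DataFrame({"amount": range(10)})
y_toy = pd.DataFrame({"project_is_approved": [0, 1, 1, 1, 0, 1, 1, 1, 1, 1]})

random.seed(0)  # optional, makes the draw repeatable
X_bal, y_bal = undersample_by(X_toy, y_toy, "project_is_approved", under=0)

# Class counts after undersampling.
print(y_bal["project_is_approved"].value_counts().to_dict())
```

After the call, both classes have two rows each, and `X_bal` and `y_bal` still share the same index. Note that the sketch relies on X and y having matching, unique index labels, just like the function above.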

By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Great success!
