pandas

Using scikit’s OneHotEncoder only on categorical variables of a data frame

by roelpi
September 12, 2020 (updated September 13, 2020)
2 min read

I’ve been trying to build a model using machine learning today and I bumped into an error when I wanted to dummify my categorical predictors. It seemed I didn’t really know how Scikit’s OneHotEncoder worked. But I do now. And I want to share it with you. I had a…

by roelpi
September 10, 2020 (updated March 30, 2021)
2 min read

This blog post offers two quick ways to sample two pandas data frames in the same way. Why? Because sometimes you load your predictors (X) and your target (y) from two different files and you don’t want to put them together in one table. Sampling two data frames with a…
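A minimal sketch of two ways this could be done, not necessarily the ones the full post uses. The toy frames and column names are made up for illustration, and both frames are assumed to have the same length and index.

```python
import pandas as pd

# Toy stand-ins for predictors and target loaded from two separate files (assumed data).
X = pd.DataFrame({"a": range(10), "b": range(10, 20)})
y = pd.DataFrame({"target": [i % 2 for i in range(10)]})

# Option 1: the same random_state draws the same row positions,
# provided both frames have the same number of rows.
X_same_seed = X.sample(n=4, random_state=42)
y_same_seed = y.sample(n=4, random_state=42)

# Option 2: sample one frame, then select the matching index labels from the other.
X_sampled = X.sample(n=4, random_state=0)
y_sampled = y.loc[X_sampled.index]
```

Option 2 does not depend on the two frames having equal length, only on them sharing index labels.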

by roelpi
August 23, 2020 (updated March 30, 2021)
2 Comments
2 min read

The SettingWithCopyWarning message is a confusing warning to many who are new to Pandas. If you’ve ever taken a computer science course, you might be aware of passing/copying by value or by reference. Well, it very much applies to pandas DataFrames too. Let’s go. Basically, when you are slicing a…
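A small sketch of one common way to sidestep the warning, under the assumption that you want an independent subset: take an explicit copy and assign results back instead of using inplace on a slice. The data is made up, and this may differ from the fix described in the full post.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value.
df = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica"],
        "sepal_length": [5.1, np.nan, 6.3],
    }
)

# An explicit copy makes it unambiguous that `subset` is independent of `df`.
subset = df[df["species"] == "setosa"].copy()

# Assign the result back instead of calling fillna(..., inplace=True) on a slice.
subset["sepal_length"] = subset["sepal_length"].fillna(0.0)
```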

by roelpi
August 22, 2020 (updated March 30, 2021)
2 min read

This is a cool one I used for a feature engineering task I did recently. I had multiple documents in a Pandas DataFrame, in long format. These documents belonged to people in an n:1 relation: each person could have multiple documents. I was wondering how to concatenate each person’s…
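A minimal sketch of the idea, assuming a long-format frame with hypothetical columns person and text: group by person and join the strings in each group.

```python
import pandas as pd

# Hypothetical long-format data: several documents per person.
docs = pd.DataFrame(
    {
        "person": ["ann", "ann", "bob"],
        "text": ["first doc", "second doc", "only doc"],
    }
)

# Join all texts within each group, separated by a space.
combined = docs.groupby("person")["text"].agg(" ".join).reset_index()
```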

by roelpi
August 22, 2020 (updated September 3, 2021)
1 Comment
2 min read

Coming from R, I’m not a big fan of MultiIndex DataFrames in Pandas. I definitely see the merits, but it just doesn’t feel right within a machine learning and feature engineering context. However, sometimes you will end up with a MultiIndex DataFrame, after some ninja line of code. In this…
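One way this can be sketched, assuming the MultiIndex comes from a groupby with several aggregations (the data and the underscore naming scheme are illustrative choices, not taken from the post):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# Aggregating with several functions yields MultiIndex columns
# such as ("value", "mean") and ("value", "max").
agg = df.groupby("group").agg({"value": ["mean", "max"]})

# Flatten the column levels into single strings.
agg.columns = ["_".join(col) for col in agg.columns.to_flat_index()]

# Move the group labels out of the index into a regular column.
flat = agg.reset_index()
```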

by roelpi
August 2, 2020 (updated October 2, 2020)
1 Comment
3 min read

There are many ways to remove a column in a pandas DataFrame. However, some ways are better than others. In this blog post, I elaborate on multiple solutions and what the pros and cons are. First, let’s load the iris dataset from the Seaborn package on GitHub. Drop a pandas…
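A few of the usual options, sketched on a small made-up frame rather than the iris dataset so that it runs without a download; which ones the full post recommends is not shown here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9],
        "sepal_width": [3.5, 3.0],
        "species": ["setosa", "setosa"],
    }
)

# drop() returns a new frame and leaves df untouched.
without_species = df.drop(columns=["species"])

# errors="ignore" skips column names that do not exist.
tolerant = df.drop(columns=["not_there"], errors="ignore")

# del removes the column from the frame it is applied to.
trimmed = df.copy()
del trimmed["sepal_width"]
```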

by roelpi
August 1, 2020 (updated March 30, 2021)
1 Comment
2 min read

In some situations, especially when adding some basic error handling to your Python scripts, you want to check if a column exists before performing operations on it. In this blog post I tell you how. For starters, let’s load the iris dataset from the Seaborn package on GitHub. I’ll be…
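A short sketch of such a check, using a small made-up frame instead of the iris download. The `required` set is a hypothetical example of columns a script might depend on.

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

# Single column: membership test on the column index.
has_species = "species" in df.columns

# Several columns: collect the ones that are missing.
required = {"sepal_length", "petal_length"}
missing = required - set(df.columns)
```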

by roelpi
July 26, 2020 (updated August 12, 2020)
2 Comments
3 min read

Another blog post about an absolute starter subject. Calculating the number of rows of a pandas DataFrame is really simple. Yet there are many ways to do it. In this blog post, we elaborate on each and every one of the possible solutions. By looking at the codebase of pandas,…
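Three common ways, sketched on a made-up frame; the full post may cover others.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

n_len = len(df)          # length of the frame
n_shape = df.shape[0]    # first element of (rows, columns)
n_index = len(df.index)  # length of the row index
```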

by roelpi
July 25, 2020 (updated March 30, 2021)
5 Comments
2 min read

Apparently, this is something that many (even experienced) data scientists still google. Sometimes you’re dealing with a comma-separated value file that has no header. In this blog post I explain how to deal with this when you’re loading these files with pandas in Python. The read_csv function in pandas is…
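A minimal sketch with an in-memory string standing in for a headerless file. The column names passed to `names` are made up for illustration.

```python
import io

import pandas as pd

# Two data rows, no header line.
raw = "5.1,3.5,setosa\n4.9,3.0,setosa\n"

# header=None keeps the first line as data; columns are numbered 0, 1, 2.
numbered = pd.read_csv(io.StringIO(raw), header=None)

# names= supplies column labels of your own choosing.
named = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["sepal_length", "sepal_width", "species"],
)
```

Without header=None, read_csv treats the first line as the header by default, so the first data row would end up as column names.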

by roelpi
July 16, 2020 (updated March 30, 2021)
2 Comments
3 min read

It’s an operation I use quite a lot, but I never took the time to find out which solution is fastest. That’s why I decided to dedicate this blog post to how to concatenate columns in pandas (Python) and to the execution speed of each approach. I’ve been working with the “donors” dataset from…
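Three ways to concatenate two string columns, sketched on made-up data. This only shows that they give the same result; it makes no claim about which is fastest, and the post's own candidates may differ.

```python
import pandas as pd

df = pd.DataFrame(
    {"first_name": ["Ada", "Alan"], "last_name": ["Lovelace", "Turing"]}
)

# Plain operator on the two Series.
with_plus = df["first_name"] + " " + df["last_name"]

# Series.str.cat with a separator.
with_cat = df["first_name"].str.cat(df["last_name"], sep=" ")

# Row-wise join over the selected columns.
with_agg = df[["first_name", "last_name"]].agg(" ".join, axis=1)
```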

Sample two Pandas data frames in the same way

How to solve SettingWithCopyWarning when using the ‘inplace’ parameter in pandas

How to concatenate text as aggregation in a Pandas groupby

How to flatten a MultiIndex Pandas DataFrame

How to: pandas – drop column

How to check if a column exists in a pandas DataFrame

Get the number of rows of a pandas DataFrame

Reading a CSV without header in pandas properly

Concatenate columns in pandas (as fast as possible)