Solving "Found unknown categories [...] in column" with sklearn OneHotEncoder

In this short blog post, I tackle a classic problem in machine learning: how to treat unseen categorical values, and how to solve the resulting "Found unknown categories" error.

Imagine you have a train and a test data set with the following values in a column:

Train: ['green', 'red', 'blue', 'red', 'blue', 'blue']
Test: ['red', 'green', 'green', 'blue', 'yellow', 'blue']

If you fit a one-hot encoder on the train set and then transform the test set without taking precautions to handle unseen values (here, 'yellow'), scikit-learn's OneHotEncoder will raise a ValueError with the following message.

Found unknown categories […] in column…
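To make this concrete, here is a minimal sketch that should reproduce the error with the example values above. The variable names (train, test, encoder, error_message) are my own choices for illustration, and the exact wording of the message may differ slightly between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2D input: one row per sample, one column per feature
train = np.array(['green', 'red', 'blue', 'red', 'blue', 'blue']).reshape(-1, 1)
test = np.array(['red', 'green', 'green', 'blue', 'yellow', 'blue']).reshape(-1, 1)

# With default settings, unknown categories raise an error during transform
encoder = OneHotEncoder()
encoder.fit(train)

try:
    encoder.transform(test)  # 'yellow' was never seen during fit
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Note that fitting works fine. The error only appears at transform time, when the encoder meets 'yellow'.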

Why? Because when you fit your one-hot encoder on the train set, it only learns the categories that occur there, so it doesn’t know what to do when it meets a value it hasn’t seen before. Or rather, it does know what to do: raise an error. Here’s what the documentation says about the handle_unknown argument:

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise).

To make sure you do not get an error and the unseen value is simply ignored (meaning all zeroes in the binary columns green, red and blue), you need to set one particular argument: handle_unknown, which should be 'ignore'.

from sklearn.preprocessing import OneHotEncoder
OHE_model = OneHotEncoder(handle_unknown='ignore')
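As a quick check, the sketch below fits an encoder with handle_unknown='ignore' on the example train values and transforms the test values. The variable names (train, test, encoder, encoded, categories, yellow_row) are assumptions made for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2D input: one row per sample, one column per feature
train = np.array(['green', 'red', 'blue', 'red', 'blue', 'blue']).reshape(-1, 1)
test = np.array(['red', 'green', 'green', 'blue', 'yellow', 'blue']).reshape(-1, 1)

# Unknown categories are now encoded as all zeroes instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# transform returns a sparse matrix by default; convert it to a dense array
encoded = encoder.transform(test).toarray()

categories = list(encoder.categories_[0])  # learned categories, in column order
yellow_row = encoded[4]                    # the row for the unseen value 'yellow'

print(categories)
print(yellow_row)
```

The learned categories are stored in sorted order, so the columns come out as blue, green, red. The row for 'yellow' contains only zeroes, while every other row has exactly one 1.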

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

4 thoughts on “Solving “Found unknown categories […] in column” with sklearn OneHotEncoder”

Tao January 21, 2021 at 12:23 am

worked! Thx!

roelpi January 22, 2021 at 2:50 pm

You’re welcome 😉

vb_datascien February 26, 2021 at 5:19 pm

Life saver!!!!

sas December 19, 2021 at 3:14 am

Thank you, it helped! 🙂
