Solving “Found unknown categories […] in column” with sklearn OneHotEncoder

In this short blog post, I tackle a classic problem in machine learning: how to handle categorical values that were not seen during training, and how to solve the “Found unknown categories” error they cause.

Imagine you have a train and a test data set with the following values in a column:

  • Train: ['green', 'red', 'blue', 'red', 'blue', 'blue']
  • Test: ['red', 'green', 'green', 'blue', 'yellow', 'blue']

If you one-hot encode these values without taking precautions for unseen values, scikit-learn’s OneHotEncoder will raise the following error when it transforms the test set:

Found unknown categories […] in column

Why? Because when you train your one-hot encoder on the train set, it doesn’t know what to do when it meets a value it hasn’t seen before. Or rather, it does know what to do: raise an error. Here’s what the documentation says:

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise).
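The default behaviour can be reproduced with a minimal sketch like the one below. It assumes the train and test values from the example above, reshaped into a single column; the variable names (train, test, encoder, error_message) are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Example data from the post, reshaped to one column (n_samples, 1).
train = np.array(['green', 'red', 'blue', 'red', 'blue', 'blue']).reshape(-1, 1)
test = np.array(['red', 'green', 'green', 'blue', 'yellow', 'blue']).reshape(-1, 1)

# No handle_unknown argument, so the encoder uses its default: raise an error.
encoder = OneHotEncoder()
encoder.fit(train)

error_message = None
try:
    # 'yellow' was never seen during fit, so transform fails here.
    encoder.transform(test)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Note that fitting on the train set works fine; the error only shows up once transform meets the unseen value.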

To avoid the error and have the unknown value simply ignored (meaning all zeroes in the binary columns green, red and blue), set the handle_unknown argument to “ignore”.

from sklearn.preprocessing import OneHotEncoder

OHE_model = OneHotEncoder(handle_unknown='ignore')
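To see what this does to the example data, here is a minimal sketch that fits the encoder on the train values and transforms the test values. The variable names are illustrative, and the result is converted with toarray() on the assumption that the encoder returns its default sparse output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Example data from the post, reshaped to one column (n_samples, 1).
train = np.array(['green', 'red', 'blue', 'red', 'blue', 'blue']).reshape(-1, 1)
test = np.array(['red', 'green', 'green', 'blue', 'yellow', 'blue']).reshape(-1, 1)

# Unknown categories are encoded as all zeroes instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# Learned categories are sorted, so the columns are blue, green, red.
print(encoder.categories_[0])

# Convert the sparse result to a dense array for inspection.
encoded = encoder.transform(test).toarray()
print(encoded)
```

The fifth row, which corresponds to 'yellow', contains only zeroes, while every other row has a single 1 in the column of its colour.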

