In this short blog post, I tackle a classic problem within machine learning: how to treat categorical values that were not seen during training, and how to solve scikit-learn's “Found unknown categories” error that comes with it.
Imagine you have a train and a test data set with the following values in a column:
- Train: ['green', 'red', 'blue', 'red', 'blue', 'blue']
- Test: ['red', 'green', 'green', 'blue', 'yellow', 'blue']
If you fit a one-hot encoder on the train values and don’t take precautions to handle unseen values, scikit-learn’s OneHotEncoder will raise the following error as soon as you transform the test values:
Found unknown categories […] in column…
Why? Because the encoder only learns the categories that are present in the train set when you fit it, so it doesn’t know what to do when it meets a value it hasn’t seen before, such as 'yellow' in the test set. Or rather, it does know what to do: raise an error. Here’s what the documentation says about the handle_unknown argument:
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise).
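To see the default behavior in action, here is a minimal sketch that reproduces the error with the train and test values from above. The variable names (train, test, encoder, error_message) are my own choices for illustration, and the exact wording of the message may differ slightly between scikit-learn and NumPy versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2D input: one row per sample, one column per feature.
train = np.array(['green', 'red', 'blue', 'red', 'blue', 'blue']).reshape(-1, 1)
test = np.array(['red', 'green', 'green', 'blue', 'yellow', 'blue']).reshape(-1, 1)

# With default settings, unknown categories raise an error at transform time.
encoder = OneHotEncoder()
encoder.fit(train)  # learns the categories blue, green and red

error_message = None
try:
    encoder.transform(test)  # 'yellow' was never seen during fit
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Note that fitting works fine; the error only shows up once the fitted encoder is asked to transform data containing the unseen value.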
To make sure you do not get an error, and the unseen value is simply ignored (meaning all zeroes in the binary columns green, red and blue), you need to set one particular argument: handle_unknown, which should be 'ignore'.
from sklearn.preprocessing import OneHotEncoder
OHE_model = OneHotEncoder(handle_unknown='ignore')
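As a quick check of what “ignored” means in practice, the sketch below fits such an encoder on the train values and transforms the test values. Again, the variable names are just illustrative, and I call .toarray() on the result because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array(['green', 'red', 'blue', 'red', 'blue', 'blue']).reshape(-1, 1)
test = np.array(['red', 'green', 'green', 'blue', 'yellow', 'blue']).reshape(-1, 1)

# Unknown categories are now encoded as all zeroes instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# Convert the default sparse output to a dense array for easy inspection.
encoded = encoder.transform(test).toarray()

# The learned categories are stored in sorted order: blue, green, red.
print(encoder.categories_[0])
print(encoded)

# The fifth test value is 'yellow', so its row contains only zeroes.
yellow_row = encoded[4]
print(yellow_row)
```

Keep in mind that an all-zero row is indistinguishable from “none of the known colors”, so the model downstream gets no signal that the value was actually something new.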
You’re welcome 😉