What is data leakage?
In machine learning, data leakage describes a situation where information from outside the training data set is used to create the model. This is a problem because our goal is to build a model that makes good predictions on unseen data. If some of that unseen data (from the validation or test set) leaks into the modeling process, we give the model an unfair advantage. In other words: the model will be cheating when evaluated on the validation or test set. This is also known as train-test contamination.
Examples of data leakage
- Deriving features, like cluster assignments or topics, from the complete data set and not only the train set.
- One-hot encoding features on the complete data set.
- Normalization of features on the complete data set (see the normalization sketch after this list).
- Oversampling before splitting into train and test sets, which can put duplicates of the same rows in both sets (see the oversampling sketch below).
- Non-independent data: imagine having a data set with 5 snapshots per day from 100 weather stations. You should split by weather station and not by snapshot, so the model is never evaluated on stations it has already seen during training (see the group-split sketch below).
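To make the normalization example concrete, here is a minimal sketch using scikit-learn's StandardScaler (the data is randomly generated for illustration). Fitting the scaler on the complete data set lets the test rows influence the scaling statistics; fitting it on the train set only keeps the test set truly unseen:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(1000, 5))  # illustrative feature matrix
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Leaky: the scaler's mean and std are computed over the test rows too.
leaky_scaler = StandardScaler().fit(X)

# Safe: fit on the train set only, then apply the same transform to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```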
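The oversampling pitfall can be sketched in a few lines as well; this example duplicates minority-class rows with scikit-learn's resample before splitting (all data is synthetic). Because the copies end up spread across both splits, part of the test set is identical to training rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # imbalanced labels

# Leaky order: oversample the minority class first...
X_extra = resample(X[y == 1], n_samples=80, random_state=0)
X_over = np.vstack([X, X_extra])
y_over = np.concatenate([y, np.ones(80, dtype=int)])

# ...then split. Duplicates of the same minority rows now sit in both
# splits, so the model is tested on rows it was trained on.
X_train, X_test, y_train, y_test = train_test_split(
    X_over, y_over, test_size=0.3, random_state=0
)
# Safe order: split first, then oversample the training set only.
```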
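For the weather-station example, scikit-learn's GroupShuffleSplit can perform the station-level split. The sketch below uses synthetic data with the same shape as the example above; the assertion checks that no station ends up in both splits:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
station_id = np.repeat(np.arange(100), 5)  # 100 stations, 5 snapshots each
X = rng.normal(size=(len(station_id), 4))

# Assign whole stations to either train or test, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=station_id))
assert not set(station_id[train_idx]) & set(station_id[test_idx])
```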
How to prevent data leakage?
There is no silver bullet to prevent data leakage. But there are some sanity checks that you can build into your modeling process:
- If the performance of your model seems too good to be true, it probably is. Check all the steps in your modeling process for data leakage.
- Use pipelines to prevent data leakage, like sklearn pipelines: they force you to think in a specific framework that is particularly good at preventing data leakage (see the sketch after this list).
- Hold an extra data set back for a final test of your model.
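As a sketch of the pipeline point above (the model and data are placeholders): wrapping preprocessing and model into a single pipeline means the scaler is re-fitted inside every cross-validation fold, using only that fold's training rows, so no held-out rows leak into the preprocessing step:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)  # synthetic binary target

# cross_val_score clones the pipeline per fold; StandardScaler is fitted
# on each fold's training rows only, never on the held-out rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```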
Watch this great presentation by Yuriy Guts if you want to know more.