Roel Peters

Data Leakage

What is data leakage?

In machine learning, data leakage describes a situation in which information from outside the training data set is used to create the model. This is a problem because our goal is to develop a model that makes good predictions on unseen data. If some of that unseen data (from the validation or test set) leaks into the modeling process, the model gains an unfair advantage. In other words, the model is cheating when it is evaluated on the validation or test set, and its measured performance will overstate how well it does on truly new data. This is also known as train-test contamination.
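To make the idea concrete, here is a minimal sketch of one common way contamination creeps in: computing preprocessing statistics on the full data set instead of on the training rows only. The data, the split sizes, and the helper name `standardize` are all assumptions made for illustration; they are not taken from a specific project.

```python
import random
import statistics


def standardize(values, mean, std):
    """Scale values using the given mean and standard deviation."""
    return [(v - mean) / std for v in values]


random.seed(0)

# Hypothetical feature. The 20 "test" rows are deliberately shifted so the
# effect of leaking them into the preprocessing step is easy to see.
train = [random.gauss(0.0, 1.0) for _ in range(80)]
test = [random.gauss(3.0, 1.0) for _ in range(20)]

# Leaky: the statistics are computed on train + test, so the scaled training
# features already encode information about the test rows.
all_rows = train + test
leaky_mean = statistics.mean(all_rows)
leaky_std = statistics.stdev(all_rows)
leaky_train = standardize(train, leaky_mean, leaky_std)

# Clean: the statistics come from the training rows only and are then reused,
# unchanged, to transform the test rows.
clean_mean = statistics.mean(train)
clean_std = statistics.stdev(train)
clean_train = standardize(train, clean_mean, clean_std)
clean_test = standardize(test, clean_mean, clean_std)

print(f"mean used by leaky scaling: {leaky_mean:.3f}")
print(f"mean used by clean scaling: {clean_mean:.3f}")
```

The two means differ only because the test rows influenced the leaky version. The same reasoning applies to any fitted preprocessing step, such as imputation, encoding, or feature selection: fit it on the training data and only apply it to the validation and test data.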

Examples of data leakage

How to prevent data leakage?

There is no silver bullet to prevent data leakage, but there are some sanity checks that you can build into your modeling process.
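One check of this kind, chosen here as an illustration and not necessarily one the author had in mind, is to verify that no record appears in both the training and the test split. The sketch below assumes each row carries an identifier; the column name `customer_id` and the helper `find_overlap` are hypothetical.

```python
def find_overlap(train_rows, test_rows, key):
    """Return the sorted key values that occur in both splits."""
    train_keys = {row[key] for row in train_rows}
    test_keys = {row[key] for row in test_rows}
    return sorted(train_keys & test_keys)


# Hypothetical records; "customer_id" is an assumed identifier column.
train_rows = [
    {"customer_id": 1, "y": 0},
    {"customer_id": 2, "y": 1},
    {"customer_id": 3, "y": 0},
]
test_rows = [
    {"customer_id": 3, "y": 0},
    {"customer_id": 4, "y": 1},
]

overlap = find_overlap(train_rows, test_rows, key="customer_id")
if overlap:
    print(f"Possible leakage: ids in both splits: {overlap}")
```

A non-empty result does not prove that the model is cheating, but it is a cheap signal that the split deserves a second look before any evaluation score is trusted.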

Watch this great presentation by Yuriy Guts if you want to know more.
