What is Data Shift?
Data shift, also called dataset shift, data drift, or model drift, is the change in the input data your model receives over time, relative to the data it was trained on. It is one of the most common reasons for degrading model accuracy. That is why there is a whole industry of tools that let you monitor your models in production.
There are many possible reasons for data shift:
- A replacement of sensors that capture data in a slightly different way than their predecessors
- Human error: e.g. somebody broke the data pipeline, the sensor is no longer connected, …
- A “natural” drift, e.g. because of a change in human preferences
What all these causes have in common is that the real-world data differs from the data the model was trained on. We can classify data shifts into different categories:
- Covariate shift: The distribution of one or more covariates (input variables) in the test set differs from the training set, while the relationship between inputs and output stays the same. For example: a facial recognition algorithm is trained on a dataset in which one ethnicity is overrepresented compared to the population it is later applied to (a detection sketch follows this list).
- Prior probability shift: The distribution of the dependent variable in the test set differs from the training set. A widely used example is how the proportion of spam emails changes over time, i.e. its prior probability changes.
- Concept shift: The relationship between the dependent and independent variables changes, e.g. because of seasonality that is not captured by the model.
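To make the covariate-shift case concrete, here is a minimal sketch of how such a shift can be detected, assuming scipy and synthetic data: each feature's training distribution is compared against recent production data with a two-sample Kolmogorov-Smirnov test. The feature names, the simulated drift, and the 0.05 threshold are illustrative assumptions, not part of any particular monitoring tool.

```python
# Minimal sketch: detect covariate shift per feature with a two-sample KS test.
# The features ("age", "income") and the drift in "age" are simulated.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Training data vs. recent production data; "age" has drifted upward.
train = {"age": rng.normal(35, 8, 5_000), "income": rng.normal(50_000, 12_000, 5_000)}
prod = {"age": rng.normal(42, 8, 1_000), "income": rng.normal(50_000, 12_000, 1_000)}

for feature in train:
    result = ks_2samp(train[feature], prod[feature])
    drifted = result.pvalue < 0.05  # reject "same distribution" at the 5% level
    print(f"{feature}: KS={result.statistic:.3f}, p={result.pvalue:.4f}, drift={drifted}")
```

The same comparison applied to the labels (where available) gives a basic check for prior probability shift.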
The field of domain adaptation deals with “the ability to apply an algorithm trained in one or more ‘source domains’ to a different (but related) ‘target domain’”.
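One simple domain-adaptation technique for covariate shift is importance weighting: a “domain classifier” learns to distinguish source (training) inputs from target (production) inputs, and its predicted odds approximate the density ratio used as per-sample weights when fitting the actual model. Below is a minimal sketch assuming scikit-learn and synthetic data; the data, model choices, and numbers are illustrative, not from the original post.

```python
# Minimal sketch of importance weighting for covariate shift.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=0)

# Source (training) and target (production) inputs with shifted means.
X_source = rng.normal(0.0, 1.0, size=(2_000, 3))
X_target = rng.normal(0.5, 1.0, size=(500, 3))
y_source = (X_source[:, 0] + 0.1 * rng.normal(size=2_000) > 0).astype(int)

# 1. Domain classifier: label source samples as 0, target samples as 1.
X_domain = np.vstack([X_source, X_target])
d_domain = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
domain_clf = LogisticRegression(max_iter=1_000).fit(X_domain, d_domain)

# 2. Importance weights for source samples: p(target | x) / p(source | x).
proba = domain_clf.predict_proba(X_source)
weights = proba[:, 1] / proba[:, 0]

# 3. Fit the actual model on source data, reweighted towards the target domain.
task_clf = LogisticRegression(max_iter=1_000).fit(X_source, y_source, sample_weight=weights)
print(f"Mean importance weight: {weights.mean():.3f}")
```

In practice you would inspect and often clip the estimated weights before trusting them, since a few extreme values can dominate the reweighted fit.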
Want to know more?
- Some great conference slides
- Georgios Sarantitis described an original lockdown-related example of prior probability shift on his blog. He also describes some solutions.
- A great post about covariate shift