Data monitoring is closely related to data testing: both aim to preserve or improve data quality. But monitoring starts from a different philosophy. Instead of testing data against known scenarios, monitoring means continuously collecting, storing, and analyzing various properties of your data.
When a data monitoring system detects an anomaly, it will usually send out an alert. Modern monitoring systems also provide observability of data pipelines: by correlating outputs across the whole data system, they point toward the likely root cause of an anomaly.
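To make the collect-store-analyze loop concrete, here is a minimal sketch of metric-based monitoring. The function names, metrics, and the three-standard-deviations threshold are illustrative assumptions, not part of any specific tool: a batch's properties (row count, null rate) are collected, compared against stored history, and flagged when they deviate too far from the historical mean.

```python
import statistics

def collect_metrics(rows):
    """Compute simple properties of a batch of records (dicts):
    row count and the fraction of null cells."""
    count = len(rows)
    nulls = sum(1 for r in rows for v in r.values() if v is None)
    cells = sum(len(r) for r in rows) or 1  # avoid division by zero
    return {"row_count": count, "null_rate": nulls / cells}

def is_anomalous(history, value, threshold=3.0):
    """Flag a new metric value that lies more than `threshold`
    standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

# Usage: feed daily row counts; a sudden drop triggers an alert.
history = [1000, 1010, 990, 1005, 995]
print(is_anomalous(history, 1002))  # normal value -> False
print(is_anomalous(history, 200))   # sudden drop -> True
```

In a real system the history would live in a metrics store and the alert would go to a channel like Slack or PagerDuty, but the detection logic follows the same pattern.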
Monitoring is relevant for several reasons:
- Unknown unknowns: To write data tests, you need to know all the scenarios you want to test for in advance. This means that large organizations with many pipelines might have hundreds or thousands of tests in place, but they’ll never be able to catch data issues they didn’t even know could happen. Data monitoring notifies them about untested issues.
- Data changes: Source data evolves over time, and downstream tests are rarely designed to catch data drift, the gradual shift in a dataset’s distribution.
- Changing data pipelines: Businesses evolve, and their data products evolve with them. Implemented changes often break existing downstream logic in ways that tests don’t account for. Monitoring tools can help identify these problems quickly, both in testing and production environments.
- Testing debt: An organization’s data pipelines might have been up and running for years. However, there’s a chance they are from an era when the data maturity was low, and testing was not a priority. With such technical debt, debugging pipelines can take a while. Monitoring tools can guide organizations in setting up proper tests.
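The data-drift point above can be illustrated with a small sketch. The function name and the two-standard-deviations threshold are assumptions made for this example: the idea is simply to compare a column’s current values against a recorded baseline, something a fixed test with hard-coded bounds would typically miss.

```python
import statistics

def drift_detected(baseline, current, threshold=2.0):
    """Return True when the mean of `current` has shifted more than
    `threshold` baseline standard deviations from the baseline mean."""
    base_mean = statistics.mean(baseline)
    base_stdev = statistics.stdev(baseline)
    if base_stdev == 0:
        return statistics.mean(current) != base_mean
    shift = abs(statistics.mean(current) - base_mean) / base_stdev
    return shift > threshold

# Usage: order amounts drifting upward after, say, a pricing change.
baseline = [20.0, 22.0, 19.0, 21.0, 20.5]
print(drift_detected(baseline, [20.5, 21.0, 19.5]))  # stable -> False
print(drift_detected(baseline, [45.0, 47.0, 44.0]))  # drifted -> True
```

Production monitoring tools use richer statistics (for example distribution-distance measures rather than a simple mean shift), but the principle is the same: compare current data against learned history instead of against fixed expectations.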