The truth is that you rarely have complete control over how or what data is collected. That’s why you should evaluate your data for its quality. Data quality has many dimensions, and the list will be longer or shorter depending on who you ask.
- Data validity: Values must conform to the format the consuming system expects. Dates and times are a classic case: a ‘MM/DD/YY’ string could be misinterpreted when ‘YYYY-MM-DD’ is expected.
- Data uniqueness: Where duplicates are not allowed, no two values in a column (for example, a primary key) and no two rows in a table should be the same.
- Data completeness: Unless data is intentionally transformed, everything that exists in the source system should arrive in the destination system; no records or fields should be dropped along the way.
- Data consistency: The same fact should have the same value everywhere it is stored. For example, when a customer profile exists in both the e-commerce platform and the CRM, the addresses should match. A sketch of what checking these dimensions can look like follows this list.
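To make these dimensions concrete, here is a minimal sketch of checking them on a pandas DataFrame. The table, its column names (`order_id`, `order_date`), and the expected ‘YYYY-MM-DD’ format are illustrative assumptions, not taken from any particular system.

```python
import pandas as pd

# Hypothetical orders extract with two deliberate problems: a duplicated
# order_id and two order_date values that don't parse as YYYY-MM-DD.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "order_date": ["2024-01-15", "2024-02-30", "2024-03-01", "01/04/24"],
})

# Validity: every order_date must parse as YYYY-MM-DD; failures become NaT.
parsed = pd.to_datetime(orders["order_date"], format="%Y-%m-%d", errors="coerce")
invalid_dates = orders[parsed.isna()]

# Uniqueness: order_id is assumed to be a primary key, so duplicates are errors.
duplicate_ids = orders[orders["order_id"].duplicated(keep=False)]

# Completeness: compare the row count against the source system's count
# (hard-coded here; in practice you would query the source).
source_row_count = 4
is_complete = len(orders) == source_row_count

# Consistency would require joining these records against the copy held in
# another system (e.g. the CRM) and comparing field by field.
print(len(invalid_dates), "invalid dates;", len(duplicate_ids),
      "duplicate ids; complete:", is_complete)
```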
To mitigate problems with data not behaving as expected, data engineers implement data tests throughout an organization’s data pipelines. Data tests encode the assumptions that need to hold for data to be processed as planned. When a test detects an issue, a specific action needs to follow: the offending data can be flagged, processed differently, or quarantined for later processing, or the test can trigger a notification asking for manual intervention. A sketch of this detect-then-act pattern follows below.
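Here is a minimal sketch of that pattern: records failing a check are quarantined rather than processed, and someone is alerted. The `is_valid` rule and the `notify` hook are hypothetical; in practice the hook might post to Slack or page an on-call engineer.

```python
def is_valid(record: dict) -> bool:
    """Encoded assumption: every record must carry a non-empty customer_id."""
    return bool(record.get("customer_id"))

def notify(message: str) -> None:
    # Hypothetical notification hook; replace with Slack, email, paging, etc.
    print(f"ALERT: {message}")

def process(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into those safe to process and those to quarantine."""
    good = [r for r in records if is_valid(r)]
    quarantined = [r for r in records if not is_valid(r)]
    if quarantined:
        notify(f"{len(quarantined)} records failed validation and were quarantined")
    return good, quarantined

good, bad = process([{"customer_id": "c-1"}, {"customer_id": ""}])
```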
Different types of data testing
There are multiple types of tests, and they can be implemented and executed at different phases of the DataOps development process.
- Unit tests: Data engineers run unit tests while developing the data pipelines. Unit tests are executed against isolated components (units), such as a single extraction, load, or transformation. By running a single data object through one of these components, you can check whether the output matches expectations, much like in software development (see the pytest sketch after this list).
- End-to-end testing: Checks whether a complete data pipeline behaves as expected. Typically, these tests are executed once changes are deployed and integrated in a staging environment. End-to-end tests require a data object for which you have both the initial and the expected final form, so that when you run the initial data object through your data pipeline, you can compare the result to the expected form (see the golden-file sketch after this list).
- Data quality testing: Often, this is what people mean when they talk about ‘data testing’. Unlike the previous tests, a data quality test runs continuously, on a recurring basis, or whenever a new data object has been processed. The checks can be anything, from null values to date formats, capitalization, and data types. A popular open-source framework for implementing data quality tests is Great Expectations (see the sketch below).
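A minimal pytest sketch of a unit test for a single transformation. The `normalize_country` function is a hypothetical unit under test, standing in for whatever extraction, load, or transformation step you want to isolate.

```python
# test_transformations.py -- run with `pytest`

def normalize_country(raw: str) -> str:
    """Hypothetical transformation unit: map free-text country names to ISO codes."""
    mapping = {"united states": "US", "the netherlands": "NL"}
    return mapping.get(raw.strip().lower(), "UNKNOWN")

def test_normalize_country_known_value():
    # A single data object run through the isolated unit, compared to expectation.
    assert normalize_country(" United States ") == "US"

def test_normalize_country_unknown_value():
    assert normalize_country("Atlantis") == "UNKNOWN"
```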
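An end-to-end test can be sketched the same way at pipeline scope: run a known initial data object through the full pipeline and compare the output against its expected final form (a “golden file”). Here `run_pipeline` is a trivial stand-in for your real pipeline entry point, which is an assumption for illustration.

```python
import json
from pathlib import Path

def run_pipeline(input_path: str, output_path: str) -> None:
    """Stand-in for the real pipeline (extract -> transform -> load).
    As a toy 'transformation', it uppercases country codes."""
    records = json.loads(Path(input_path).read_text())
    for record in records:
        record["country"] = record["country"].upper()
    Path(output_path).write_text(json.dumps(records))

def test_pipeline_end_to_end(tmp_path):
    # The initial form of the data object...
    (tmp_path / "initial.json").write_text(json.dumps([{"country": "us"}]))
    # ...and the final form we expect after the full pipeline has run.
    expected = [{"country": "US"}]

    run_pipeline(str(tmp_path / "initial.json"), str(tmp_path / "actual.json"))

    actual = json.loads((tmp_path / "actual.json").read_text())
    assert actual == expected
```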
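Finally, a sketch of data quality checks with Great Expectations, mirroring the dimensions listed earlier. This uses the pandas-backed API from the classic 0.x releases (`great_expectations.from_pandas`); newer 1.x releases restructure the API around a context object, so check the docs for your version.

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so expectations can be run directly against it.
df = ge.from_pandas(pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": ["2024-01-01", "2024-01-02", "2024-01-03"],
}))

# Null check: every row must have a customer_id.
assert df.expect_column_values_to_not_be_null("customer_id").success

# Uniqueness check: customer_id is assumed to be a primary key.
assert df.expect_column_values_to_be_unique("customer_id").success

# Date-format check: signup_date must be YYYY-MM-DD.
assert df.expect_column_values_to_match_strftime_format(
    "signup_date", "%Y-%m-%d"
).success
```

In a real pipeline these expectations would be stored in an expectation suite and validated on a schedule or on every new batch, rather than asserted inline.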