Change data capture (CDC) is a data replication method that tracks changes in a database and publishes them as messages to a real-time stream. Downstream systems can consume this row-level change feed for various (analytical and operational) purposes.
There are various ways that an event stream is processed, but they can be grouped in two broad categories.
In a pull-based system, the target database frequently polls the source system for changes. The benefit is that no single message gets lost, even when the target system is temporarily down. The other side of the coin is that this technique introduces a delay into the system, as messages are only transmitted when the target system asks for them.
In a push-based system, messages are broadcasted to target systems. This system is really real-time because the source system doesn’t need to wait for the target system’s polling. However, it also implies that there’s no guarantee that messages will arrive at their destination, for example when the target system is down.
It is possible to enjoy the best of both worlds. By using a message queue or stream processing platform, sources can broadcast, and target systems can consume all messages in real-time. To guarantee that none are lost, they are stored in a queue. When the target is temporarily unavailable, and messages aren’t consumed immediately, they are stored until the target is back online.
Many database vendors keep track of changes via log files: chronologically stored log messages that describe the changes and when they occurred.
These logs come in two different formats.
Statement-based replication (SBR): statement-based logs describe messages as SQL statements. Capturing all changes can be done by executing the SQL statements on the target data store.
An important aspect of SBR is that it requires less storage and bandwidth because updates that affect multiple rows are stored as a single statement.
The downside is that this method cannot recreate non-deterministic SQL statements. For example: a DELETE or UPDATE statement that contains a LIMIT without an ORDER BY clause.
Row-based replication (RBR): row-based logs contain events that describe how table rows have changed. Capturing changes can be done by copying all events to the target data store.
RBR ensures that non-deterministic changes are recorded. The other side of the coin is that RBR requires a lot of storage space: changes that include many rows contain a lot of log messages.