Change Data Capture (CDC) vs. Business Event Pipelines
--
As more software comes to depend on large amounts of data, and moves away from relying solely on the request-response cycle toward event-driven architectures, it becomes more important to choose the right way to track changes and user actions in your system and to present users with the data and analytics they want.
Both CDC and business event pipelines can be built with service-oriented architectures in mind. Both can use a pub-sub model, which allows a variety of consumers to read the data they produce, but the same user action will produce data in a different format depending on which one you choose.
This is not a deep dive into how to implement either (I plan to write about both in more depth in the future!), but rather a high-level look at what the two are and how they differ, to help you understand why a system might implement both, or to help you choose which is more relevant for your needs.
Change Data Capture
What is it?
CDC captures changes made in a database at the row level and replicates them to another location, often another database or data store. This can take two forms. In the first, each change is itself recorded, so the full trail of how a row reached its current state is available. In the second, the goal is only to keep a separate data store up to date with the primary one, so changes are replicated there but intermediate states are not tracked.
CDC pipelines can be useful for data replication, such as to a data warehouse, or for ETL (Extract, Transform, Load) jobs.
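To make the two forms described above concrete (keeping the full trail of changes versus keeping only an up-to-date copy), here is a minimal Python sketch. The `ChangeEvent` shape, with its `op`, `key`, `before`, and `after` fields, and the `replay` helper are hypothetical names chosen for illustration; they are not the format of any particular CDC tool.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class ChangeEvent:
    """A hypothetical row-level change event (illustrative shape only)."""
    op: str                 # "insert", "update", or "delete"
    table: str              # table the row belongs to
    key: int                # primary key of the changed row
    before: Optional[Row]   # row state before the change (None for inserts)
    after: Optional[Row]    # row state after the change (None for deletes)


def replay(events: List[ChangeEvent]) -> Tuple[Dict[int, Row], List[ChangeEvent]]:
    """Apply events in order, returning (replica, history).

    `replica` holds only the latest state of each row.
    `history` keeps every event, so intermediate states stay available.
    """
    replica: Dict[int, Row] = {}
    history: List[ChangeEvent] = []
    for event in events:
        history.append(event)
        if event.op == "delete":
            replica.pop(event.key, None)
        else:
            # Inserts and updates both overwrite the row with its new state.
            replica[event.key] = dict(event.after or {})
    return replica, history


events = [
    ChangeEvent("insert", "users", 1, None, {"id": 1, "email": "a@example.com"}),
    ChangeEvent("update", "users", 1,
                {"id": 1, "email": "a@example.com"},
                {"id": 1, "email": "b@example.com"}),
]
replica, history = replay(events)
```

After the replay, `replica` contains only the row's current email, while `history` still shows that the address was changed. A consumer that only needs an up-to-date copy can discard `history` entirely.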
How are CDC events produced?
Because CDC events represent changes to rows in the database, they are typically generated by the database itself. The two most common ways are:
- Read from the write-ahead log (WAL). This allows changes to be captured asynchronously while still providing low latency.
- Use database triggers to write changes to a dedicated table in the database built for storing them. This also provides very low latency, but the trigger runs synchronously as part of each write, which can slow down writes to the tables being tracked.
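The trigger-based approach can be sketched with SQLite, which ships with Python. This is a minimal illustration under assumptions, not a production setup: the `users` and `users_changes` tables, their columns, and the trigger names are all hypothetical, and a real system would also need something to read and ship the rows from the changes table.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database whose triggers record every row change."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);

        -- Hypothetical changes table: one row per insert, update, or delete.
        CREATE TABLE users_changes (
            change_id INTEGER PRIMARY KEY AUTOINCREMENT,
            op        TEXT NOT NULL,
            user_id   INTEGER NOT NULL,
            old_email TEXT,
            new_email TEXT
        );

        -- Each trigger fires synchronously, inside the statement that
        -- modifies users, which is where the extra write cost comes from.
        CREATE TRIGGER users_after_insert AFTER INSERT ON users BEGIN
            INSERT INTO users_changes (op, user_id, old_email, new_email)
            VALUES ('insert', NEW.id, NULL, NEW.email);
        END;

        CREATE TRIGGER users_after_update AFTER UPDATE ON users BEGIN
            INSERT INTO users_changes (op, user_id, old_email, new_email)
            VALUES ('update', NEW.id, OLD.email, NEW.email);
        END;

        CREATE TRIGGER users_after_delete AFTER DELETE ON users BEGIN
            INSERT INTO users_changes (op, user_id, old_email, new_email)
            VALUES ('delete', OLD.id, OLD.email, NULL);
        END;
        """
    )
    return conn


conn = make_db()
conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 1")

# Read the captured changes in the order they happened.
changes = conn.execute(
    "SELECT op, user_id, old_email, new_email "
    "FROM users_changes ORDER BY change_id"
).fetchall()
```

Here the application code only touches `users`; the database itself fills `users_changes`. Log-based capture gives a similar stream of changes without adding work to each write, but it depends on the database server's log and tooling, so it is not shown in this sketch.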