All too often, tightly coupled, complex monolithic applications get in the way when businesses try to scale their processes and add functionality. Unless they’re maintained carefully and regularly, well-established monoliths grow ever more complex and entangled, and resolving bottlenecks becomes increasingly difficult. This article shows how Change Data Capture can help you break out of this vicious circle and transform your system.
The Challenge: An Outdated Monolithic System
Imagine you’re responsible for a well-established monolithic system, and you have to adapt it to your market’s current conditions and requirements. Such a task would require a monumental effort and vast resources. But if you can carve existing functionality out of that legacy monolith, resolving bottlenecks and speeding up development suddenly becomes far easier. With Change Data Capture (CDC), a system modernization effort can become relatively quick and cost-effective.
Let’s break down the process together.
What Is Change Data Capture?
Change Data Capture is a set of patterns for continuously tracking data changes so that they can be propagated reliably to other systems. Its primary goal is to capture changes promptly, in real time or near real time, and keep data aligned across systems or databases.
If you need to maintain data integrity and keep data synchronized, it helps to know the various CDC implementation methods. Understanding them lets you make an informed choice of synchronization pattern, and the right choice gives end-users access to consistent, up-to-date information.
Consider a Scenario
Imagine the IT department of a furniture and decoration business called "HomeDecor" is growing increasingly concerned about a serious bottleneck in their system. The company regularly receives new shipments of furniture — triggering a mass update to the system’s only database every time.
Each time the database is bombarded with these requests, the system’s ordering flow can’t access the relevant tables because they are pessimistically locked.
So, like any large business struggling with such a scenario, the bosses at HomeDecor enlist the services of a software architecture consultant. Her name is Amy, and one of her first ideas is to extract a service to balance the load distribution. But she’s concerned about straining the running system. What does she do?
Amy turns to Change Data Capture (CDC). Why? Because it can deliver reliable data integration — through a set of continuous data integration patterns.
She knows that CDC is excellent at synchronizing Read Replicas, streaming data into data warehouses for analysis, and publishing change events into data streams within event-driven architectures. But Amy needs to figure out how the different patterns vary, what their advantages and drawbacks are, and how she can leverage CDC to extract functionality from HomeDecor’s monolith.
Let’s join Amy as she considers ways to implement CDC.
Pull-Based Approaches to Change Data Capture Implementation
Amy starts by considering pull-based methods, which are often viewed as a first iteration of CDC due to their relative simplicity. However, they also have certain disadvantages compared to push-based techniques.
The key difference lies in the direction of data transmission to the target database: In pull-based approaches, the target system must routinely poll the source system to detect new changes in the source database. This can be achieved through three key methods:
1. Row Versioning
Using this method, every table in the original database that needs to track changes gets extended with a row version column. When a change happens to a table entry, its version number goes up. This increase can happen either through a database trigger or directly within application logic. The target system then checks the original database at set intervals to find any new or updated entries with higher version numbers than those in its database.
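Row versioning as described above can be sketched roughly as follows. This is a minimal illustration using an in-memory SQLite database; the `product` table, its columns, and the function names are assumptions made for the example, not HomeDecor’s actual schema.

```python
import sqlite3


def setup_source() -> sqlite3.Connection:
    """Create an in-memory stand-in for the source database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE product (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            stock INTEGER NOT NULL,
            row_version INTEGER NOT NULL DEFAULT 1
        );
        CREATE TRIGGER product_bump_version
        AFTER UPDATE OF name, stock ON product
        BEGIN
            UPDATE product SET row_version = OLD.row_version + 1
            WHERE id = NEW.id;
        END;
        """
    )
    return conn


def poll_changes(source: sqlite3.Connection, replica: dict) -> list:
    """Copy rows that are new or have a higher version than the replica knows.

    The replica maps product id -> (name, stock, row_version).
    Returns the ids that were inserted or refreshed during this poll.
    """
    changed = []
    query = "SELECT id, name, stock, row_version FROM product ORDER BY id"
    for product_id, name, stock, version in source.execute(query):
        known = replica.get(product_id)
        if known is None or known[2] < version:
            replica[product_id] = (name, stock, version)
            changed.append(product_id)
    return changed
```

Here the version is bumped by a database trigger; the same increment could live in application logic instead. Because each row carries its own counter, this sketch compares every row on each poll, which is one reason the approach adds load to the source.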
2. Timestamps
Instead of using row versioning, Amy considers extending the source tables with a timestamp column. When a row is inserted or changed, the column’s value is set to the current time. At each polling interval, the target system retrieves only the data that was changed after the previous synchronization.
Take a look at the table below to see how this might apply to HomeDecor’s system.
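As a hypothetical illustration of timestamp-based polling, the sketch below uses an in-memory SQLite table with an `updated_at` column. The table layout, column names, and helper functions are assumptions made for this example. Timestamps are stored as ISO-8601 UTC strings, which sort correctly as plain text.

```python
import sqlite3


def create_source() -> sqlite3.Connection:
    """Create a hypothetical product table with a timestamp column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE product (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            stock INTEGER NOT NULL,
            updated_at TEXT NOT NULL
        )
        """
    )
    return conn


def save_product(conn, product_id, name, stock, now):
    """Insert or overwrite a product and stamp it with the given time."""
    conn.execute(
        "INSERT OR REPLACE INTO product (id, name, stock, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (product_id, name, stock, now),
    )


def fetch_changed_since(conn, last_sync):
    """Return rows changed after last_sync, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, stock, updated_at FROM product "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_sync,),
    ).fetchall()
    new_watermark = rows[-1][3] if rows else last_sync
    return rows, new_watermark
```

The target keeps the returned high-water mark and passes it to the next poll, so each interval transfers only the rows touched since the previous synchronization.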
What About Deleted Entries?
Both timestamps and row versioning can help Amy in her effort to extract data from HomeDecor’s legacy monolith, but she’s concerned about deleted entries. Once a row has been deleted, there is nothing left to carry a version number or timestamp, so a version or timestamp column alone can’t solve this issue. What does she do?
Amy realizes she has two options: either resort to soft deleting (where data is flagged as deleted but not actually removed) or implement a more complex synchronization mechanism that would involve deleting data on the target system if the corresponding entry can no longer be found in the source system.
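The two options can be contrasted in a short sketch. The replica is modeled as a plain dictionary, and the row layout and function names are assumptions chosen for illustration.

```python
def apply_soft_deletes(replica: dict, changed_rows) -> None:
    """Option 1: the source flags rows as deleted instead of removing them.

    changed_rows yields (product_id, name, is_deleted) tuples, as a poll
    over a table with an extra is_deleted column might return them.
    """
    for product_id, name, is_deleted in changed_rows:
        if is_deleted:
            replica.pop(product_id, None)
        else:
            replica[product_id] = name


def reconcile_hard_deletes(replica: dict, source_ids) -> set:
    """Option 2: remove replica entries whose id no longer exists at the source.

    source_ids is the complete set of ids currently present in the source
    table, so this requires reading every key on each reconciliation run.
    """
    removed = set(replica) - set(source_ids)
    for product_id in removed:
        del replica[product_id]
    return removed
```

Soft deletes keep the synchronization logic simple but leave flagged rows in the source. Reconciliation keeps the source clean at the cost of comparing the full key sets.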
3. Trigger-Based Approaches
Modern database systems generally offer the functionality to set triggers that activate when a record is inserted, updated, or deleted. Amy thinks she can leverage this approach to create a transaction table that logs every change to the source system. That way, the target system can periodically check the transaction table to synchronize its Read Replica.
Let’s take a look at how such a transaction table might look in HomeDecor’s database.
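A minimal sketch of a trigger-populated transaction table, again using in-memory SQLite. The `product` and `product_changes` tables and their columns are hypothetical; a production database would use its own trigger syntax.

```python
import sqlite3

SCHEMA = """
CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    stock INTEGER NOT NULL
);
CREATE TABLE product_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    operation TEXT NOT NULL,
    product_id INTEGER NOT NULL,
    name TEXT,
    stock INTEGER
);
CREATE TRIGGER product_after_insert AFTER INSERT ON product BEGIN
    INSERT INTO product_changes (operation, product_id, name, stock)
    VALUES ('INSERT', NEW.id, NEW.name, NEW.stock);
END;
CREATE TRIGGER product_after_update AFTER UPDATE ON product BEGIN
    INSERT INTO product_changes (operation, product_id, name, stock)
    VALUES ('UPDATE', NEW.id, NEW.name, NEW.stock);
END;
CREATE TRIGGER product_after_delete AFTER DELETE ON product BEGIN
    INSERT INTO product_changes (operation, product_id, name, stock)
    VALUES ('DELETE', OLD.id, NULL, NULL);
END;
"""


def open_source() -> sqlite3.Connection:
    """Create the source table together with its change-logging triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def read_changes(conn: sqlite3.Connection, after_change_id: int) -> list:
    """Return logged changes newer than the last change the target processed."""
    return conn.execute(
        "SELECT change_id, operation, product_id, name, stock "
        "FROM product_changes WHERE change_id > ? ORDER BY change_id",
        (after_change_id,),
    ).fetchall()
```

Unlike version or timestamp columns, the transaction table also records deletions. Note that every write to `product` now causes a second write to `product_changes`.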
The Disadvantage of Pull-Based Change Data Capture
Pull-based methods have a significant disadvantage: they impose additional overhead on the source system. Each write operation requires an additional database write (to a version column, a timestamp column, or a transaction table) to make the change visible to the target system, and the periodic polling queries add read load on top.
Amy realizes that this might cause problems for HomeDecor’s system — particularly during bulk data insertions. That’s why she immediately turns her attention to push-based approaches to CDC.
Push-Based Approaches to Change Data Capture Implementation
Amy considers the implementation of a notification system that produces events based on row-level changes — an advance on the trigger-based concept. If she can push these changes to the target system, there’d be no need for periodic polling.
The transaction log in modern database systems is crucial to the rolling back of aborted transactions — guaranteeing atomicity and facilitating failure recovery. Today’s powerful Change Data Capture tools can read transaction logs continuously.
Widely used databases, including Microsoft SQL Server, go a step further by offering integrated functionality that populates built-in CDC tables with relevant data extracted from the transaction log. This approach simplifies the process of accessing information about altered data.
Modern CDC tools can transform the content of the transaction log into discrete change events. Each event contains information such as the pre- and post-operation values, the transaction ID, the commit time, the type of operation, and the timestamp marking the creation of the change event. These events can then be published to an event stream for consumption by the target system.
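To make the shape of such a change event concrete, here is a small sketch that models the listed fields as a data class and parses one from JSON. The field names and JSON layout are hypothetical and do not reproduce the format of any particular CDC tool.

```python
import json
from dataclasses import dataclass
from typing import Optional

VALID_OPERATIONS = {"insert", "update", "delete"}


@dataclass(frozen=True)
class ChangeEvent:
    operation: str            # type of operation: insert, update, or delete
    before: Optional[dict]    # row values before the operation (None on insert)
    after: Optional[dict]     # row values after the operation (None on delete)
    transaction_id: str       # transaction that produced the change
    commit_time_ms: int       # when the transaction committed
    event_created_ms: int     # when the CDC tool created this event


def parse_change_event(raw: str) -> ChangeEvent:
    """Parse a JSON-encoded change event in the hypothetical layout above."""
    data = json.loads(raw)
    operation = data["operation"]
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return ChangeEvent(
        operation=operation,
        before=data.get("before"),
        after=data.get("after"),
        transaction_id=str(data["transaction_id"]),
        commit_time_ms=int(data["commit_time_ms"]),
        event_created_ms=int(data["event_created_ms"]),
    )
```

A consumer on the target side would read such events from the stream and apply them in order.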
The Pros and Cons of Push-Based Change Data Capture Implementation
Amy must consider the advantages and drawbacks of push-based CDC methods in her quest to extract functionality from HomeDecor’s bottlenecked monolith.
Push-based CDC ensures near-real-time data availability by continuously monitoring transaction logs, thus minimizing delays in change event generation. It also guarantees the capture of all changes to individual rows, whereas pull-based methods risk missing intermediate updates if a row changes multiple times between polling intervals. Additionally, push-based CDC eliminates the need for extra columns in source database tables and custom logic for tracking deletions.
Amy decides that push-based CDC is preferable for event-based architectures needing the freshest data. However, she’s concerned that incorporating change events and event streams can add unnecessary complexity, potentially over-engineering smaller applications. And that’s not all: some database systems may not be supported by the chosen CDC connector.
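The difference in completeness between polling and log-based capture can be illustrated with a toy model. The `ToySource` class is purely hypothetical: it keeps the current row state, which a poll would see, alongside an append-only log, which a push-based tool would read.

```python
class ToySource:
    """Toy source holding current state plus an append-only change log."""

    def __init__(self) -> None:
        self.rows = {}   # current state, visible to a polling target
        self.log = []    # every change in order, visible to a log reader

    def update(self, key: str, value: int) -> None:
        self.rows[key] = value
        self.log.append((key, value))


def poll_snapshot(source: ToySource) -> dict:
    """What a pull-based target observes at one polling interval."""
    return dict(source.rows)


def read_log(source: ToySource, offset: int) -> list:
    """What a push-based pipeline delivers: every change since the offset."""
    return source.log[offset:]
```

If the stock of one product changes from 5 to 3 and then to 7 between two polls, the snapshot contains only the final value 7, while the log still contains both changes.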
A Modern Standard Approach: Debezium and Apache Kafka
In recent years, the open-source platform Debezium has emerged as a widely used solution for implementing push-based CDC. Debezium is built on top of the Apache Kafka event streaming platform.
Debezium is usually deployed as a set of connectors running in Kafka Connect, which publish change events to Kafka topics. It can also be integrated as an embedded Java library, which allows for close integration and makes it easy to manipulate change events using custom business logic.
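As a rough sketch of the Kafka Connect deployment, the snippet below builds a connector registration for the Kafka Connect REST API. It assumes a PostgreSQL source, which the scenario does not specify; the host, credentials, connector name, and table names are placeholders. The property names follow the Debezium 2.x PostgreSQL connector and should be checked against the documentation for the version in use. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_connector_payload(host: str, database: str, tables: list) -> dict:
    """Build a Debezium connector definition (placeholder values throughout)."""
    return {
        "name": "homedecor-product-connector",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": host,
            "database.port": "5432",
            "database.user": "cdc_user",        # placeholder credentials
            "database.password": "change-me",
            "database.dbname": database,
            "topic.prefix": "homedecor",        # prefix for the Kafka topic names
            "table.include.list": ",".join(tables),
        },
    }


def build_registration_request(connect_url: str, payload: dict) -> urllib.request.Request:
    """Prepare a POST to Kafka Connect's /connectors endpoint without sending it."""
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would register the connector with a running Kafka Connect instance, for example one listening on `http://localhost:8083`.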
An Effective Solution for HomeDecor
After weighing up the pros and cons of the push- and pull-based approaches to Change Data Capture, Amy comes up with a possible solution for HomeDecor.
Let’s consider the problem Amy is facing again. She’s addressing the bottlenecks in HomeDecor’s system caused by the arrival of new shipments that trigger mass updates to the business’s only database. She believes she has a solution.
What if an extracted Product Listing Service could provide information on whether certain products are available or not? This solution would bypass the company’s monolithic database by utilizing a Read Replica — a copy of the database’s relevant tables that’s synchronized with CDC.
This copying mechanism allows HomeDecor to map the existing data to a model more appropriate to the extracted service. Not only that, but the service can also receive updates from the database without blocking it, which implies eventual consistency with the monolith’s data.
To make this solution work, Amy decides to adapt the monolith’s fetch logic so that end-users get near-real-time product availability data from the extracted service going forward.
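The mapping step might look like the following sketch, in which the extracted service keeps only what it needs: whether a product is available. The event layout (`operation`, `before`, `after`) and the `id` and `stock` fields are assumptions made for illustration.

```python
class ProductAvailabilityReadModel:
    """Read model for a hypothetical Product Listing Service.

    It maps the monolith's product rows to a simpler shape: a boolean
    availability per product id, updated from change events.
    """

    def __init__(self) -> None:
        self._available = {}

    def apply(self, event: dict) -> None:
        """Apply one change event; replaying the same event is harmless."""
        operation = event["operation"]
        if operation in ("insert", "update"):
            row = event["after"]
            self._available[row["id"]] = row["stock"] > 0
        elif operation == "delete":
            self._available.pop(event["before"]["id"], None)
        else:
            raise ValueError(f"unknown operation: {operation!r}")

    def is_available(self, product_id: int) -> bool:
        """Unknown products are reported as unavailable."""
        return self._available.get(product_id, False)
```

Because events arrive after the source transaction commits, the read model can briefly lag behind the monolith. That lag is the eventual consistency mentioned above.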
Success! HomeDecor’s end-users are no longer blocked when bulk inserts occur during their shopping spree. Why? Because, from this point, end-users will retrieve the availability of products through the newly introduced service.
In the second blog post of this two-part series, we’ll examine this proposed architecture in more detail. Which method of CDC implementation would prove most suitable in this scenario? We’ll take a look at the drawbacks and caveats that apply to each method. And we’ll investigate ways to extract even more monolith functionality using CDC.
We’ll also ask the burning question: Can Change Data Capture be applied to other scenarios?
For an in-depth discussion of these questions — and more just like them — keep an eye out for part two of this CDC blog series.
In the meantime, we have more articles on system modernization ready for you:
Read Christoph Havlicek’s overview on Monolith Migration Strategies or dive deeper into Event-Driven Architectures with David Leitner’s take on alternatives to the widely used Outbox Pattern.