Releases: woodlee/sqlserver-cdc-to-kafka
CDC-to-Kafka 3.3.1
BUGFIX: Fixes schema and table name quoting in some SQL queries, addressing a bug that arose for tables whose name was also a SQL keyword.
CDC-to-Kafka 3.3.0
Feature additions:
- Adds the `ALWAYS_USE_AVRO_LONGS` option, which maps all SQL integer types to Avro `long` fields, easing future column type upgrades (see the sketch after this list).
- Adds the `replayer.py` script as a demonstration of using the topics produced by this tool to create a copy of a table in another SQL Server database.
- Snapshot prevention! This release adds logic to detect when a new full-table snapshot is not necessary, avoiding the associated flood of produced messages for certain kinds of schema changes. One example: the addition (and/or newly enabled CDC tracking) of a nullable column on a table, as long as that column still contains only null values.
- LSN gaps when upgrading to a new capture instance no longer require a new snapshot, as long as no new change rows were published to the prior capture instance since the most recent messages that were produced to Kafka.
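As a rough sketch of what the `ALWAYS_USE_AVRO_LONGS` option means for emitted Avro schemas (illustration only, using a hypothetical helper; this is not the project's actual type-mapping code):

```python
# Sketch only: illustrates the effect described above, not the project's real code.
# With ALWAYS_USE_AVRO_LONGS enabled, every SQL integer type is emitted as an Avro
# "long", so later widening a column (e.g. from int to bigint) does not force a
# change to the Avro field type.
def avro_int_type(sql_type: str, always_use_avro_longs: bool) -> str:
    if always_use_avro_longs:
        return "long"
    # Without the option, only bigint needs the 64-bit Avro type (assumed default mapping).
    return "long" if sql_type == "bigint" else "int"

for t in ("tinyint", "smallint", "int", "bigint"):
    print(t, avro_int_type(t, False), avro_int_type(t, True))
```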
Other:
- Exceptions raised in the `HttpPostReporter` metrics reporter are caught and logged as warnings, to reduce clutter in error-tracking tools like Sentry.
- Package upgrades, removal of some unused code, logging improvements, and refactoring to reduce the size of `main.py`.
CDC-to-Kafka 3.2.0
- Reverts to using Python v3.8 in the Dockerfile, due to process-hang issues associated with changes that were released in Python v3.9
- Implements limited in-process retrying of timed out SQL queries, instead of always relying on process crash and supervisor restart mechanics
- Makes the row batch size for DB queries configurable (but still defaults to the original 2,000)
- Logging improvements and cleanups
- Fixes a bug that could cause progress heartbeat messages to be emitted with an LSN lower than that of the last previously published progress message for a table
- Moves HTTP-based metrics reporting to a separate thread to reduce latency impact on the main process
- Introduces `progress_reset_tool.py`, which can be used to delete progress entries for specific topic(s), e.g. to trigger re-taking a snapshot without needing to create a new capture instance in the DB
- Improves logic in both the progress-topic and regular-topic validator tools
- Allows use of pseudo-failover in cases where connections to the primary server time out entirely
CDC-to-Kafka 3.1.1
Fixes a bug from the prior release that caused snapshot completion recognition to fail when low values in the table PK had been deleted since the last snapshot.
CDC-to-Kafka 3.1.0
Changes in this version
- A new execution option, `REPORT_PROGRESS_ONLY`, was added. If set, the process starts, prints a table of its current progress against followed tables, and exits without making any changes.
- Tracking of snapshot completion was improved. Previously, tables with a PK that was not monotonically increasing (such as a GUID) would, at process start, begin a new snapshot to pick up any rows with PK values lower than the lowest value seen by a prior invocation's snapshot. Since such rows also appear as inserts in the change data events, this represented unnecessary duplication and caused confusion. In addition, a table that contained no rows when it was first tracked could be snapshotted the next time the process restarted, if rows had been added in the interim; again this was unnecessary, since the added rows would be present in the change data events. Both issues have been fixed.
- Cleaned up confusing and overly verbose logging that would be printed when something changed in the tracked capture instances on the SQL Server side.
- Fixed a bug that could cause the process to incorrectly believe that there was a coverage gap in the LSNs between old and new capture instances for a given table, particularly for low-change-volume tables.
- Upgraded package dependencies and the Python version used for the Docker image.
CDC-to-Kafka 3.0.0
This release contains breaking changes.
CDC-to-Kafka 3.0.0 brings several dependency upgrades, performance improvements, and expanded SQL type support. It also improves the flexibility and schema management of "unified topics", which can contain change data messages from several different SQL tables produced in a transactionally-consistent order.
Changes:
- Upgrades the MS ODBC driver used in the Docker image. Breaking: if you are using a Docker image built from this repo's `Dockerfile`, your DB connection strings will need to change to use `DRIVER=ODBC Driver 18 for SQL Server`, and may also need to add `TrustServerCertificate=yes;` (see the connection-string sketch after this list).
- Adds support for SQL data types `money`, `smallmoney`, `datetimeoffset`, `smalldatetime`, `xml`, `rowversion`, `float`, and `real` (hopefully addressing #17).
- Breaking for users of unified topics: Previously, unified-topic messages were wrapped in a top-level object with fields `__source_table` and `__change_data`, the latter of which was encoded with a single Avro schema that was a union type of all the tracked tables' schemas. With this release, the top-level wrapping is dropped, and messages produced to unified topics are now Avro-encoded with multiple schemas, corresponding to the same per-table schemas that are used for messages in the single-table topics. This change greatly improves performance when unified topics are used, since additional re-serializations of the same change datum are no longer needed. Advances in schema management tooling (e.g. support for new subject naming strategies in the Confluent schema registry) made this a more attractive option. Breaking aspects:
  - The schemas of messages in any unified topics will change, and may now vary from message to message. Ensure consumers are prepared for this before switching.
  - Configuration params for unified topics have changed. `UNIFIED_TOPICS_PARTITION_COUNT` and `UNIFIED_TOPICS_EXTRA_CONFIG` have been dropped as top-level config parameters; instead, these options can now be specified for each unified topic separately within the expanded JSON object expected by the `UNIFIED_TOPICS` parameter (see the help string in `cdc_kafka/options.py` for details).
  - With the removal of the top-level wrapping and its `__source_table` field, consumers will now need to rely on knowledge of the Avro schema to determine what SQL table a given message corresponds to. The Avro schema `name` for message values produced by this tool follows the format `<source_table_schema_name>_<source_table_name>_cdc__value`; consumers may need to be prepared to parse this (see the parsing sketch after this list).
- ~30% maximum throughput increase (and more for those who also produce to unified topics!)
- PyPI package dependencies upgraded
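For reference, a sketch of the connection-string change described in the first item above. The server, database, and credential values are placeholders, and the prior driver version shown is an assumption; Driver 18 enables encryption by default, which is why `TrustServerCertificate=yes;` may be needed when the server's certificate is not trusted by the client.

```python
# Placeholder values throughout; only the DRIVER and TrustServerCertificate parts
# reflect the change described above.
OLD_CONN_STRING = (  # assumed prior driver version
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=my-sql-host;DATABASE=my_db;UID=cdc_user;PWD=example"
)
NEW_CONN_STRING = (
    "DRIVER=ODBC Driver 18 for SQL Server;"
    "SERVER=my-sql-host;DATABASE=my_db;UID=cdc_user;PWD=example;"
    "TrustServerCertificate=yes;"
)
```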
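And a minimal sketch, for consumers of unified topics, of recovering the source table from the Avro value-schema name format noted above. The `parse_value_schema_name` helper is hypothetical, and splitting on the first underscore assumes the SQL schema name itself contains no underscores; real consumers may prefer to match schema names against their known set of tracked tables.

```python
def parse_value_schema_name(schema_name: str) -> tuple[str, str]:
    # Format per the notes above: <source_table_schema_name>_<source_table_name>_cdc__value
    if not schema_name.endswith("_cdc__value"):
        raise ValueError(f"unexpected schema name: {schema_name}")
    base = schema_name[: -len("_cdc__value")]
    # Ambiguous if the SQL schema name contains underscores; splitting on the
    # first underscore assumes it does not.
    source_schema, _, source_table = base.partition("_")
    return source_schema, source_table

print(parse_value_schema_name("dbo_Orders_cdc__value"))  # ('dbo', 'Orders')
```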
CDC-to-Kafka 2.2.2
- Fixes a bug whereby columns deleted from the base table but still present on the capture instance would cause snapshot SQL queries to incorrectly refer to the no-longer-extant columns
- Upgrades some external dependencies
- Improves messaging when running in validation mode
- Style fixes
CDC-to-Kafka 2.2.1
Bugfix: Prevent errors when field truncation is configured for a nullable string field
CDC-to-Kafka 2.2.0
This release adds automatic creation of unified-messages topics, with strong encouragement to keep them as single-partition topics so that in-order consumption is simplified.
CDC-to-Kafka 2.1.2
Tries to better handle exceptions like:
`UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 392-393: illegal UTF-16 surrogate`
...when the process encounters data in SQL Server that is not properly UTF-16-encoded.
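For context, a small illustration of how such data produces this error. It reproduces the failure mode only; the `errors="replace"` fallback shown is just one possible mitigation, not necessarily what this tool does.

```python
# An unpaired UTF-16 surrogate (0xD800, little-endian bytes 00 D8) embedded in
# otherwise valid UTF-16-LE data cannot be decoded strictly.
bad = "ok".encode("utf-16-le") + b"\x00\xd8" + "!".encode("utf-16-le")
try:
    bad.decode("utf-16-le")
except UnicodeDecodeError as exc:
    print(exc)  # ... illegal UTF-16 surrogate
# A lossy fallback replaces the offending bytes with U+FFFD:
print(bad.decode("utf-16-le", errors="replace"))
```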