How CDC Checkpoints Work

AWS DMS CDC (Change Data Capture) checkpoints work as a tracking mechanism to ensure data consistency during ongoing replication. Here's how they work and how to recover from failures:

  1. Checkpoint Creation:

    • DMS periodically records its progress position in the source database's transaction logs

    • For Oracle, this is the SCN (System Change Number)

    • For PostgreSQL, this is the LSN (Log Sequence Number)

    • For MySQL/MariaDB, this is the binary log position

  2. Checkpoint Storage:

    • These checkpoints are stored internally by DMS in a special control table

    • DMS automatically manages these checkpoints behind the scenes

    • They're used to track which transactions have been processed

  3. Consistency Guarantee:

    • Checkpoints ensure that DMS doesn't miss transactions or process them twice

    • DMS operates in a "exactly-once" delivery model for transactions

    • Transactions are committed in the same order they occurred in the source

Recovering from CDC Failures

When a DMS CDC task fails, you have several recovery options:

  1. Automatic Restart:

    • Simply restart the task

    • DMS will automatically resume from the last checkpoint

    • Example command:

      aws dms start-replication-task \  --replication-task-arn your-task-arn \  --start-replication-task-type resume-processing
  2. Start from a Specific Checkpoint:

    • You can explicitly specify a CDC start position

    • Useful when you need to reprocess transactions from a particular point

    • For Oracle example:

      aws dms modify-replication-task \  --replication-task-arn your-task-arn \  --cdc-start-position "Oracle SCN:6916533"aws dms start-replication-task \  --replication-task-arn your-task-arn \  --start-replication-task-type start-replication
  3. Start from a Timestamp:

    • You can specify a point in time to resume CDC

    • DMS will identify the corresponding transaction log position

    • Example:

      aws dms modify-replication-task \  --replication-task-arn your-task-arn \  --cdc-start-time "2023-03-15T14:30:00Z"aws dms start-replication-task \  --replication-task-arn your-task-arn \  --start-replication-task-type start-replication

Common Recovery Challenges and Solutions

  1. Transaction Log Purging:

    • Problem: Source database has purged logs needed for recovery

    • Solution:

      • Perform a new full load with CDC

      • Configure longer log retention on source systems

      • Set up regular checkpoint validation

  2. Large Backlogs:

    • Problem: DMS falls far behind, with millions of changes to process

    • Solution:

      • Scale up the replication instance

      • Consider batch change processing settings

      • Monitor CDC latency metrics in CloudWatch

  3. Connectivity Issues:

    • Problem: Network problems cause intermittent failures

    • Solution:

      • Increase retry settings

      • Implement VPC endpoints for more reliable connectivity

      • Check security group and network ACL settings

  4. Data Definition Language (DDL) Changes:

    • Problem: Schema changes cause replication failures

    • Solution:

      • Configure how DMS handles DDL changes

      • Test schema changes in non-production environments first

      • Consider stopping and reconfiguring tasks for major schema changes

The key to successful CDC recovery is proper monitoring and ensuring your source database retains transaction logs long enough for recovery to be possible. For critical systems, regularly testing recovery procedures and monitoring source database log retention is essential.

Last updated

Was this helpful?