Modernising data convergence in AWS through AWS DMS replication and Data Lake architecture

Data Convergence refers to the process of consolidating disparate data sources (OLTP databases, ERPs, CRMs, logs) into a single, cohesive repository to enable unified analytics, machine learning, and reporting.


Building a Data Convergence pipeline using AWS Database Migration Service (DMS) and an S3-based Data Lake architecture is an industry-standard pattern. It allows you to move data from legacy transactional databases into a highly scalable storage layer with minimal impact on your production workloads.


Data convergence is about bringing data from multiple operational systems into a single, consistent platform where it can be:


  • Centralised in one place (your data lake on Amazon S3)
  • Kept in sync with source changes (using CDC from AWS DMS)
  • Standardised and governed (schemas, quality, security)
  • Served to many consumers (BI, ML, real‑time analytics)


In AWS, the usual pattern is:

Operational databases → AWS DMS → S3 data lake → Processing (Glue/EMR) → Lakehouse/warehouse (Athena, Redshift, Iceberg/Delta)

Components


1. Source Systems


Multiple heterogeneous databases such as:

  • Oracle
  • SQL Server
  • MySQL
  • PostgreSQL
  • SAP databases (for example, SAP ASE)
  • MongoDB (supported versions)

These contain transactional business data.


2. AWS DMS (Database Migration Service)


AWS DMS performs:


  • Initial full load
  • Continuous Change Data Capture (CDC)
  • Schema conversion for heterogeneous migrations (using the AWS Schema Conversion Tool, SCT, where required)


Output targets include:

  • Amazon S3
  • Amazon Redshift
  • Amazon RDS
  • Amazon Aurora
  • Apache Kafka
  • Amazon Kinesis


For a data lake, Amazon S3 is the most common destination.


3. Raw Data Lake (Landing Zone)


Data is stored exactly as received.


Example structure:


s3://company-data-lake/raw/
  oracle/
    customer/
    orders/
  sqlserver/
    employee/
  mysql/
    sales/


Typical formats:

  • CSV
  • JSON
  • Parquet

Partitioning example:


year=2026/
  month=07/
    day=01/
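
A minimal sketch of how the raw-zone folder layout and the date partitions above might combine into a single object prefix. The function name raw_prefix and its arguments are illustrative assumptions, not part of any AWS API.

```python
from datetime import date


def raw_prefix(bucket: str, source: str, table: str, day: date) -> str:
    """Build a Hive-style partitioned prefix for one table and one day."""
    return (
        f"s3://{bucket}/raw/{source}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


# Example: the Oracle "orders" table for 1 July 2026
print(raw_prefix("company-data-lake", "oracle", "orders", date(2026, 7, 1)))
```

Keeping the source system and table name ahead of the date parts means each table can be crawled and queried as its own partitioned dataset.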


4. AWS Glue Catalog


AWS Glue Crawlers automatically discover:


  • Tables
  • Schemas
  • Partitions


The Glue Data Catalog acts as the metadata layer.
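
As a sketch, a crawler over the raw zone could be registered with the boto3 Glue client roughly as follows. The crawler name, database name, and IAM role ARN are placeholders assumed for illustration; the request is built separately so it can be inspected without calling AWS.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_crawler(request: dict) -> None:
    """Send the request to AWS Glue (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the sketch loads without the SDK

    boto3.client("glue").create_crawler(**request)


# All values below are hypothetical placeholders.
request = crawler_request(
    name="raw-oracle-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="raw_oracle",
    s3_path="s3://company-data-lake/raw/oracle/",
)
print(request["Targets"])
```

Calling create_crawler(request) only registers the crawler; it still has to be started or scheduled before tables appear in the catalog.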


5. Data Transformation


AWS Glue ETL jobs perform:


  • Data cleansing
  • Standardisation
  • Data quality checks
  • Deduplication
  • Joining datasets
  • Data enrichment


Example:


Customer (Oracle)
+
Sales (SQL Server)
+
CRM (PostgreSQL)
↓
Unified Customer Table
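
The merge above can be sketched in plain Python. In practice this would more likely be a Spark join inside a Glue job; the shared key customer_id, the field names, and the "first source wins" rule are all assumptions made for illustration.

```python
from collections import defaultdict


def unify_customers(*sources):
    """Merge per-source rows into one record per customer_id.

    Sources are processed in order; a later source only fills in
    fields that earlier sources did not provide.
    """
    unified = defaultdict(dict)
    for rows in sources:
        for row in rows:
            record = unified[row["customer_id"]]
            for field, value in row.items():
                record.setdefault(field, value)
    return dict(unified)


# Hypothetical sample rows from the three systems
oracle_customers = [{"customer_id": 1, "name": "Asha Rao"}]
sqlserver_sales = [{"customer_id": 1, "lifetime_sales": 1250.0}]
postgres_crm = [{"customer_id": 1, "name": "A. Rao", "segment": "retail"}]

unified = unify_customers(oracle_customers, sqlserver_sales, postgres_crm)
print(unified[1])
```

Ordering the sources by trust level is one simple way to resolve conflicts such as the two spellings of the customer name; a real pipeline may need explicit survivorship rules per field.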



6. Curated Data Lake


This zone stores optimised, analytics-ready datasets.


Preferred file and table formats:

  • Apache Parquet
  • Apache Iceberg
  • Delta Lake (if using compatible tools)


Benefits:

  • Columnar storage
  • Compression
  • Faster queries
  • Lower storage cost

Example:


s3://company-data-lake/curated/
  customers/
  sales/
  finance/
  inventory/


7. Analytics Layer


Data can be queried using:


  • Amazon Athena
  • Amazon Redshift Spectrum
  • Amazon EMR
  • Apache Spark
  • Amazon SageMaker


Visualisation:

  • Amazon QuickSight
  • Tableau
  • Power BI



Setting Up AWS DMS for Convergence


AWS DMS accomplishes data convergence in two distinct phases: Full Load, which migrates the existing snapshot, and Change Data Capture (CDC), which replicates ongoing data modifications in near real time.


Step 1: Network & Prerequisites


  • Ensure the DMS Replication Instance resides in a Virtual Private Cloud (VPC) with security groups configured to access your source databases.


  • Enable logical replication or binary logging on your source system (e.g., wal_level = logical for PostgreSQL, or binary logging for MySQL) so DMS can read changes from the transaction logs instead of repeatedly querying the tables, which keeps the load on the source low.


Step 2: Configure DMS Endpoints


  • Source Endpoint: Connects to your production database via credentials stored securely in AWS Secrets Manager.


  • Target Endpoint: Points to your Amazon S3 raw-zone (Bronze) bucket.
Best Practice Endpoint Settings: Set the target format to Apache Parquet. Parquet is a columnar storage format that compresses well (storage savings of around 70% are commonly quoted, though the real figure depends on the data) and significantly accelerates analytical queries compared to CSV.
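
A hedged sketch of what the S3 target endpoint settings might look like when created through boto3. The bucket folder, role ARN, endpoint identifier, and timestamp column name are placeholders assumed for illustration; only the Parquet choice comes from the text above.

```python
def s3_target_settings(bucket: str, folder: str, role_arn: str) -> dict:
    """Build an S3Settings dict for a DMS target endpoint that writes Parquet."""
    return {
        "BucketName": bucket,
        "BucketFolder": folder,
        "ServiceAccessRoleArn": role_arn,
        "DataFormat": "parquet",
        # Adds a commit-timestamp column to each row; the name is an assumption.
        "TimestampColumnName": "dms_commit_ts",
    }


def create_target_endpoint(identifier: str, settings: dict) -> None:
    """Create the endpoint in AWS DMS (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the sketch loads without the SDK

    boto3.client("dms").create_endpoint(
        EndpointIdentifier=identifier,
        EndpointType="target",
        EngineName="s3",
        S3Settings=settings,
    )


# Placeholder values for illustration only.
settings = s3_target_settings(
    bucket="company-data-lake",
    folder="raw/oracle",
    role_arn="arn:aws:iam::123456789012:role/DmsS3AccessRole",
)
print(settings["DataFormat"])
```

The role referenced by ServiceAccessRoleArn must allow DMS to write to the bucket; the exact policy is outside the scope of this sketch.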

Step 3: Run Full Load + CDC Tasks


Configure a DMS Migration Task with the option "Migrate existing data and replicate ongoing changes".


  • Full Load: DMS writes the existing database records to S3, organised into one folder per schema and table.


  • CDC: DMS continuously tails the transaction logs. When an INSERT, UPDATE, or DELETE happens at the source, DMS writes the change to a new file in S3. Each row carries the data payload plus an Op column that flags the operation ('I', 'U', or 'D').
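
To make the Op flag concrete, here is a minimal sketch of how a downstream job might replay change rows onto the full-load snapshot. The primary-key column id and the sample rows are assumptions; a production job would do this with Spark or a table format's MERGE rather than in-memory dicts.

```python
def apply_changes(snapshot: dict, changes: list, key: str = "id") -> dict:
    """Replay DMS-style change rows, in order, onto a snapshot keyed by primary key."""
    state = dict(snapshot)
    for change in changes:
        # Separate the payload from the operation flag.
        row = {k: v for k, v in change.items() if k != "Op"}
        if change["Op"] in ("I", "U"):
            state[row[key]] = row        # insert or overwrite the row
        elif change["Op"] == "D":
            state.pop(row[key], None)    # delete; ignore if already absent
    return state


# Hypothetical full-load snapshot and CDC rows
snapshot = {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "new"}}
changes = [
    {"Op": "U", "id": 1, "status": "shipped"},
    {"Op": "D", "id": 2, "status": "new"},
    {"Op": "I", "id": 3, "status": "new"},
]

current = apply_changes(snapshot, changes)
print(current)
```

The changes must be applied in commit order, which is why a commit timestamp or sequence column is useful alongside the Op flag.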




Production Best Practices


  • DMS Serverless: Consider using DMS Serverless for replication tasks; it scales DMS capacity units (DCUs) automatically based on transaction volume, reducing idle infrastructure costs.


  • File Size Optimisation: Tune the CdcMaxBatchInterval and CdcMinFileSize settings on your DMS S3 target. This prevents the "small file problem" (thousands of tiny S3 objects that cripple query performance). Aim for files between 128 MB and 512 MB.
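
A small sketch of how those two settings might be derived from a target file size. It assumes CdcMinFileSize is expressed in kilobytes and CdcMaxBatchInterval in seconds, and it treats 1 MB as 1024 KB; the helper name and the 300-second interval are illustrative choices, not recommendations from AWS.

```python
def cdc_batch_settings(min_file_mb: int, max_interval_s: int) -> dict:
    """Return CDC batching settings for a DMS S3 target.

    Assumes CdcMinFileSize is in kilobytes and CdcMaxBatchInterval in seconds.
    """
    if min_file_mb <= 0 or max_interval_s <= 0:
        raise ValueError("both values must be positive")
    return {
        "CdcMinFileSize": min_file_mb * 1024,
        "CdcMaxBatchInterval": max_interval_s,
    }


batch = cdc_batch_settings(min_file_mb=128, max_interval_s=300)
print(batch)
```

If a file is written as soon as either threshold is reached, a short interval on a quiet table will still produce small files, so the interval deserves as much attention as the size.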


  • Partitioning: Ensure the data written by DMS is partitioned by date (e.g., year=YYYY/month=MM/day=DD/). This allows Athena to skip files outside the queried date range, improving performance and reducing query cost.
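
The pruning effect can be illustrated with a short sketch that lists only the day partitions a date-bounded query would need to read. The function name is hypothetical, and the layout is assumed to follow the year=/month=/day= pattern shown above.

```python
from datetime import date, timedelta


def partitions_in_range(start: date, end: date) -> list:
    """List the Hive-style day partitions between start and end, inclusive."""
    days = (end - start).days
    return [
        f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"
        for d in (start + timedelta(days=n) for n in range(days + 1))
    ]


# A two-day query touches two partitions, however much history exists.
wanted = partitions_in_range(date(2026, 6, 30), date(2026, 7, 1))
print(wanted)
```

Everything outside this list is never scanned, which is where the saving comes from in a service that bills by data read.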


By RAVI R May 28, 2026