Event-Driven Architecture on AWS: 5 Patterns That Actually Work

Event-driven architecture (EDA) is one of those terms that sounds simple until you try to implement it. The theory is clean: services emit events, other services react. In practice, you face message ordering, duplicate delivery, poison messages, and observability gaps that make debugging a nightmare.

After 15+ years building event-driven systems on AWS — including high-throughput payment pipelines — here are five patterns that consistently work in production.

1. Fan-Out with SNS + SQS

When to use: One event needs to trigger multiple independent consumers.

Producer → SNS Topic → SQS Queue A → Consumer A
                     → SQS Queue B → Consumer B
                     → SQS Queue C → Consumer C

Each consumer gets its own queue, so they process independently. If Consumer B is slow or failing, it doesn’t block A or C. The SQS queues provide buffering and retry.
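One detail the diagram hides is that each queue needs an access policy allowing the topic to deliver to it. The sketch below builds that policy and the per-queue subscription arguments as plain data. The ARNs and helper names are illustrative assumptions, and raw message delivery is an optional choice rather than a requirement.

```python
import json


def sns_to_sqs_policy(queue_arn: str, topic_arn: str) -> str:
    """Return a queue access policy that lets one SNS topic send to one queue."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Restrict delivery to this topic only.
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }
    return json.dumps(policy)


def fan_out_subscriptions(topic_arn: str, queue_arns: list) -> list:
    """Build one set of sns.subscribe(**kwargs) arguments per consumer queue."""
    return [
        {
            "TopicArn": topic_arn,
            "Protocol": "sqs",
            "Endpoint": queue_arn,
            # Deliver the published body as-is instead of the SNS JSON envelope.
            "Attributes": {"RawMessageDelivery": "true"},
        }
        for queue_arn in queue_arns
    ]
```

The policy string would go into the queue's Policy attribute, and each dict into a boto3 SNS subscribe call.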

Key trade-off: The SNS → SQS hop is cheap, but it adds latency (roughly 50-100ms). If you need sub-millisecond delivery, this isn’t the pattern.

Production tip: Always attach a dead-letter queue (DLQ) to each SQS queue. Set maxReceiveCount to 3-5. Monitor DLQ depth with a CloudWatch alarm — if messages land there, something is wrong.
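As a rough sketch of that tip, the helpers below build the RedrivePolicy queue attribute and the arguments for a CloudWatch alarm on DLQ depth. Queue names, ARNs, and the alarm naming scheme are assumptions for illustration. The dicts are meant to be passed to boto3's set_queue_attributes and put_metric_alarm.

```python
import json


def dlq_queue_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Attributes for the source queue: move a message to the DLQ after N receives."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return {
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        )
    }


def dlq_depth_alarm(dlq_name: str, alarm_topic_arn: str) -> dict:
    """Arguments for cloudwatch.put_metric_alarm: fire when the DLQ is not empty."""
    return {
        "AlarmName": dlq_name + "-not-empty",  # assumed naming convention
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }
```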

2. Event Bus with EventBridge

When to use: You need content-based routing — different consumers react to different event types or field values.

Producer → EventBridge → Rule A (order.created, amount > 1000) → Lambda
                       → Rule B (order.created, region = EU) → SQS
                       → Rule C (order.cancelled) → Step Functions

EventBridge rules are essentially filters. You define which events match, and EventBridge routes them. No consumer code needed for routing logic.
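To make the three rules concrete, here is one way their event patterns might look. This is a sketch that assumes the event type travels in detail-type and that amount and region are fields of the event detail. The bus name, source name, and rule names are made up for the example.

```python
import json

# Event patterns for the three rules in the diagram (names are illustrative).
RULE_PATTERNS = {
    "rule-a-large-orders": {
        "detail-type": ["order.created"],
        "detail": {"amount": [{"numeric": [">", 1000]}]},
    },
    "rule-b-eu-orders": {
        "detail-type": ["order.created"],
        "detail": {"region": ["EU"]},
    },
    "rule-c-cancellations": {
        "detail-type": ["order.cancelled"],
    },
}


def put_rule_kwargs(name: str, bus_name: str = "orders-bus") -> dict:
    """Arguments for events.put_rule; the pattern must be a JSON string."""
    return {
        "Name": name,
        "EventBusName": bus_name,
        "EventPattern": json.dumps(RULE_PATTERNS[name]),
        "State": "ENABLED",
    }


def order_created_entry(order_id: str, amount: float, region: str,
                        bus_name: str = "orders-bus") -> dict:
    """One entry for events.put_events(Entries=[...])."""
    return {
        "Source": "orders.service",  # assumed source name
        "DetailType": "order.created",
        "Detail": json.dumps({"orderId": order_id, "amount": amount, "region": region}),
        "EventBusName": bus_name,
    }
```

The producer only publishes order_created_entry(...). Which targets receive the event is decided entirely by the patterns.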

Key trade-off: EventBridge has a 256KB event size limit and doesn’t guarantee ordering. For large payloads, put the data in S3 and send a reference event.
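The S3 reference approach can be sketched as a small helper that decides between an inline payload and a pointer. The upload callable, the field names, and the 8KB headroom are assumptions for illustration. In a real system, upload would wrap an S3 put_object call and return the object's location.

```python
import json
from typing import Callable

EVENT_LIMIT_BYTES = 256 * 1024  # the limit quoted above
HEADROOM_BYTES = 8 * 1024       # assumed margin for the rest of the event entry


def build_detail(payload: dict, upload: Callable[[bytes], str]) -> dict:
    """Return the event detail: the payload itself, or a reference if it is too big."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= EVENT_LIMIT_BYTES - HEADROOM_BYTES:
        return {"payload": payload}
    # Too large for the bus: store the body elsewhere and send a pointer.
    return {"payloadRef": upload(body), "payloadBytes": len(body)}
```

Consumers then check for payloadRef and fetch the object only when they actually need the full data.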

Production tip: Use EventBridge schema discovery in development to auto-generate event schemas. In production, define schemas explicitly in your infrastructure-as-code.

3. Change Data Capture with DynamoDB Streams

When to use: You need downstream systems to react to database changes — without modifying the write path.

Application → DynamoDB Table → DynamoDB Stream → Lambda → SNS/SQS/EventBridge

DynamoDB Streams capture every insert, update, and delete as a time-ordered sequence. Your Lambda receives the old and new image of the item, so you can compute exactly what changed.

Key trade-off: Streams are per-table. If you have 50 tables, you need 50 stream processors. This is fine for a few critical tables, but doesn’t scale as a general-purpose event bus.

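Production tip: Use FilterCriteria on the Lambda event source mapping to reduce invocations. If you only care about status changes, filter for eventName = 'MODIFY', then compare the old and new images in the handler to confirm that the status field actually changed. A filter pattern cannot compare one field of a record against another, and the comparison requires the stream to use the NEW_AND_OLD_IMAGES view type.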
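A minimal sketch of that tip: a filter that admits only MODIFY records, plus a handler-side helper that compares the old and new images. It assumes status is a string attribute and that the stream includes both images. The helper name and return shape are illustrative.

```python
import json

# Passed as FilterCriteria to lambda.create_event_source_mapping.
STATUS_FILTER = {"Filters": [{"Pattern": json.dumps({"eventName": ["MODIFY"]})}]}


def status_changes(event: dict) -> list:
    """Return one entry per record whose 'status' attribute changed."""
    changes = []
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue  # defensive: the filter should already drop these
        data = record["dynamodb"]
        # Images use DynamoDB's typed JSON, e.g. {"status": {"S": "PAID"}}.
        old = data.get("OldImage", {}).get("status", {}).get("S")
        new = data.get("NewImage", {}).get("status", {}).get("S")
        if old != new:
            changes.append({"keys": data["Keys"], "old": old, "new": new})
    return changes
```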

4. Saga Orchestration with Step Functions

When to use: A business process spans multiple services and needs compensating transactions if any step fails.

Step Functions:
  1. Reserve inventory → success
  2. Charge payment → success
  3. Ship order → FAILURE
  4. Compensate: refund payment
  5. Compensate: release inventory

Step Functions give you a visual state machine, automatic retry with backoff, and built-in error handling. Each step calls a Lambda or directly integrates with AWS services.
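The order flow above could be expressed in Amazon States Language roughly as follows. This is a sketch, not a complete workflow: the Lambda function names and ARN prefix are hypothetical, the retry settings are arbitrary, and the compensation steps themselves have no error handling.

```python
import json

# Hypothetical Lambda ARN prefix; substitute your own functions.
LAMBDA = "arn:aws:lambda:us-east-1:123456789012:function:"


def task(function_name, next_state=None, on_error=None):
    """Build a Task state with retry/backoff and an optional Catch route."""
    state = {
        "Type": "Task",
        "Resource": LAMBDA + function_name,
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
    }
    if on_error:
        # Once retries are exhausted, jump to the compensation path.
        state["Catch"] = [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": on_error}
        ]
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


SAGA = {
    "StartAt": "ReserveInventory",
    "States": {
        "ReserveInventory": task("reserve-inventory", "ChargePayment", on_error="OrderFailed"),
        "ChargePayment": task("charge-payment", "ShipOrder", on_error="ReleaseInventory"),
        "ShipOrder": task("ship-order", on_error="RefundPayment"),
        # Compensations run in reverse order of the steps they undo.
        "RefundPayment": task("refund-payment", "ReleaseInventory"),
        "ReleaseInventory": task("release-inventory", "OrderFailed"),
        "OrderFailed": {"Type": "Fail", "Error": "OrderFailed", "Cause": "Saga rolled back"},
    },
}

definition = json.dumps(SAGA, indent=2)  # pass as `definition` to create_state_machine
```

Each failure enters the compensation chain at the point that undoes only the steps that already succeeded.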

Key trade-off: Step Functions add latency (each state transition takes ~100ms). For hot-path, high-throughput workflows, consider choreography (pattern 5) instead.

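Production tip: Use Express Workflows for high-volume, short-duration flows (they are capped at 5 minutes). Use Standard Workflows for long-running processes that need durability. Asynchronous Express executions run at least once, so their steps must be idempotent.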

5. Choreography with SQS and Idempotency

When to use: Services need to collaborate without a central orchestrator, and you want maximum autonomy.

Order Service → "order.created" → SQS → Inventory Service
                                       → Payment Service
                                       → Notification Service

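Each service reacts independently. There’s no coordinator. Note that a single SQS queue delivers each message to only one consumer, so in practice each service reads from its own queue, fed by an SNS topic or an EventBridge rule as in patterns 1 and 2. This gives you maximum decoupling but requires careful design to handle failures.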

Key trade-off: Debugging is harder because there’s no single place that shows the full flow. You must invest in correlation IDs and distributed tracing.
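One common way to carry correlation IDs is a small event envelope that every service copies forward. The field names below (eventId, correlationId, causationId) are a convention assumed for this sketch, not something AWS prescribes.

```python
import uuid
from typing import Optional


def new_event(event_type: str, detail: dict, caused_by: Optional[dict] = None) -> dict:
    """Wrap a payload in an envelope that keeps the whole flow traceable."""
    event_id = str(uuid.uuid4())
    return {
        "eventId": event_id,
        "eventType": event_type,
        # The first event of a flow starts the correlation; later events inherit it.
        "correlationId": caused_by["correlationId"] if caused_by else event_id,
        # Points at the event that directly triggered this one.
        "causationId": caused_by["eventId"] if caused_by else None,
        "detail": detail,
    }


created = new_event("order.created", {"orderId": "o-1"})
reserved = new_event("inventory.reserved", {"orderId": "o-1"}, caused_by=created)
```

Logging correlationId on every line lets you reconstruct a full flow with a single log query, even though no service ever sees the whole thing.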

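Production tip: Every consumer must be idempotent. Use a DynamoDB table to track processed message IDs with a TTL. Before processing, record the message ID with a conditional write that fails if the ID already exists; a separate read followed by a write would race between concurrent consumers. This handles SQS’s at-least-once delivery guarantee.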

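import time

import boto3
from botocore.exceptions import ClientError

# Assumed table name; partition key is 'messageId' and TTL is enabled on 'ttl'.
table = boto3.resource('dynamodb').Table('processed-messages')

def handle_message(record):
    """Handle one SQS record (an entry of event['Records'] in a Lambda handler)."""
    message_id = record['messageId']

    # Idempotency check: the conditional put succeeds only for an unseen ID.
    try:
        table.put_item(
            Item={'messageId': message_id, 'ttl': int(time.time()) + 86400},
            ConditionExpression='attribute_not_exists(messageId)'
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return  # Already processed
        raise

    # Process the message (process() is your business logic).
    # On failure, release the ID so the SQS retry is not skipped as a duplicate.
    try:
        process(record)
    except Exception:
        table.delete_item(Key={'messageId': message_id})
        raise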

Choosing the Right Pattern

Pattern	Latency	Ordering	Complexity	Best For
SNS + SQS Fan-Out	Low	No	Low	Multi-consumer broadcast
EventBridge	Low	No	Medium	Content-based routing
DynamoDB Streams	Very low	Yes (per-partition)	Low	Change data capture
Step Functions	Medium	Yes (within workflow)	Medium	Multi-step transactions
Choreography	Low	No	High	Maximum service autonomy

In practice, most systems use a combination. A payment service might use DynamoDB Streams for audit logging, SNS fan-out for notifications, and Step Functions for the checkout saga — all in the same system.

The key is to choose based on your actual requirements, not on what looks elegant in a diagram.
