Event-Driven Architecture on AWS: 5 Patterns That Actually Work
Event-driven architecture (EDA) is one of those terms that sounds simple until you try to implement it. The theory is clean: services emit events, other services react. In practice, you face message ordering, duplicate delivery, poison messages, and observability gaps that make debugging a nightmare.
After 15+ years building event-driven systems on AWS — including high-throughput payment pipelines — here are five patterns that consistently work in production.
1. Fan-Out with SNS + SQS
When to use: One event needs to trigger multiple independent consumers.
Producer → SNS Topic → SQS Queue A → Consumer A
                     → SQS Queue B → Consumer B
                     → SQS Queue C → Consumer C
Each consumer gets its own queue, so they process independently. If Consumer B is slow or failing, it doesn’t block A or C. The SQS queues provide buffering and retry.
Key trade-off: The extra SNS → SQS hop is cheap, but it adds latency (roughly 50-100 ms). If you need sub-millisecond delivery, this isn’t the pattern.
Production tip: Always attach a dead-letter queue (DLQ) to each SQS queue. Set maxReceiveCount to 3-5. Monitor DLQ depth with a CloudWatch alarm — if messages land there, something is wrong.
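The DLQ tip above comes down to two small pieces of configuration. Here is a minimal sketch, assuming boto3-style parameter dictionaries: the function names, queue name, and ARN are placeholders for illustration, and the alarm settings (one-minute period, threshold of zero) are one reasonable choice rather than a recommendation from AWS.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=3):
    """Queue attributes that attach a DLQ to a source queue.

    Shaped for the Attributes argument of SQS set_queue_attributes.
    """
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects the redrive policy as a JSON string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


def dlq_depth_alarm(dlq_name):
    """Keyword arguments for CloudWatch put_metric_alarm: fire on any DLQ message."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

With boto3 available, these would be passed as `sqs.set_queue_attributes(QueueUrl=..., Attributes=redrive_attributes(...))` and `cloudwatch.put_metric_alarm(**dlq_depth_alarm(...))`.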
2. Event Bus with EventBridge
When to use: You need content-based routing — different consumers react to different event types or field values.
Producer → EventBridge → Rule A (order.created, amount > 1000) → Lambda
                       → Rule B (order.created, region = EU) → SQS
                       → Rule C (order.cancelled) → Step Functions
EventBridge rules are essentially filters. You define which events match, and EventBridge routes them. No consumer code needed for routing logic.
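To make the filter idea concrete, here is a sketch of the three rules from the diagram written as EventBridge event patterns. It assumes the event name travels in `detail-type` and the order fields live under `detail`; that event shape, the rule names, and the bus name are illustrative assumptions, not something the diagram specifies.

```python
import json

# Event patterns for the three rules in the diagram (assumed event shape).
RULE_PATTERNS = {
    "rule-a-large-orders": {
        "detail-type": ["order.created"],
        # Numeric matching: amount strictly greater than 1000.
        "detail": {"amount": [{"numeric": [">", 1000]}]},
    },
    "rule-b-eu-orders": {
        "detail-type": ["order.created"],
        "detail": {"region": ["EU"]},
    },
    "rule-c-cancellations": {
        "detail-type": ["order.cancelled"],
    },
}


def put_rule_params(name, pattern, bus_name="orders-bus"):
    """Keyword arguments for EventBridge put_rule (pattern must be a JSON string)."""
    return {
        "Name": name,
        "EventBusName": bus_name,  # hypothetical bus name
        "EventPattern": json.dumps(pattern),
    }
```

An event that matches both Rule A and Rule B (a large EU order) is delivered to both targets, since each rule is evaluated independently.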
Key trade-off: EventBridge has a 256KB event size limit and doesn’t guarantee ordering. For large payloads, put the data in S3 and send a reference event.
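The "put the data in S3 and send a reference" approach can be sketched as a small helper that decides between inlining and referencing. This is a sketch under assumptions: the 200 KB threshold is an arbitrary safety margin below the limit, and the `payloadRef` field, source name, bucket, and key are made up for illustration. The helper performs no AWS calls; the caller would upload the returned body to S3 before publishing the entry.

```python
import json

# Assumed safety margin: the size limit applies to the whole entry, not just the detail.
MAX_INLINE_BYTES = 200 * 1024


def build_entry(detail_type, detail, bucket, key, source="orders.service"):
    """Return (put_events_entry, s3_body).

    s3_body is None when the payload is small enough to travel inline;
    otherwise it holds the bytes to upload to s3://bucket/key.
    """
    body = json.dumps(detail).encode("utf-8")
    if len(body) <= MAX_INLINE_BYTES:
        inline, s3_body = detail, None
    else:
        # Reference event: consumers fetch the full payload from S3.
        inline, s3_body = {"payloadRef": {"bucket": bucket, "key": key}}, body
    entry = {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(inline),
    }
    return entry, s3_body
```

Upload first, publish second: if the event goes out before the object exists, a fast consumer can hit a missing key.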
Production tip: Use EventBridge schema discovery in development to auto-generate event schemas. In production, define schemas explicitly in your infrastructure-as-code.
3. Change Data Capture with DynamoDB Streams
When to use: You need downstream systems to react to database changes — without modifying the write path.
Application → DynamoDB Table → DynamoDB Stream → Lambda → SNS/SQS/EventBridge
DynamoDB Streams capture every insert, update, and delete as a time-ordered sequence of changes for each item. With the stream view type set to NEW_AND_OLD_IMAGES, your Lambda receives the old and new image of the item, so you can compute exactly what changed.
Key trade-off: Streams are per-table. If you have 50 tables, you need 50 stream processors. This is fine for a few critical tables, but doesn’t scale as a general-purpose event bus.
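Production tip: Use FilterCriteria on the Lambda event source mapping to reduce invocations. If you only care about status changes, filter for eventName = 'MODIFY', then compare the old and new images in the handler to confirm the status field actually changed. A filter pattern can match field values, but it cannot compare two fields with each other.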
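Here is a minimal sketch of that split: the filter keeps only MODIFY records, and the handler compares the two images. It assumes a string attribute named `status` and a stream that carries both old and new images; `status_changes` is a hypothetical helper name.

```python
import json

# FilterCriteria for the event source mapping: only MODIFY records reach the function.
FILTER_CRITERIA = {
    "Filters": [{"Pattern": json.dumps({"eventName": ["MODIFY"]})}]
}


def status_changes(event):
    """Return one entry per stream record whose 'status' attribute changed."""
    changes = []
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue  # defensive: the filter should already have dropped these
        images = record["dynamodb"]
        # Stream images use DynamoDB's typed JSON, e.g. {"status": {"S": "PAID"}}.
        old = images.get("OldImage", {}).get("status", {}).get("S")
        new = images.get("NewImage", {}).get("status", {}).get("S")
        if old != new:
            changes.append({"keys": images["Keys"], "old": old, "new": new})
    return changes
```

The handler would then publish each change to SNS, SQS, or EventBridge, as in the flow above.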
4. Saga Orchestration with Step Functions
When to use: A business process spans multiple services and needs compensating transactions if any step fails.
Step Functions:
1. Reserve inventory → success
2. Charge payment → success
3. Ship order → FAILURE
4. Compensate: refund payment
5. Compensate: release inventory
Step Functions give you a visual state machine, automatic retry with backoff, and built-in error handling. Each step calls a Lambda or directly integrates with AWS services.
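The order flow above can be sketched as an Amazon States Language definition, built here as a Python dictionary. Treat it as an illustration under assumptions: the Lambda ARNs, state names, and retry numbers are placeholders, and a production saga would also need to decide what happens when a compensation step itself fails.

```python
import json

ARN = "arn:aws:lambda:eu-west-1:123456789012:function:"  # placeholder account and region


def task(resource, next_state=None, catch_to=None):
    """Build a Task state with retry/backoff and an optional compensation route."""
    state = {
        "Type": "Task",
        "Resource": resource,
        # Retried first; the Catch below only applies once retries are exhausted.
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
    }
    if catch_to:
        state["Catch"] = [{
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",  # keep the original input next to the error
            "Next": catch_to,
        }]
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


SAGA = {
    "StartAt": "ReserveInventory",
    "States": {
        "ReserveInventory": task(ARN + "reserve-inventory", "ChargePayment", catch_to="OrderFailed"),
        "ChargePayment": task(ARN + "charge-payment", "ShipOrder", catch_to="ReleaseInventory"),
        "ShipOrder": task(ARN + "ship-order", catch_to="RefundPayment"),
        # Compensations run in reverse order of the steps that succeeded.
        "RefundPayment": task(ARN + "refund-payment", "ReleaseInventory"),
        "ReleaseInventory": task(ARN + "release-inventory", "OrderFailed"),
        "OrderFailed": {"Type": "Fail", "Error": "OrderFailed", "Cause": "Saga compensated"},
    },
}

DEFINITION = json.dumps(SAGA, indent=2)  # string form expected by create_state_machine
```

Each failure routes to the compensation for the last step that succeeded: a shipping failure triggers the refund and then the inventory release, while a payment failure only releases inventory.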
Key trade-off: Step Functions add latency (each state transition takes ~100ms). For hot-path, high-throughput workflows, consider choreography (pattern 5) instead.
Production tip: Use Express Workflows for high-volume, short-duration flows (< 5 minutes). Use Standard Workflows for long-running processes that need durability.
5. Choreography with SQS and Idempotency
When to use: Services need to collaborate without a central orchestrator, and you want maximum autonomy.
Order Service → "order.created" → SQS → Inventory Service
                                      → Payment Service
                                      → Notification Service
Each service reads from its own queue and reacts independently. There’s no coordinator. This gives you maximum decoupling but requires careful design to handle failures.
Key trade-off: Debugging is harder because there’s no single place that shows the full flow. You must invest in correlation IDs and distributed tracing.
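One way to carry a correlation ID through a choreographed flow is an SQS message attribute that every service copies from the message it received onto the messages it sends. A minimal sketch, assuming the attribute is named `correlationId` (a naming choice made here, not an AWS convention); `outgoing_attributes` is a hypothetical helper.

```python
import uuid


def outgoing_attributes(incoming_attributes=None):
    """Build MessageAttributes for send_message, reusing the incoming correlation ID.

    Accepts attributes as returned by receive_message ("StringValue") or as
    delivered in a Lambda SQS event record ("stringValue").
    """
    incoming = incoming_attributes or {}
    attr = incoming.get("correlationId", {})
    existing = attr.get("StringValue") or attr.get("stringValue")
    # First service in the chain: no ID yet, so mint one.
    correlation_id = existing or str(uuid.uuid4())
    return {
        "correlationId": {"DataType": "String", "StringValue": correlation_id}
    }
```

Logging the same ID in every service gives you something to search on when reconstructing a flow after the fact.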
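Production tip: Every consumer must be idempotent. Use a DynamoDB table to track processed message IDs with a TTL. Before processing, record the message ID with a conditional write that fails if the ID already exists; a separate read followed by a write would let two concurrent deliveries both pass the check. This handles SQS’s at-least-once delivery.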
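import time

import boto3
from botocore.exceptions import ClientError

# Table name is a placeholder; enable TTL on the 'ttl' attribute.
table = boto3.resource('dynamodb').Table('processed-messages')

def handle_message(event):
    # 'event' is a single SQS record; process() is your own business logic.
    message_id = event['messageId']
    # Idempotency check: the conditional write succeeds only on first delivery
    try:
        table.put_item(
            Item={'messageId': message_id, 'ttl': int(time.time()) + 86400},
            ConditionExpression='attribute_not_exists(messageId)'
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return  # Already processed
        raise
    # Process the message; on failure, remove the marker so a retry is not skipped
    try:
        process(event)
    except Exception:
        table.delete_item(Key={'messageId': message_id})
        raise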
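Since `handle_message` works on one message at a time, something has to call it for each record in a batch. The sketch below is one way to do that in a Lambda function, assuming the SQS event source mapping has `ReportBatchItemFailures` enabled so that only the failed messages return to the queue; `make_lambda_handler` is a hypothetical wrapper, and the per-message function is passed in so the example stays self-contained.

```python
import logging

logger = logging.getLogger(__name__)


def make_lambda_handler(handle_message):
    """Wrap a per-message function into a Lambda handler for SQS batches."""

    def lambda_handler(event, context):
        failures = []
        for record in event.get("Records", []):
            try:
                handle_message(record)
            except Exception:
                logger.exception("Failed to process message %s", record["messageId"])
                # Only this message becomes visible again; the rest are deleted.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return lambda_handler
```

Without partial batch responses, one failing message sends the whole batch back to the queue, which is exactly the duplicate delivery the idempotency check is there to absorb.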
Choosing the Right Pattern
| Pattern | Latency | Ordering | Complexity | Best For |
|---|---|---|---|---|
| SNS + SQS Fan-Out | Low | No | Low | Multi-consumer broadcast |
| EventBridge | Low | No | Medium | Content-based routing |
| DynamoDB Streams | Very low | Yes (per-partition) | Low | Change data capture |
| Step Functions | Medium | Yes (within workflow) | Medium | Multi-step transactions |
| Choreography | Low | No | High | Maximum service autonomy |
In practice, most systems use a combination. A payment service might use DynamoDB Streams for audit logging, SNS fan-out for notifications, and Step Functions for the checkout saga — all in the same system.
The key is to choose based on your actual requirements, not on what looks elegant in a diagram.