AWS High Availability and Disaster Recovery
Business Continuity (BC)
Focus on minimizing business disruption
Preventive and planning measures
Defines acceptable service levels
Disaster Recovery (DR)
Response to events threatening continuity
Reactive measures and procedures
Implementation of recovery plans
Fault Tolerance
System continues operating despite component faults
No user-visible disruption
Often more expensive to implement
Zero downtime objective
High Availability
Allows for minimal unplanned downtime
More cost-effective approach
Balance between reliability and cost
Accepts some potential disruption
Service Level Agreements (SLAs)
Commitments for service quality and availability
Not absolute guarantees
May include compensation provisions
Requires customer documentation for claims
Recovery Time Objective (RTO)
Time to restore business processes
Measured from disruption to recovery
Defines acceptable downtime
Key factor in BC planning
Recovery Point Objective (RPO)
Acceptable data loss measured in time
Gap between last backup and incident
Influences backup strategy
Drives data protection requirements (see the sketch below)
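A minimal sketch of how the two objectives are measured for a single incident, assuming hypothetical timestamps; every value below is illustrative, not from the source:

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for one incident (illustrative values only).
last_backup      = datetime(2024, 5, 1, 2, 0)   # most recent successful backup
outage_start     = datetime(2024, 5, 1, 9, 30)  # disruption begins
service_restored = datetime(2024, 5, 1, 11, 0)  # business process available again

# Actual RTO: elapsed time from disruption to recovery.
actual_rto = service_restored - outage_start

# Actual RPO: data written after the last backup is lost, so the exposure
# window runs from the backup to the disruption.
actual_rpo = outage_start - last_backup

# Compare against the objectives agreed in the BC plan (placeholders).
rto_objective = timedelta(hours=2)
rpo_objective = timedelta(hours=4)

print(f"RTO {actual_rto} (objective {rto_objective}): "
      f"{'met' if actual_rto <= rto_objective else 'missed'}")
print(f"RPO {actual_rpo} (objective {rpo_objective}): "
      f"{'met' if actual_rpo <= rpo_objective else 'missed'}")
```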
Case Study: AWS S3 Outage
Date: February 28, 2017
Location: AWS Northern Virginia (us-east-1)
Duration: 252 minutes
Cause: Human error during system maintenance (a mistyped command removed far more capacity than intended)
The 252-minute outage exceeded the downtime budget implied by the service's 99.99% availability goal several times over (see the calculation after this list)
Affected multiple dependent services
Disrupted AWS's own health dashboard
Demonstrated cascading failure risks
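As a rough check using only the figures above: a 99.99% availability target leaves a downtime budget of about 52.6 minutes per year, so a single 252-minute outage consumes that budget several times over.

```python
# Downtime budget implied by an availability target, assuming a 365-day year.
availability_target = 0.9999          # "four nines"
minutes_per_year = 365 * 24 * 60      # 525,600 minutes

allowed_downtime = (1 - availability_target) * minutes_per_year
outage_duration = 252                 # minutes, from the incident above

print(f"Annual downtime budget at 99.99%: {allowed_downtime:.1f} minutes")   # ~52.6
print(f"This single outage: {outage_duration} minutes "
      f"(~{outage_duration / allowed_downtime:.1f}x the yearly budget)")
```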
Lessons Learned
Human error can bypass safeguards
Dependencies create vulnerability chains
Need for cross-region resilience
Importance of proper change management
Common Failure Scenarios
Hardware Failures
Network equipment malfunctions
Storage system crashes
Physical infrastructure issues
Software Failures
Deployment errors
Configuration mistakes
Application crashes
Load-Related
DDoS attacks
Traffic spikes
Resource exhaustion
Infrastructure
Fiber cuts
Power outages
Facility issues
Data-Induced
Corruption
Type conversion errors
Invalid data processing
Credential Issues
Certificate expirations
Key rotations
Authentication failures
Capacity Issues
Instance type unavailability
Storage limitations
Network bandwidth constraints
Identifier Exhaustion
IP address depletion
Resource ID limitations
Namespace conflicts
Preventive Measures
Regular backup testing
Documentation maintenance
Staff training
Change management procedures
Multi-AZ Deployment
Cross-zone redundancy
Load balancing
Automatic failover (see the sketch below)
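As one hedged illustration of a Multi-AZ deployment (not a prescription from the source), the boto3 sketch below provisions an RDS instance with a synchronous standby in a second Availability Zone; the identifier, instance class, and credentials are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# MultiAZ=True asks RDS to keep a synchronous standby in another Availability
# Zone and to fail over to it automatically if the primary becomes unavailable.
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",        # placeholder name
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me",          # use a secrets store in practice
    MultiAZ=True,
)
```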
Cross-Region Solutions
Geographic distribution
Data replication (see the sketch below)
Traffic routing
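One possible form of cross-region data replication is S3 replication between buckets in different regions. The sketch below assumes placeholder bucket names and an existing replication IAM role; versioning must be enabled on both buckets first.

```python
import boto3

# Placeholder buckets: source in us-east-1, DR copy in us-west-2. Versioning is
# a prerequisite for replication, and each bucket is addressed through a client
# in its own region.
for region, bucket in (("us-east-1", "app-data-us-east-1"),
                       ("us-west-2", "app-data-us-west-2")):
    boto3.client("s3", region_name=region).put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate new objects from the source bucket to the bucket in the other region.
s3 = boto3.client("s3", region_name="us-east-1")
s3.put_bucket_replication(
    Bucket="app-data-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder role
        "Rules": [{
            "ID": "dr-copy",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::app-data-us-west-2"},
        }],
    },
)
```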
Dependency Management
Service decoupling
Circuit breakers
Fallback mechanisms (see the circuit-breaker sketch below)
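A minimal, library-agnostic circuit breaker with a fallback, included only to make the pattern concrete; the thresholds and state handling are illustrative, not a specific AWS feature.

```python
import time

class CircuitBreaker:
    """Trip after repeated failures; serve a fallback while open."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, answer from the fallback until the cool-down expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None           # half-open: allow one trial call
        try:
            result = fn()
            self.failures = 0               # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

A caller might wrap a request to a downstream dependency in `call()` and fall back to a cached response while the breaker is open, decoupling its own availability from that dependency.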
Monitoring and Incident Response
Real-time monitoring
Automated alerts (see the alarm sketch below)
Incident response procedures
Regular testing and drills
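A hedged boto3 sketch of one way to wire an automated alert: a CloudWatch alarm on load-balancer 5xx errors that notifies an SNS topic. The load balancer dimension, threshold, and topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alert when the load balancer sees sustained 5xx responses, a common leading
# indicator of a failing dependency or a bad deployment.
cloudwatch.put_metric_alarm(
    AlarmName="alb-5xx-spike",                               # placeholder
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/web/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder
)
```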
"Everything fails all the time" - Werner Vogels, Amazon CTO
This principle emphasizes:
Assume failure will occur
Design for resilience (see the retry sketch below)
Plan for recovery
Test failure scenarios
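One common way to act on "assume failure will occur" is to retry transient failures with capped, jittered exponential backoff; the sketch below is a generic illustration, not a method prescribed by the source.

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                       # give up after the last attempt
            # Full jitter spreads retries out so clients don't retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```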
Cost Considerations
Balance between protection and expense
Tiered recovery solutions
Risk-based investments
Cost-benefit analysis
Testing and Validation
Regular DR drills
Failure simulation (see the drill sketch below)
Recovery validation
Documentation updates
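A simple failure-simulation step in a drill might stop one application instance and verify that the service recovers within the agreed RTO. The instance ID, health endpoint, and 15-minute objective below are all placeholders.

```python
import time
import urllib.request

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Inject a failure: stop one application instance (placeholder ID), then
# verify the service recovers within the agreed recovery objective.
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])

deadline = time.time() + 15 * 60                 # placeholder 15-minute RTO
health_url = "https://app.example.com/health"    # placeholder endpoint

while time.time() < deadline:
    try:
        with urllib.request.urlopen(health_url, timeout=5) as resp:
            if resp.status == 200:
                print("Service healthy again - drill passed")
                break
    except OSError:
        pass                                     # still failing over, keep polling
    time.sleep(30)
else:
    print("Recovery objective missed - investigate and update the runbook")
```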
Reporting and Compliance
SLA monitoring
Incident documentation
Recovery metrics
Compliance validation