Reliability
Questions
6 best practices
- REL01-BP01: Aware of service quotas and constraints
- REL01-BP02: Manage service quotas across accounts and regions
- REL01-BP03: Accommodate fixed service quotas and constraints through architecture
- REL01-BP04: Monitor and manage quotas
- REL01-BP05: Automate quota management
- REL01-BP06: Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover
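The failover gap named in REL01-BP06 comes down to simple arithmetic, sketched below. This is a minimal illustration, not an AWS API: the `QuotaCheck` class, its field names, and the numbers are all hypothetical, and real quota values would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class QuotaCheck:
    """Hypothetical helper: compares a quota with usage during failover."""
    quota: float           # current service quota (illustrative value)
    peak_usage: float      # highest observed usage in this location
    failover_usage: float  # extra usage absorbed here if a peer location fails

    def required(self) -> float:
        # Usage this location must support while a peer is unavailable.
        return self.peak_usage + self.failover_usage

    def headroom(self) -> float:
        # Positive or zero means the quota still covers the failover case.
        return self.quota - self.required()

    def is_sufficient(self) -> bool:
        return self.headroom() >= 0


check = QuotaCheck(quota=100, peak_usage=60, failover_usage=30)
print(check.headroom(), check.is_sufficient())
```

A check like this could run on a schedule so that a shrinking gap is noticed before a failover event rather than during one.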
5 best practices
- REL02-BP01: Use highly available network connectivity for your workload public endpoints
- REL02-BP02: Provision redundant connectivity between private networks in the cloud and on-premises environments
- REL02-BP03: Ensure IP subnet allocation accounts for expansion and availability
- REL02-BP04: Prefer hub-and-spoke topologies over many-to-many mesh
- REL02-BP05: Enforce non-overlapping private IP address ranges in all private address spaces where they are connected
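The non-overlap requirement in REL02-BP05 can be checked mechanically before networks are connected. The sketch below uses Python's standard `ipaddress` module; the `find_overlaps` function name and the CIDR ranges are assumptions chosen for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks in `cidrs` that share addresses."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


# Illustrative ranges: the third block sits inside the first one.
ranges = ["10.0.0.0/16", "10.1.0.0/16", "10.0.128.0/20"]
print(find_overlaps(ranges))
```

An empty result means the listed address spaces can be connected without conflicting routes.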
7 best practices
- REL05-BP01: Implement graceful degradation to transform applicable hard dependencies into soft dependencies
- REL05-BP02: Throttle requests
- REL05-BP03: Control and limit retry calls
- REL05-BP04: Fail fast and limit queues
- REL05-BP05: Set client timeouts
- REL05-BP06: Make systems stateless where possible
- REL05-BP07: Implement emergency levers
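One common way to control and limit retry calls (REL05-BP03) is a fixed attempt budget with capped exponential backoff and random jitter. The sketch below assumes that approach; `call_with_retries`, its parameters, and the `flaky` dependency are hypothetical names, and a production client would normally catch only errors known to be transient.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying failures up to max_attempts in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget spent: surface the error to the caller
            # Capped exponential backoff with full jitter, so that many
            # clients do not retry at the same instant.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


# Hypothetical dependency that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retries(flaky, base_delay=0.01))
```

The `sleep` parameter is injectable so the delay can be replaced in tests. Bounding the attempts keeps a failing dependency from being flooded with retries.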
7 best practices
- REL06-BP01: Monitor all components for the workload (Generation)
- REL06-BP02: Define and calculate metrics (Aggregation)
- REL06-BP03: Send notifications (Real-time processing and alarming)
- REL06-BP04: Automate responses (Real-time processing and alarming)
- REL06-BP05: Create dashboards
- REL06-BP06: Review metrics at regular intervals
- REL06-BP07: Monitor end-to-end tracing of requests through your system
5 best practices
- REL08-BP01: Use runbooks for standard activities such as deployment
- REL08-BP02: Integrate functional testing as part of your deployment
- REL08-BP03: Integrate resiliency testing as part of your deployment
- REL08-BP04: Deploy using immutable infrastructure
- REL08-BP05: Deploy changes with automation
4 best practices
- REL09-BP01: Identify and back up all data that needs to be backed up, or reproduce the data from sources
- REL09-BP02: Secure and encrypt backups
- REL09-BP03: Perform data backup automatically
- REL09-BP04: Perform periodic recovery of the data to verify backup integrity and processes
7 best practices
- REL11-BP01: Monitor all components of the workload to detect failures
- REL11-BP02: Fail over to healthy resources
- REL11-BP03: Automate healing on all layers
- REL11-BP04: Rely on the data plane and not the control plane during recovery
- REL11-BP05: Use static stability to prevent bimodal behavior
- REL11-BP06: Send notifications when events impact availability
- REL11-BP07: Architect your product to meet availability targets and uptime service level agreements (SLAs)
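When architecting toward an availability target (REL11-BP07), a rough estimate can be worked out from component availabilities. The sketch below assumes independent failures, which real systems only approximate; the function names and example figures are illustrative, not taken from any SLA.

```python
from functools import reduce


def serial_availability(*components):
    """Availability of a chain of hard dependencies: all must be up."""
    return reduce(lambda acc, a: acc * a, components, 1.0)


def redundant_availability(component, copies):
    """Availability when any one of `copies` independent replicas suffices."""
    return 1 - (1 - component) ** copies


def yearly_downtime_minutes(availability):
    """Expected downtime per 365-day year, in minutes."""
    return (1 - availability) * 365 * 24 * 60


# Two hard dependencies at 99.9% each lower the combined availability.
print(serial_availability(0.999, 0.999))
# Two independent replicas at 99% each raise it.
print(redundant_availability(0.99, 2))
print(yearly_downtime_minutes(0.999))
```

The comparison shows why hard dependencies reduce the achievable target while redundancy across independent resources raises it.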
5 best practices
- REL13-BP01: Define recovery objectives for downtime and data loss
- REL13-BP02: Use defined recovery strategies to meet the recovery objectives
- REL13-BP03: Test disaster recovery implementation to validate the implementation
- REL13-BP04: Manage configuration drift at the DR site or region
- REL13-BP05: Automate recovery
The Reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it is expected to. This includes the ability to operate and test the workload through its total lifecycle.
AWS Services for Reliability
Amazon CloudWatch
Monitors your AWS resources and the applications you run on AWS in real time.
AWS Auto Scaling
Monitors your applications and automatically adjusts capacity to maintain steady, predictable performance.
Amazon RDS
Makes it easy to set up, operate, and scale a relational database in the cloud with high availability.
AWS Elastic Disaster Recovery
Minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications.
AWS Backup
Centrally manages and automates backups across AWS services.
Elastic Load Balancing
Automatically distributes incoming application traffic across multiple targets.
Amazon Route 53
Provides a highly available and scalable cloud Domain Name System (DNS) web service.