REL11-BP04: Rely on the data plane and not the control plane during recovery
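Control planes provide the administrative APIs used to create, read, update, delete, and list resources; data planes deliver a service's primary, day-to-day function. During widespread failures, control plane APIs may become unavailable or throttled, so design recovery mechanisms that depend on data plane operations rather than control plane operations. Use pre-provisioned resources and cached configurations, and avoid control plane API calls on critical recovery paths.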
Implementation Steps
1. Pre-Provision Recovery Resources
Deploy standby resources in advance rather than creating them during recovery.
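How much standby capacity to deploy in advance can be estimated with simple arithmetic. The sketch below assumes the workload is spread evenly across several failure domains (for example, Availability Zones) and must absorb the loss of some of them without launching anything new. The function name and the example numbers are illustrative, not taken from any AWS API.

```python
import math


def instances_per_zone(peak_instances: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to keep running in each zone so the surviving zones
    can carry peak load without provisioning anything during recovery."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    # Round up: a fractional instance still needs a whole instance.
    return math.ceil(peak_instances / surviving)


# Hypothetical workload: 12 instances at peak, spread over 3 zones.
per_zone = instances_per_zone(peak_instances=12, zones=3)
total = per_zone * 3
print(per_zone, total)  # 6 18
```

In this example, tolerating the loss of one zone means running 18 instances instead of 12. The extra capacity is the price of not depending on the control plane during recovery.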
2. Cache Configuration Data
Store critical configuration data locally to avoid dependency on external APIs.
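One way to sketch local caching is a loader that refreshes a local copy whenever the remote source is reachable and falls back to the last known good copy when it is not. The `fetch` callable below stands in for whatever remote call a workload uses to read its configuration. Both it and the cache file layout are assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def load_config(fetch: Callable[[], dict], cache_path: Path) -> dict:
    """Return fresh configuration if the remote source answers,
    otherwise the last copy that was cached on local disk."""
    try:
        config = fetch()
    except Exception:
        # Remote source unavailable or throttled: use the local copy.
        # If no copy was ever cached, the FileNotFoundError propagates.
        with cache_path.open() as f:
            return json.load(f)

    # Write to a temporary file and rename it, so a crash mid-write
    # never leaves a truncated cache file behind.
    fd, tmp_name = tempfile.mkstemp(dir=str(cache_path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(config, f)
    os.replace(tmp_name, cache_path)
    return config
```

The cache is refreshed during normal operation, so the recovery path only ever reads a local file.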
3. Use Data Plane Operations
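Design recovery logic around data plane operations, such as reading an existing object or sending traffic through an existing load balancer, which remain available during control plane issues. Avoid steps that create, modify, or delete resources.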
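A recovery runbook can be checked in advance for steps that depend on the control plane. The sketch below is a minimal illustration: the runbook format (a list of `service:Operation` strings) is hypothetical, and the set of control plane operations is a small, non-exhaustive sample that a real review would need to extend per service.

```python
# Illustrative, non-exhaustive sample of operations that create or
# modify resources and therefore go through a service's control plane.
CONTROL_PLANE_OPS = frozenset({
    "ec2:RunInstances",
    "route53:ChangeResourceRecordSets",
    "s3:CreateBucket",
    "elasticloadbalancing:CreateLoadBalancer",
})


def control_plane_steps(runbook: list[str]) -> list[str]:
    """Return the runbook steps that would call a control plane API."""
    return [op for op in runbook if op in CONTROL_PLANE_OPS]


# Hypothetical runbook: two data plane reads and one instance launch.
runbook = ["s3:GetObject", "ec2:RunInstances", "sqs:ReceiveMessage"]
print(control_plane_steps(runbook))  # ['ec2:RunInstances']
```

Any step the check flags is a candidate for replacement with a pre-provisioned resource.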
4. Implement Static Routing
Configure static routing and failover paths that don’t require API calls.
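As a rough sketch of a static failover path, a client can carry an ordered list of endpoints that is fixed at deploy time and pick the first one that passes a local health probe. The endpoint URLs and the `is_healthy` probe below are placeholders; no lookup or API call is needed to find the standby.

```python
from typing import Callable, Optional, Sequence

# Fixed at deploy time (hypothetical addresses); never edited during recovery.
ENDPOINTS = (
    "https://primary.example.internal",
    "https://standby.example.internal",
)


def first_healthy(
    endpoints: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first endpoint, in configured order, that passes the probe."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None  # Every configured path is down.


# Simulate the primary failing its probe.
chosen = first_healthy(ENDPOINTS, lambda url: "primary" not in url)
print(chosen)  # https://standby.example.internal
```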
5. Enable Local Decision Making
Enable components to make recovery decisions based on local information.
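A component can decide to fail over using only what it observes itself. The sketch below counts consecutive failed calls and signals failover once a threshold is reached. The class name and the threshold of three are illustrative assumptions, not a prescribed value.

```python
from dataclasses import dataclass


@dataclass
class FailoverDecider:
    """Decides on failover from locally observed call results only."""

    threshold: int = 3
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one call result; return True when failover is warranted."""
        if success:
            # Any success resets the streak, so brief blips are ignored.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


decider = FailoverDecider(threshold=3)
observed = [False, False, True, False, False, False]
print([decider.record(ok) for ok in observed])
# [False, False, False, False, False, True]
```

Because the decision needs no external coordinator, it still works when management APIs cannot be reached.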
Detailed Implementation
AWS Services
Primary Services
- Amazon S3: Store configuration backups and recovery scripts
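- Amazon Route 53: DNS failover with health checks; health checking and DNS resolution are data plane operations, while editing records is a control plane operation
- Elastic Load Balancing: Existing load balancers keep routing traffic to healthy targets without control plane calls
- Amazon EC2: Pre-provisioned standby instances that are already running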
Supporting Services
- AWS Systems Manager: Parameter Store as the source of configuration that is cached locally ahead of recovery
- Amazon CloudWatch: Metrics and alarms (data plane operations)
- Amazon SQS: Asynchronous communication for recovery triggers
- AWS Lambda: Event-driven recovery logic
Benefits
- Control Plane Independence: Recovery works even when control plane APIs are unavailable or throttled
- Faster Recovery: Pre-provisioned resources eliminate provisioning delays
- Reduced API Throttling: Avoid control plane rate limits during incidents
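- Higher Reliability: Fewer dependencies on external services during recovery
- Cost Optimization: Right-size standby capacity and reuse cached data to limit costs. Stopped instances cost less but need a control plane call (StartInstances) to start, so keep them off critical recovery paths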