REL11-BP04: Rely on the data plane and not the control plane during recovery

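During widespread failures, control plane APIs (the operations that create, modify, or delete resources) may become unavailable or throttled, while the data plane (the part of a service that serves ongoing traffic on existing resources) typically remains available. Design recovery mechanisms that depend on data plane operations rather than control plane operations. Use pre-provisioned resources and cached configurations, and avoid control plane API calls on critical recovery paths.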

Implementation Steps

1. Pre-Provision Recovery Resources

Deploy standby resources in advance rather than creating them during recovery.
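As a minimal sketch of this step, the pool below holds standby endpoints that are assumed to be running before any incident. The class name `StandbyPool`, its fields, and the example addresses are hypothetical and used only for illustration. Promotion moves an existing endpoint into service and never tries to create capacity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StandbyPool:
    """Standby endpoints that were provisioned before any incident (hypothetical model)."""

    standbys: List[str]
    active: List[str] = field(default_factory=list)

    def promote(self) -> str:
        """Move one standby into service without provisioning anything new."""
        if not self.standbys:
            # Creating capacity here would need the control plane, so fail loudly instead.
            raise RuntimeError("standby pool exhausted; add capacity before an incident")
        endpoint = self.standbys.pop(0)
        self.active.append(endpoint)
        return endpoint


pool = StandbyPool(standbys=["10.0.1.10", "10.0.2.10"])
promoted = pool.promote()
```

Raising an error when the pool is empty, rather than falling back to provisioning, keeps the recovery path free of control plane calls and surfaces under-provisioning during testing.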

2. Cache Configuration Data

Store critical configuration data locally to avoid dependency on external APIs.
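One way to sketch this step is to split configuration handling into two paths: a refresh that runs during normal operation and a read that runs during recovery. The function names and the `fetch_remote` callable are assumptions for illustration. `fetch_remote` stands in for whatever remote configuration source is used, and the recovery path reads only the local file.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict


def refresh_cache(fetch_remote: Callable[[], Dict[str, str]], cache_path: Path) -> bool:
    """Normal operation: copy remote configuration to local disk.

    Returns False and leaves the previous cache untouched if the fetch fails.
    """
    try:
        config = fetch_remote()
    except Exception:
        return False
    # Write to a temporary file, then rename, so readers never see a partial file.
    fd, tmp_name = tempfile.mkstemp(dir=str(cache_path.parent))
    with os.fdopen(fd, "w") as handle:
        json.dump(config, handle)
    os.replace(tmp_name, str(cache_path))
    return True


def load_cached_config(cache_path: Path) -> Dict[str, str]:
    """Recovery path: read only from local disk, with no network calls."""
    with cache_path.open() as handle:
        return json.load(handle)
```

If a later refresh fails because the remote API is unavailable, the last good copy stays on disk and `load_cached_config` continues to return it.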

3. Use Data Plane Operations

Design recovery logic to use data plane operations that remain available during control plane issues.
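A recovery runbook can be checked ahead of time for control plane dependencies. The sketch below assumes a runbook is expressed as a list of `service:Action` strings, which is a hypothetical convention. The set of control plane actions is a small illustrative sample, not a complete classification.

```python
from typing import FrozenSet, List

# Illustrative sample only: actions that create, start, or modify resources.
CONTROL_PLANE_ACTIONS: FrozenSet[str] = frozenset({
    "ec2:RunInstances",
    "ec2:StartInstances",
    "route53:ChangeResourceRecordSets",
    "elasticloadbalancing:CreateLoadBalancer",
})


def control_plane_steps(runbook: List[str]) -> List[str]:
    """Return the runbook steps that depend on the control plane, in order."""
    return [step for step in runbook if step in CONTROL_PLANE_ACTIONS]


# Hypothetical runbook: two data plane reads and one control plane call.
flagged = control_plane_steps(["s3:GetObject", "ec2:RunInstances", "sqs:ReceiveMessage"])
```

A check like this could run in a build pipeline so that a runbook with flagged steps is reviewed before it is needed in an incident.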

4. Implement Static Routing

Configure static routing and failover paths ahead of time so that failover does not require control plane API calls.
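A static failover path can be as simple as an ordered list of endpoints defined before any incident. In this sketch the table `FAILOVER_ROUTES`, the service name, and the hostnames are all hypothetical. Selection walks the list in priority order and needs no external lookup.

```python
from typing import Dict, List, Optional, Set

# Defined at deploy time; priority order, primary first (hypothetical hostnames).
FAILOVER_ROUTES: Dict[str, List[str]] = {
    "orders": ["orders.primary.internal", "orders.standby.internal"],
}


def pick_endpoint(service: str, unhealthy: Set[str]) -> Optional[str]:
    """Return the highest-priority endpoint not marked unhealthy, or None."""
    for endpoint in FAILOVER_ROUTES.get(service, []):
        if endpoint not in unhealthy:
            return endpoint
    return None


normal = pick_endpoint("orders", unhealthy=set())
failed_over = pick_endpoint("orders", unhealthy={"orders.primary.internal"})
```

Because the table is fixed in advance, failover is a local lookup. The trade-off is that adding or removing endpoints requires a new deployment.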

5. Local Decision Making

Enable components to make recovery decisions based on local information.
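A common way to decide locally is to count consecutive failed probes and trigger failover once a threshold is reached. The sketch below assumes the component runs its own probes. The class name `LocalHealthTracker` and the default threshold of 3 are illustrative choices, not prescribed values.

```python
from dataclasses import dataclass


@dataclass
class LocalHealthTracker:
    """Tracks probe results held in local memory only (hypothetical helper)."""

    failure_threshold: int = 3
    consecutive_failures: int = 0

    def record(self, probe_succeeded: bool) -> bool:
        """Record one probe result; return True when failover should trigger."""
        if probe_succeeded:
            # Any success resets the streak, so brief blips do not cause failover.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


tracker = LocalHealthTracker(failure_threshold=3)
decisions = [tracker.record(ok) for ok in [False, False, True, False, False, False]]
# decisions == [False, False, False, False, False, True]
```

The decision uses only state the component already holds, so it still works when no external API can be reached.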

Detailed Implementation

AWS Services

Primary Services

Amazon S3: Store configuration backups and recovery scripts
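Amazon Route 53: DNS failover with health checks (answering queries and evaluating health checks are data plane operations; creating or changing records is a control plane operation)
Elastic Load Balancing: Traffic routing to already-registered healthy targets without control plane calls
Amazon EC2: Pre-provisioned standby instances that are already running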

Supporting Services

AWS Systems Manager: Parameter Store as the configuration source, read ahead of time and cached locally
Amazon CloudWatch: Metrics and alarms (data plane operations)
Amazon SQS: Asynchronous communication for recovery triggers
AWS Lambda: Event-driven recovery logic in functions deployed ahead of time

Benefits

Control Plane Independence: Recovery works even when control plane APIs are unavailable
Faster Recovery: Pre-provisioned resources eliminate provisioning delays
Reduced API Throttling: Avoid control plane rate limits during incidents
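Higher Reliability: Fewer dependencies on external services during recovery
Cost Optimization: Use cached data and right-sized standby capacity to limit cost. Stopped instances are cheaper to keep, but starting them requires a control plane call (EC2 StartInstances), so do not rely on them for critical recovery paths.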