REL11-BP04: Rely on the data plane and not the control plane during recovery

During widespread failures, control plane APIs (the ones that create, modify, and delete resources) may become unavailable or throttled, while the data plane (the part that serves ongoing traffic) typically keeps working. Design recovery mechanisms that depend on data plane operations rather than control plane operations: use pre-provisioned resources and cached configurations, and avoid control plane API calls on critical recovery paths.

Implementation Steps

1. Pre-Provision Recovery Resources

Deploy standby resources in advance rather than creating them during recovery.
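
One common way to size pre-provisioned capacity is to run enough instances in every zone that the surviving zones can carry peak load on their own, so nothing has to be launched during recovery. The following is a minimal sketch of that arithmetic; the function name and the example numbers are illustrative assumptions, not values from this guidance.

```python
import math


def instances_per_zone(peak_instances: int, zones: int) -> int:
    """Instances to keep running in each zone so that losing any one
    zone still leaves at least `peak_instances` in service.

    Hypothetical helper for illustration only.
    """
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone loss")
    # The remaining (zones - 1) zones must cover the whole peak.
    return math.ceil(peak_instances / (zones - 1))


per_zone = instances_per_zone(peak_instances=12, zones=3)
total_running = per_zone * 3
surviving_after_zone_loss = per_zone * 2
print(per_zone, total_running, surviving_after_zone_loss)
```

With 12 instances needed at peak across 3 zones, this keeps 6 running per zone (18 in total), so 12 remain after one zone is lost. The extra capacity is the price of not calling a provisioning API mid-incident.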

2. Cache Configuration Data

Store critical configuration data locally to avoid dependency on external APIs.
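
A minimal sketch of this pattern: refresh a local copy whenever the remote source answers, and fall back to the last good copy when it does not. The `fetch_remote` callable and the cache path are assumptions standing in for whatever configuration source and storage location a workload actually uses.

```python
import json
import os
import tempfile
from typing import Callable


def load_config(fetch_remote: Callable[[], dict], cache_path: str) -> dict:
    """Return fresh configuration if the remote source responds,
    otherwise the last copy cached on local disk.

    `fetch_remote` is a hypothetical callable supplied by the caller.
    If the remote call fails and no cache exists yet, the
    FileNotFoundError from open() propagates to the caller.
    """
    try:
        config = fetch_remote()
    except Exception:
        # Remote source unavailable or throttled: use the local copy.
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)

    # Write to a temporary file and rename it, so a crash mid-write
    # never leaves a truncated cache behind.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(config, f)
    os.replace(tmp_path, cache_path)
    return config
```

Calling `load_config` during normal operation keeps the cache warm, so the fallback branch has something to read when an incident starts.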

3. Use Data Plane Operations

Design recovery logic to use data plane operations that remain available during control plane issues.
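
One way to keep a recovery runbook honest is to tag each step with the plane it depends on and flag any control plane step before an incident happens. The sketch below assumes a hand-maintained classification; the `Step` type and the runbook contents are hypothetical.

```python
from enum import Enum
from typing import List, NamedTuple


class Plane(Enum):
    CONTROL = "control"
    DATA = "data"


class Step(NamedTuple):
    description: str
    plane: Plane


def control_plane_steps(runbook: List[Step]) -> List[str]:
    """Return descriptions of the steps that depend on a control plane."""
    return [step.description for step in runbook if step.plane is Plane.CONTROL]


# Hypothetical runbook; the plane of each step is assigned by hand.
RUNBOOK = [
    Step("Health check marks the primary endpoint unhealthy", Plane.DATA),
    Step("DNS answers shift to the standby endpoint", Plane.DATA),
    Step("Launch replacement instances", Plane.CONTROL),
]

print(control_plane_steps(RUNBOOK))
```

Any step the check reports is a candidate for pre-provisioning or for moving off the critical recovery path.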

4. Implement Static Routing

Configure static routing and failover paths that don’t require API calls.
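
As an illustration, a failover path can be a fixed, ordered list of endpoints shipped with the application, with the choice made by a local reachability probe. The service name, hostnames, and the `is_reachable` callable below are assumptions for the sketch.

```python
from typing import Callable, Dict, List

# Hypothetical static table deployed with the application:
# endpoints are listed in order of preference, primary first.
STATIC_ROUTES: Dict[str, List[str]] = {
    "orders": [
        "orders.primary.example.internal",
        "orders.standby.example.internal",
    ],
}


def resolve(service: str, is_reachable: Callable[[str], bool]) -> str:
    """Return the first reachable endpoint for `service`.

    `is_reachable` is a caller-supplied probe (for example a TCP connect
    or an HTTP health request); no provider API is consulted.
    """
    for endpoint in STATIC_ROUTES[service]:
        if is_reachable(endpoint):
            return endpoint
    raise RuntimeError(f"no reachable endpoint for {service!r}")
```

Because the table never changes at runtime, failover costs one extra probe rather than a lookup against an API that may itself be degraded.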

5. Local Decision Making

Enable components to make recovery decisions based on local information.
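
A minimal sketch of a local decision rule: a component counts consecutive probe results it observes itself and switches to the standby after a run of failures, then back after a run of successes. The class name and both thresholds are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class FailoverDecider:
    """Decides between primary and standby from locally observed probes."""

    failure_threshold: int = 3   # consecutive failures before failing over
    recovery_threshold: int = 5  # consecutive successes before failing back
    use_standby: bool = False
    _failures: int = 0
    _successes: int = 0

    def observe(self, primary_ok: bool) -> bool:
        """Record one probe of the primary; return True while the
        standby should be used."""
        if primary_ok:
            self._successes += 1
            self._failures = 0
        else:
            self._failures += 1
            self._successes = 0

        if not self.use_standby and self._failures >= self.failure_threshold:
            self.use_standby = True
        elif self.use_standby and self._successes >= self.recovery_threshold:
            self.use_standby = False
        return self.use_standby


decider = FailoverDecider(failure_threshold=3, recovery_threshold=2)
probes = [False, False, True, False, False, False, True, True]
print([decider.observe(ok) for ok in probes])
```

Requiring consecutive results in both directions avoids flapping on a single lost probe, and the decision needs nothing beyond what the component can see for itself.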

Detailed Implementation

AWS Services

Primary Services

  • Amazon S3: Store configuration backups and recovery scripts
  • Amazon Route 53: DNS failover with health checks (data plane operations)
  • Elastic Load Balancing: Health-check-based traffic routing that needs no API calls at failover time
  • Amazon EC2: Pre-provisioned standby instances

Supporting Services

  • AWS Systems Manager: Parameter Store as the source of configuration that components cache locally
  • Amazon CloudWatch: Metrics and alarms (data plane operations)
  • Amazon SQS: Asynchronous communication for recovery triggers
  • AWS Lambda: Event-driven recovery logic

Benefits

  • Control Plane Independence: Recovery works even when control plane APIs are unavailable
  • Faster Recovery: Pre-provisioned resources eliminate provisioning delays
  • Reduced API Throttling: Avoid control plane rate limits during incidents
  • Higher Reliability: Fewer dependencies on external services during recovery
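  • Cost Optimization: Stopped instances and cached data reduce costs, but starting a stopped instance is itself a control plane call, so keep standbys on the critical recovery path running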