REL05-BP01 - Implement graceful degradation to transform applicable hard dependencies into soft dependencies
Overview
Design systems to degrade gracefully when a dependency becomes unavailable, so that a hard dependency, whose failure would take the whole system down, becomes a soft dependency, whose failure costs only the features that rely on it. Core functionality keeps running at reduced capability, which preserves the essential user experience during partial outages.
Implementation Steps
1. Identify and Classify Dependencies
- Categorize dependencies as critical, important, or optional
- Map dependencies to specific features and functionality
- Identify which features can operate with reduced capability
- Document fallback strategies for each dependency type
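The classification in step 1 can be kept as data rather than prose, so that the state of each feature follows mechanically from the set of unhealthy dependencies. A minimal sketch, assuming a hypothetical storefront: the feature names, dependency names, and the `feature_state` helper are invented for illustration and are not part of any AWS API.

```python
from enum import Enum


class Criticality(Enum):
    CRITICAL = "critical"    # the feature cannot work without this dependency
    IMPORTANT = "important"  # the feature works with reduced capability
    OPTIONAL = "optional"    # the feature only loses extras


# Hypothetical feature -> {dependency: criticality} map, for illustration only.
FEATURE_DEPENDENCIES = {
    "checkout": {"payments": Criticality.CRITICAL, "inventory": Criticality.IMPORTANT},
    "product_page": {"catalog": Criticality.CRITICAL, "recommendations": Criticality.OPTIONAL},
    "search": {"search_index": Criticality.CRITICAL},
}


def feature_state(feature: str, unhealthy: set) -> str:
    """Return 'available', 'degraded', or 'unavailable' for a feature."""
    deps = FEATURE_DEPENDENCIES[feature]
    # Criticality levels of the dependencies that are currently failing
    failed = {crit for name, crit in deps.items() if name in unhealthy}
    if Criticality.CRITICAL in failed:
        return "unavailable"
    if Criticality.IMPORTANT in failed:
        return "degraded"
    return "available"
```

Keeping the map in one place also documents which dependencies are still hard: every `CRITICAL` entry is a candidate for a fallback strategy.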
2. Design Fallback Mechanisms
- Implement cached responses for unavailable services
- Create default behaviors when dependencies fail
- Design simplified workflows that bypass failed components
- Establish static content delivery for dynamic services
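The first two fallback mechanisms above (cached responses, then a default behavior) can be combined into one small wrapper. This is a sketch under stated assumptions: it uses an in-process dictionary rather than a shared cache, and the `CachedFallback` class and its parameters are hypothetical names chosen for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedFallback:
    """Serve the last good response, then a static default, when a call fails."""

    def __init__(self, default: Any, max_age_seconds: float = 300.0):
        self.default = default
        self.max_age_seconds = max_age_seconds
        # key -> (time the value was stored, value)
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def call(self, key: str, func: Callable[[], Any]) -> Tuple[Any, str]:
        """Return (value, source); source is 'primary', 'cache', or 'default'."""
        try:
            value = func()
            self._cache[key] = (time.monotonic(), value)
            return value, "primary"
        except Exception:
            cached = self._cache.get(key)
            # Use the cached value only while it is fresh enough to be useful
            if cached and time.monotonic() - cached[0] <= self.max_age_seconds:
                return cached[1], "cache"
            return self.default, "default"
```

Returning the source alongside the value lets the caller tell users, and monitoring, that the response is degraded.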
3. Implement Feature Toggles and Circuit Breakers
- Deploy feature flags to disable non-essential functionality
- Implement circuit breakers to detect and isolate failures
- Create automatic fallback activation based on health checks
- Design manual override capabilities for emergency situations
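A circuit breaker stops calling a dependency after repeated failures and probes it again after a cool-down period. The following is a minimal single-threaded sketch, not a production implementation; the class name, the thresholds, and the injectable `clock` parameter are assumptions made for this example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # cool-down elapsed: allow one trial call
        return "open"

    def call(self, func: Callable[[], Any]) -> Any:
        if self.state == "open":
            raise CircuitOpenError("dependency call skipped")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call in half-open state, or too many failures
            # while closed, opens (or re-opens) the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and go straight to its fallback, which avoids waiting on timeouts from a dependency that is already known to be down.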
4. Establish Graceful User Experience
- Design user interfaces that adapt to reduced functionality
- Implement informative error messages and status indicators
- Provide alternative workflows when primary paths fail
- Maintain core user journeys even with degraded services
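One way to keep core user journeys alive is to load each part of a page independently and report which parts are missing. The sketch below assumes a hypothetical `build_page` helper and section names; the response shape (`status`, `degraded`, `message`) is invented for illustration.

```python
from typing import Any, Callable, Dict


def build_page(sections: Dict[str, Callable[[], Any]], required: set) -> Dict[str, Any]:
    """Load each section independently so one failure does not blank the page."""
    page: Dict[str, Any] = {"sections": {}, "degraded": [], "status": "ok"}
    for name, loader in sections.items():
        try:
            page["sections"][name] = loader()
        except Exception:
            if name in required:
                # A required section failed: the core journey itself is broken.
                page["status"] = "unavailable"
                page["message"] = "This page is temporarily unavailable. Please try again shortly."
                return page
            # An optional section failed: record it and keep rendering the rest.
            page["degraded"].append(name)
    if page["degraded"]:
        page["status"] = "degraded"
        page["message"] = ("Some features are temporarily unavailable: "
                           + ", ".join(page["degraded"]))
    return page
```

The user interface can then hide or grey out the sections listed in `degraded` and show `message` as a status indicator instead of a generic error page.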
5. Implement Data and State Management
- Cache critical data locally for offline operation
- Design eventual consistency patterns for data synchronization
- Implement read-only modes when write operations fail
- Create data replication strategies for high availability
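Read-only mode can be sketched as a thin facade that rejects writes quickly once the backing store has failed, while reads continue from a local copy. The `DegradableStore` class, its `backend_write` callable, and `resume_writes` are hypothetical names for this example; a real system would also need a policy for refreshing and invalidating the local copy.

```python
from typing import Callable, Dict, Optional


class ReadOnlyModeError(Exception):
    """Raised when a write is rejected because the store is in read-only mode."""


class DegradableStore:
    """Key-value facade that keeps serving reads from a local copy when writes fail."""

    def __init__(self, backend_write: Callable[[str, str], None]):
        self._backend_write = backend_write  # may raise when the backend is down
        self._local: Dict[str, str] = {}     # last known good values
        self.read_only = False

    def put(self, key: str, value: str) -> None:
        if self.read_only:
            raise ReadOnlyModeError("writes are temporarily disabled")
        try:
            self._backend_write(key, value)
        except Exception as exc:
            # Switch to read-only so later writes fail fast instead of timing out.
            self.read_only = True
            raise ReadOnlyModeError("write failed; entering read-only mode") from exc
        self._local[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._local.get(key)

    def resume_writes(self) -> None:
        """Called by an operator or a health check once the backend has recovered."""
        self.read_only = False
```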
6. Monitor and Test Degradation Scenarios
- Implement monitoring for dependency health and fallback activation
- Create automated testing for degradation scenarios
- Establish alerting for when systems operate in degraded mode
- Regularly test fallback mechanisms and recovery procedures
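Automated degradation tests usually rely on some form of fault injection: a wrapper that forces a dependency call to fail so the fallback path is exercised on every test run. A minimal sketch, assuming a hypothetical `homepage_items` function as the code under test and a hypothetical `FaultInjector` wrapper:

```python
from typing import Callable, List


class FaultInjector:
    """Wraps a dependency call so tests can force it to fail on demand."""

    def __init__(self, func: Callable[..., List[str]]):
        self._func = func
        self.fail = False
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.fail:
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def homepage_items(fetch_recommendations: Callable[[str], List[str]], user_id: str) -> dict:
    """Code under test: falls back to a static list when recommendations fail."""
    try:
        return {"items": fetch_recommendations(user_id), "degraded": False}
    except Exception:
        return {"items": ["bestsellers"], "degraded": True}


def test_homepage_survives_recommendation_outage():
    injector = FaultInjector(lambda user_id: [f"picked-for-{user_id}"])
    # Healthy dependency: the primary path is used
    assert homepage_items(injector, "u1") == {"items": ["picked-for-u1"], "degraded": False}
    # Injected outage: the static fallback is served and flagged as degraded
    injector.fail = True
    assert homepage_items(injector, "u1") == {"items": ["bestsellers"], "degraded": True}
    # Recovery: the primary path is used again
    injector.fail = False
    assert homepage_items(injector, "u1")["degraded"] is False
    assert injector.calls == 3
```

The same test covers recovery, which is easy to overlook: a system that enters degraded mode correctly but never leaves it is still broken.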
Implementation Examples
Example 1: Graceful Degradation Framework
import boto3
import json
import logging
import time
import asyncio
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Any, Callable, Union
from dataclasses import dataclass, asdict
from enum import Enum
from abc import ABC, abstractmethod
import threading
from contextlib import asynccontextmanager


class DependencyType(Enum):
    CRITICAL = "critical"
    IMPORTANT = "important"
    OPTIONAL = "optional"


class DegradationLevel(Enum):
    FULL_FUNCTIONALITY = "full_functionality"
    REDUCED_FUNCTIONALITY = "reduced_functionality"
    MINIMAL_FUNCTIONALITY = "minimal_functionality"
    EMERGENCY_MODE = "emergency_mode"


@dataclass
class DependencyStatus:
    name: str
    dependency_type: DependencyType
    is_healthy: bool
    last_check: datetime
    failure_count: int
    response_time_ms: float
    error_message: Optional[str] = None


@dataclass
class FallbackStrategy:
    dependency_name: str
    fallback_type: str
    fallback_data: Dict[str, Any]
    cache_ttl_seconds: int
    enabled: bool = True


class DependencyHealthChecker:
    """Health checker for system dependencies"""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.dependencies = {}
        self.health_status = {}
        self.check_interval = config.get('check_interval_seconds', 30)
        self.failure_threshold = config.get('failure_threshold', 3)

    def register_dependency(self, name: str, dependency_type: DependencyType,
                            health_check_func: Callable) -> None:
        """Register a dependency for health monitoring"""
        self.dependencies[name] = {
            'type': dependency_type,
            'health_check': health_check_func,
            'status': DependencyStatus(
                name=name,
                dependency_type=dependency_type,
                is_healthy=True,
                last_check=datetime.utcnow(),
                failure_count=0,
                response_time_ms=0.0
            )
        }
        logging.info(f"Registered dependency: {name} ({dependency_type.value})")

    async def start_health_monitoring(self):
        """Start continuous health monitoring"""
        while True:
            try:
                await self._check_all_dependencies()
                await asyncio.sleep(self.check_interval)
            except Exception as e:
                logging.error(f"Health monitoring error: {str(e)}")
                await asyncio.sleep(self.check_interval)

    async def _check_all_dependencies(self):
        """Check health of all registered dependencies"""
        tasks = []
        for name, dependency in self.dependencies.items():
            task = asyncio.create_task(self._check_dependency_health(name, dependency))
            tasks.append(task)
        await asyncio.gather(*tasks, return_exceptions=True)

    async def _check_dependency_health(self, name: str, dependency: Dict[str, Any]):
        """Check health of individual dependency"""
        try:
            start_time = time.time()
            health_check_func = dependency['health_check']
            # Execute health check
            is_healthy = await health_check_func()
            response_time = (time.time() - start_time) * 1000
            # Update status
            status = dependency['status']
            status.is_healthy = is_healthy
            status.last_check = datetime.utcnow()
            status.response_time_ms = response_time
            if is_healthy:
                status.failure_count = 0
                status.error_message = None
            else:
                status.failure_count += 1
                status.error_message = "Health check failed"
            self.health_status[name] = status
            # Log status changes
            if not is_healthy and status.failure_count >= self.failure_threshold:
                logging.warning(f"Dependency {name} is unhealthy (failures: {status.failure_count})")
            elif is_healthy and status.failure_count == 0:
                logging.info(f"Dependency {name} is healthy (response time: {response_time:.2f}ms)")
        except Exception as e:
            logging.error(f"Health check failed for {name}: {str(e)}")
            status = dependency['status']
            status.is_healthy = False
            status.failure_count += 1
            status.error_message = str(e)
            status.last_check = datetime.utcnow()
            self.health_status[name] = status

    def get_dependency_status(self, name: str) -> Optional[DependencyStatus]:
        """Get current status of a dependency"""
        return self.health_status.get(name)

    def get_system_degradation_level(self) -> DegradationLevel:
        """Determine current system degradation level"""
        critical_failures = 0
        important_failures = 0
        for status in self.health_status.values():
            if not status.is_healthy and status.failure_count >= self.failure_threshold:
                if status.dependency_type == DependencyType.CRITICAL:
                    critical_failures += 1
                elif status.dependency_type == DependencyType.IMPORTANT:
                    important_failures += 1
        if critical_failures > 0:
            return DegradationLevel.EMERGENCY_MODE
        elif important_failures >= 2:
            return DegradationLevel.MINIMAL_FUNCTIONALITY
        elif important_failures >= 1:
            return DegradationLevel.REDUCED_FUNCTIONALITY
        else:
            return DegradationLevel.FULL_FUNCTIONALITY


class GracefulDegradationManager:
    """Manager for graceful degradation strategies"""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.health_checker = DependencyHealthChecker(config.get('health_checker', {}))
        self.fallback_strategies = {}
        self.cache = {}
        self.feature_flags = {}
        # AWS clients
        self.dynamodb = boto3.resource('dynamodb')
        self.s3 = boto3.client('s3')
        self.ssm = boto3.client('ssm')
        # Cache configuration
        self.cache_table_name = config.get('cache_table_name', 'degradation-cache')
        self.cache_table = self.dynamodb.Table(self.cache_table_name)

    def register_fallback_strategy(self, strategy: FallbackStrategy):
        """Register a fallback strategy for a dependency"""
        self.fallback_strategies[strategy.dependency_name] = strategy
        logging.info(f"Registered fallback strategy for {strategy.dependency_name}")

    async def execute_with_fallback(self, dependency_name: str,
                                    primary_func: Callable,
                                    *args, **kwargs) -> Dict[str, Any]:
        """Execute function with fallback on dependency failure"""
        try:
            # Check dependency health
            status = self.health_checker.get_dependency_status(dependency_name)
            # A missing status means no health check has run yet, so try the
            # primary path rather than assuming the dependency is down.
            if status is None or status.is_healthy:
                # Dependency is healthy, execute primary function
                result = await primary_func(*args, **kwargs)
                # Cache successful result for future fallback
                await self._cache_result(dependency_name, result)
                return {
                    'success': True,
                    'data': result,
                    'source': 'primary',
                    'degradation_level': 'none'
                }
            else:
                # Dependency is unhealthy, use fallback
                return await self._execute_fallback(dependency_name, *args, **kwargs)
        except Exception as e:
            logging.error(f"Primary function failed for {dependency_name}: {str(e)}")
            return await self._execute_fallback(dependency_name, *args, **kwargs)

    async def _execute_fallback(self, dependency_name: str, *args, **kwargs) -> Dict[str, Any]:
        """Execute fallback strategy for failed dependency"""
        try:
            strategy = self.fallback_strategies.get(dependency_name)
            if not strategy or not strategy.enabled:
                return {
                    'success': False,
                    'error': f'No fallback strategy available for {dependency_name}',
                    'source': 'none',
                    'degradation_level': 'critical'
                }
            if strategy.fallback_type == 'cached_response':
                return await self._get_cached_fallback(dependency_name)
            elif strategy.fallback_type == 'static_response':
                return await self._get_static_fallback(strategy)
            elif strategy.fallback_type == 'alternative_service':
                return await self._get_alternative_service_fallback(strategy, *args, **kwargs)
            elif strategy.fallback_type == 'degraded_functionality':
                return await self._get_degraded_functionality_fallback(strategy, *args, **kwargs)
            else:
                return {
                    'success': False,
                    'error': f'Unknown fallback type: {strategy.fallback_type}',
                    'source': 'fallback',
                    'degradation_level': 'critical'
                }
        except Exception as e:
            logging.error(f"Fallback execution failed for {dependency_name}: {str(e)}")
            return {
                'success': False,
                'error': str(e),
                'source': 'fallback',
                'degradation_level': 'critical'
            }

    async def _get_cached_fallback(self, dependency_name: str) -> Dict[str, Any]:
        """Get cached response as fallback"""
        try:
            # Note: boto3 calls are synchronous and block the event loop
            response = self.cache_table.get_item(
                Key={'dependency_name': dependency_name}
            )
            if 'Item' in response:
                cached_data = response['Item']
                cache_age = time.time() - float(cached_data.get('timestamp', 0))
                return {
                    'success': True,
                    'data': json.loads(cached_data.get('data', '{}')),
                    'source': 'cache',
                    'degradation_level': 'reduced',
                    'cache_age_seconds': cache_age
                }
            else:
                return {
                    'success': False,
                    'error': 'No cached data available',
                    'source': 'cache',
                    'degradation_level': 'critical'
                }
        except Exception as e:
            logging.error(f"Cache fallback failed: {str(e)}")
            return {
                'success': False,
                'error': str(e),
                'source': 'cache',
                'degradation_level': 'critical'
            }

    async def _get_static_fallback(self, strategy: FallbackStrategy) -> Dict[str, Any]:
        """Get static response as fallback"""
        return {
            'success': True,
            'data': strategy.fallback_data,
            'source': 'static',
            'degradation_level': 'minimal'
        }

    async def _get_alternative_service_fallback(self, strategy: FallbackStrategy,
                                                *args, **kwargs) -> Dict[str, Any]:
        """Placeholder: replace with a call to an application-specific alternative service"""
        return {
            'success': False,
            'error': f'Alternative service fallback not implemented for {strategy.dependency_name}',
            'source': 'fallback',
            'degradation_level': 'critical'
        }

    async def _get_degraded_functionality_fallback(self, strategy: FallbackStrategy,
                                                   *args, **kwargs) -> Dict[str, Any]:
        """Placeholder: replace with an application-specific simplified workflow"""
        return {
            'success': False,
            'error': f'Degraded functionality fallback not implemented for {strategy.dependency_name}',
            'source': 'fallback',
            'degradation_level': 'critical'
        }

    async def _cache_result(self, dependency_name: str, result: Any):
        """Cache successful result for future fallback use"""
        try:
            # Use the registered strategy's TTL when there is one; default to 1 hour
            strategy = self.fallback_strategies.get(dependency_name)
            ttl_seconds = strategy.cache_ttl_seconds if strategy else 3600
            now = time.time()
            self.cache_table.put_item(
                Item={
                    'dependency_name': dependency_name,
                    'data': json.dumps(result, default=str),
                    'timestamp': str(now),
                    # Expiry epoch; items expire only if TTL is enabled on this attribute
                    'ttl': int(now + ttl_seconds)
                }
            )
        except Exception as e:
            logging.error(f"Failed to cache result for {dependency_name}: {str(e)}")


class FeatureToggleManager:
    """Manager for feature toggles during degradation"""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.ssm = boto3.client('ssm')
        self.feature_flags = {}
        self.parameter_prefix = config.get('parameter_prefix', '/app/features/')

    async def load_feature_flags(self):
        """Load feature flags from AWS Systems Manager Parameter Store"""
        try:
            # get_parameters_by_path returns results in pages, so use a paginator
            # to avoid silently dropping flags beyond the first page.
            paginator = self.ssm.get_paginator('get_parameters_by_path')
            for page in paginator.paginate(Path=self.parameter_prefix, Recursive=True):
                for parameter in page['Parameters']:
                    # Strip only the leading prefix from the parameter name
                    feature_name = parameter['Name'].replace(self.parameter_prefix, '', 1)
                    self.feature_flags[feature_name] = parameter['Value'].lower() == 'true'
            logging.info(f"Loaded {len(self.feature_flags)} feature flags")
        except Exception as e:
            logging.error(f"Failed to load feature flags: {str(e)}")

    def is_feature_enabled(self, feature_name: str) -> bool:
        """Check if a feature is enabled"""
        return self.feature_flags.get(feature_name, True)  # Default to enabled

    async def disable_feature(self, feature_name: str):
        """Disable a feature during degradation"""
        try:
            parameter_name = f"{self.parameter_prefix}{feature_name}"
            self.ssm.put_parameter(
                Name=parameter_name,
                Value='false',
                Type='String',
                Overwrite=True
            )
            self.feature_flags[feature_name] = False
            logging.info(f"Disabled feature: {feature_name}")
        except Exception as e:
            logging.error(f"Failed to disable feature {feature_name}: {str(e)}")


# Usage example
async def main():
    config = {
        'health_checker': {
            'check_interval_seconds': 30,
            'failure_threshold': 3
        },
        'cache_table_name': 'degradation-cache'
    }
    # Initialize graceful degradation manager
    degradation_manager = GracefulDegradationManager(config)

    # Register dependencies
    async def check_user_service_health():
        # Implement actual health check
        return True

    degradation_manager.health_checker.register_dependency(
        'user_service',
        DependencyType.IMPORTANT,
        check_user_service_health
    )
    # Register fallback strategy
    fallback_strategy = FallbackStrategy(
        dependency_name='user_service',
        fallback_type='cached_response',
        fallback_data={},
        cache_ttl_seconds=3600
    )
    degradation_manager.register_fallback_strategy(fallback_strategy)
    # Start health monitoring (runs until the process is stopped)
    await degradation_manager.health_checker.start_health_monitoring()


if __name__ == "__main__":
    asyncio.run(main())
AWS Services Used
- AWS Systems Manager Parameter Store: Feature flag management and configuration storage
- Amazon DynamoDB: Caching layer for fallback responses and dependency status
- Amazon S3: Static content delivery for degraded functionality
- Amazon CloudFront: CDN for serving cached and static content during degradation
- AWS Lambda: Serverless functions for health checks and fallback processing
- Amazon API Gateway: API management with built-in throttling and fallback responses
- Amazon ElastiCache: High-performance caching for frequently accessed fallback data
- Amazon CloudWatch: Monitoring and alerting for degradation events and recovery
- AWS Step Functions: Workflow orchestration with fallback and retry logic
- Amazon SQS: Message queuing for asynchronous fallback processing
- Amazon SNS: Notifications for degradation events and system status changes
- AWS X-Ray: Distributed tracing for monitoring degradation patterns and performance
- Amazon Route 53: DNS-based failover and health checking for service endpoints
- Elastic Load Balancing: Load balancing with health checks and automatic failover
- AWS Config: Configuration compliance monitoring for degradation policies
- AWS Secrets Manager: Secure storage of fallback service credentials and API keys
Benefits
- Improved System Resilience: Core functionality continues even when dependencies fail
- Better User Experience: Users can still access essential features during outages
- Reduced Blast Radius: Dependency failures don’t cause complete system outages
- Faster Recovery: Systems can operate in degraded mode while issues are resolved
- Cost Optimization: Reduced infrastructure requirements during degraded operation
- Enhanced Availability: Higher overall system availability through graceful degradation
- Simplified Incident Response: Clear degradation levels help prioritize recovery efforts
- Business Continuity: Critical business processes can continue with reduced functionality
- Improved Testing: Degradation scenarios can be tested and validated regularly
- Better Monitoring: Clear visibility into system health and degradation levels