REL05-BP01 - Implement graceful degradation to transform applicable hard dependencies into soft dependencies

Overview

Design systems to degrade gracefully when a dependency becomes unavailable. A hard dependency takes the whole system down when it fails; a soft dependency lets core functionality continue while only the features that rely on it are reduced or disabled. Converting hard dependencies into soft ones where applicable keeps essential services running, which improves user experience and system resilience during partial outages.

Implementation Steps

1. Identify and Classify Dependencies

  • Categorize dependencies as critical, important, or optional
  • Map dependencies to specific features and functionality
  • Identify which features can operate with reduced capability
  • Document fallback strategies for each dependency type
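The classification in this step can be recorded as data so that the impact of an outage can be queried. The following is a minimal sketch, not a prescribed format: the catalog entries (orders_db, recommendations, pricing_service), the feature names, and the helper functions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class DependencyType(Enum):
    CRITICAL = "critical"    # the feature cannot work without it
    IMPORTANT = "important"  # the feature works with reduced capability
    OPTIONAL = "optional"    # the feature can simply skip it


@dataclass(frozen=True)
class DependencyEntry:
    name: str
    dependency_type: DependencyType
    features: tuple[str, ...]  # features that call this dependency
    fallback: str              # documented fallback strategy


# Hypothetical catalog for an online store (illustrative names only).
CATALOG = [
    DependencyEntry("orders_db", DependencyType.CRITICAL,
                    ("checkout",), "none: fail fast"),
    DependencyEntry("recommendations", DependencyType.OPTIONAL,
                    ("product_page",), "hide the widget"),
    DependencyEntry("pricing_service", DependencyType.IMPORTANT,
                    ("product_page", "checkout"), "last cached price"),
]


def hard_dependencies(feature: str) -> list[str]:
    """Return the dependencies whose failure takes the feature down."""
    return [d.name for d in CATALOG
            if feature in d.features
            and d.dependency_type is DependencyType.CRITICAL]


def impact_of_outage(dependency: str) -> dict[str, str]:
    """Map each affected feature to 'unavailable' or 'degraded'."""
    impact: dict[str, str] = {}
    for d in CATALOG:
        if d.name != dependency:
            continue
        critical = d.dependency_type is DependencyType.CRITICAL
        for feature in d.features:
            impact[feature] = "unavailable" if critical else "degraded"
    return impact
```

Calling impact_of_outage("pricing_service") on this catalog reports both product_page and checkout as degraded rather than unavailable, which is the distinction between a soft and a hard dependency.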

2. Design Fallback Mechanisms

  • Implement cached responses for unavailable services
  • Create default behaviors when dependencies fail
  • Design simplified workflows that bypass failed components
  • Establish static content delivery for dynamic services
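One way to combine cached responses and default behaviors is a fallback chain: try the primary call, then the last known good value, then a static default. The sketch below is an in-memory illustration under assumed names (FallbackCache, fetch_with_fallback); a production system would more likely keep the cache in a shared store.

```python
import time
from typing import Any, Callable

_MISSING = object()  # sentinel so that a cached None is still a valid hit


class FallbackCache:
    """Keeps the last good value per key for use when the primary fails."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Any:
        """Return the cached value, or _MISSING if absent or too old."""
        entry = self._entries.get(key)
        if entry is None:
            return _MISSING
        stored_at, value = entry
        if self._clock() - stored_at > self.max_age_seconds:
            return _MISSING
        return value


def fetch_with_fallback(key: str, primary: Callable[[str], Any],
                        cache: FallbackCache, default: Any) -> dict[str, Any]:
    """Try the primary call, then the cache, then a static default."""
    try:
        value = primary(key)
    except Exception:
        cached = cache.get(key)
        if cached is not _MISSING:
            return {"data": cached, "source": "cache"}
        return {"data": default, "source": "default"}
    cache.put(key, value)  # remember the last good response
    return {"data": value, "source": "primary"}
```

Returning the source alongside the data lets callers tell users or downstream systems that the value may be stale.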

3. Implement Feature Toggles and Circuit Breakers

  • Deploy feature flags to disable non-essential functionality
  • Implement circuit breakers to detect and isolate failures
  • Create automatic fallback activation based on health checks
  • Design manual override capabilities for emergency situations
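A circuit breaker stops calling a failing dependency for a while and serves the fallback immediately instead. The following is a minimal single-threaded sketch of the pattern; the class name, thresholds, and state names are assumptions for illustration, and it omits the locking a multi-threaded service would need.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Closed: calls pass through. Open: calls are short-circuited.
    Half-open: after the reset timeout, one trial call is allowed."""

    def __init__(self, failure_threshold: int = 3,
                 reset_timeout_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_seconds:
            return "half_open"
        return "open"

    def call(self, func: Callable[[], Any],
             fallback: Callable[[], Any]) -> Any:
        if self.state == "open":
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self._record_failure()
            return fallback()
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # Trip when the threshold is reached, or re-trip immediately
        # when the trial call in the half-open state fails.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A manual override for emergencies could be added by exposing methods that force the breaker open or closed; that is left out here to keep the sketch short.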

4. Establish Graceful User Experience

  • Design user interfaces that adapt to reduced functionality
  • Implement informative error messages and status indicators
  • Provide alternative workflows when primary paths fail
  • Maintain core user journeys even with degraded services
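On the response side, one possible approach is to render each part of a page independently, drop optional parts that fail, and tell the user what is missing. The sketch below assumes a hypothetical build_page helper and section names; it is an illustration of the idea rather than a UI framework feature.

```python
from typing import Any, Callable


def build_page(sections: dict[str, Callable[[], Any]],
               required: set[str]) -> dict[str, Any]:
    """Render sections independently.

    A failure in a required section propagates to the caller, because
    the core journey cannot continue without it. A failure in any other
    section is recorded and the page is returned without that section.
    """
    page: dict[str, Any] = {"sections": {}, "degraded": [], "notice": None}
    for name, render in sections.items():
        try:
            page["sections"][name] = render()
        except Exception:
            if name in required:
                raise
            page["degraded"].append(name)
    if page["degraded"]:
        # Status message the UI can show instead of a generic error page.
        page["notice"] = ("Some features are temporarily unavailable: "
                          + ", ".join(page["degraded"]))
    return page
```

For example, a product page built from a required "product" section and an optional "reviews" section still renders when the reviews service is down, with a notice naming the missing feature.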

5. Implement Data and State Management

  • Cache critical data locally for offline operation
  • Design eventual consistency patterns for data synchronization
  • Implement read-only modes when write operations fail
  • Create data replication strategies for high availability
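A read-only mode can be sketched as a wrapper that keeps serving reads from a local copy after the write path fails. The names below (DegradableStore, ReadOnlyModeError, backend_write) are hypothetical, and the local dictionary stands in for whatever cache or replica the system actually uses.

```python
from typing import Any, Callable


class ReadOnlyModeError(Exception):
    """Raised when a write is rejected because the store is read-only."""


class DegradableStore:
    """Serves reads from a local copy and switches to read-only mode
    when the write backend fails."""

    def __init__(self, backend_write: Callable[[str, Any], None]):
        self._backend_write = backend_write  # may raise on failure
        self._local: dict[str, Any] = {}
        self.read_only = False

    def read(self, key: str) -> Any:
        # Reads keep working in read-only mode.
        return self._local.get(key)

    def write(self, key: str, value: Any) -> None:
        if self.read_only:
            raise ReadOnlyModeError("writes are temporarily disabled")
        try:
            self._backend_write(key, value)
        except Exception as exc:
            self.read_only = True  # stop sending writes to a failing backend
            raise ReadOnlyModeError("write backend unavailable") from exc
        self._local[key] = value  # update the local copy only after success

    def restore_writes(self) -> None:
        """Call once the backend is confirmed healthy again."""
        self.read_only = False
```

Raising a dedicated exception lets the caller show a specific message, such as asking the user to retry later, instead of a generic failure.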

6. Monitor and Test Degradation Scenarios

  • Implement monitoring for dependency health and fallback activation
  • Create automated testing for degradation scenarios
  • Establish alerting for when systems operate in degraded mode
  • Regularly test fallback mechanisms and recovery procedures
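Degradation paths can be exercised in automated tests by injecting failures and counting how often the fallback is used. The sketch below uses hypothetical helpers (FaultInjector, FallbackMetrics, get_recommendations); in a real system the counters would typically be published to a monitoring service instead of being held in memory.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Optional


class FaultInjector:
    """Wraps a callable and makes a configurable share of calls fail."""

    def __init__(self, func: Callable[..., Any], failure_rate: float,
                 rng: Optional[random.Random] = None):
        self._func = func
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always
        self._rng = rng or random.Random()

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self._func(*args, **kwargs)


@dataclass
class FallbackMetrics:
    primary_calls: int = 0
    fallback_calls: int = 0

    def fallback_ratio(self) -> float:
        total = self.primary_calls + self.fallback_calls
        return self.fallback_calls / total if total else 0.0

    def should_alert(self, threshold: float = 0.5) -> bool:
        """True when the share of degraded responses reaches the threshold."""
        return self.fallback_ratio() >= threshold


def get_recommendations(fetch: Callable[[], list],
                        metrics: FallbackMetrics) -> list:
    """Return recommendations, or an empty list if the dependency fails."""
    try:
        items = fetch()
    except Exception:
        metrics.fallback_calls += 1
        return []  # degraded: the page renders without recommendations
    metrics.primary_calls += 1
    return items
```

A test can wrap the real client in FaultInjector with failure_rate=1.0 and assert that get_recommendations returns an empty list and that the alert condition becomes true.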

Implementation Examples

Example 1: Graceful Degradation Framework

import asyncio
import json
import logging
import time
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Callable, Dict, Optional

import boto3

class DependencyType(Enum):
    CRITICAL = "critical"
    IMPORTANT = "important"
    OPTIONAL = "optional"

class DegradationLevel(Enum):
    FULL_FUNCTIONALITY = "full_functionality"
    REDUCED_FUNCTIONALITY = "reduced_functionality"
    MINIMAL_FUNCTIONALITY = "minimal_functionality"
    EMERGENCY_MODE = "emergency_mode"

@dataclass
class DependencyStatus:
    name: str
    dependency_type: DependencyType
    is_healthy: bool
    last_check: datetime
    failure_count: int
    response_time_ms: float
    error_message: Optional[str] = None

@dataclass
class FallbackStrategy:
    dependency_name: str
    fallback_type: str
    fallback_data: Dict[str, Any]
    cache_ttl_seconds: int
    enabled: bool = True

class DependencyHealthChecker:
    """Health checker for system dependencies"""
    
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.dependencies = {}
        self.health_status = {}
        self.check_interval = config.get('check_interval_seconds', 30)
        self.failure_threshold = config.get('failure_threshold', 3)
        
    def register_dependency(self, name: str, dependency_type: DependencyType, 
                          health_check_func: Callable) -> None:
        """Register a dependency for health monitoring"""
        self.dependencies[name] = {
            'type': dependency_type,
            'health_check': health_check_func,
            'status': DependencyStatus(
                name=name,
                dependency_type=dependency_type,
                is_healthy=True,
                last_check=datetime.utcnow(),
                failure_count=0,
                response_time_ms=0.0
            )
        }
        logging.info(f"Registered dependency: {name} ({dependency_type.value})")
    
    async def start_health_monitoring(self):
        """Start continuous health monitoring"""
        while True:
            try:
                await self._check_all_dependencies()
                await asyncio.sleep(self.check_interval)
            except Exception as e:
                logging.error(f"Health monitoring error: {str(e)}")
                await asyncio.sleep(self.check_interval)
    
    async def _check_all_dependencies(self):
        """Check health of all registered dependencies"""
        tasks = []
        for name, dependency in self.dependencies.items():
            task = asyncio.create_task(self._check_dependency_health(name, dependency))
            tasks.append(task)
        
        await asyncio.gather(*tasks, return_exceptions=True)
    
    async def _check_dependency_health(self, name: str, dependency: Dict[str, Any]):
        """Check health of individual dependency"""
        try:
            start_time = time.time()
            health_check_func = dependency['health_check']
            
            # Execute health check
            is_healthy = await health_check_func()
            response_time = (time.time() - start_time) * 1000
            
            # Update status
            status = dependency['status']
            status.is_healthy = is_healthy
            status.last_check = datetime.utcnow()
            status.response_time_ms = response_time
            
            if is_healthy:
                status.failure_count = 0
                status.error_message = None
            else:
                status.failure_count += 1
                status.error_message = "Health check failed"
            
            self.health_status[name] = status
            
            # Log status changes
            if not is_healthy and status.failure_count >= self.failure_threshold:
                logging.warning(f"Dependency {name} is unhealthy (failures: {status.failure_count})")
            elif is_healthy and status.failure_count == 0:
                logging.info(f"Dependency {name} is healthy (response time: {response_time:.2f}ms)")
                
        except Exception as e:
            logging.error(f"Health check failed for {name}: {str(e)}")
            status = dependency['status']
            status.is_healthy = False
            status.failure_count += 1
            status.error_message = str(e)
            status.last_check = datetime.utcnow()
            self.health_status[name] = status
    
    def get_dependency_status(self, name: str) -> Optional[DependencyStatus]:
        """Get current status of a dependency"""
        return self.health_status.get(name)
    
    def get_system_degradation_level(self) -> DegradationLevel:
        """Determine current system degradation level"""
        critical_failures = 0
        important_failures = 0
        
        for status in self.health_status.values():
            if not status.is_healthy and status.failure_count >= self.failure_threshold:
                if status.dependency_type == DependencyType.CRITICAL:
                    critical_failures += 1
                elif status.dependency_type == DependencyType.IMPORTANT:
                    important_failures += 1
        
        if critical_failures > 0:
            return DegradationLevel.EMERGENCY_MODE
        elif important_failures >= 2:
            return DegradationLevel.MINIMAL_FUNCTIONALITY
        elif important_failures >= 1:
            return DegradationLevel.REDUCED_FUNCTIONALITY
        else:
            return DegradationLevel.FULL_FUNCTIONALITY

class GracefulDegradationManager:
    """Manager for graceful degradation strategies"""
    
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.health_checker = DependencyHealthChecker(config.get('health_checker', {}))
        self.fallback_strategies = {}
        self.cache = {}
        self.feature_flags = {}
        
        # AWS clients
        self.dynamodb = boto3.resource('dynamodb')
        self.s3 = boto3.client('s3')
        self.ssm = boto3.client('ssm')
        
        # Cache configuration
        self.cache_table_name = config.get('cache_table_name', 'degradation-cache')
        self.cache_table = self.dynamodb.Table(self.cache_table_name)
    
    def register_fallback_strategy(self, strategy: FallbackStrategy):
        """Register a fallback strategy for a dependency"""
        self.fallback_strategies[strategy.dependency_name] = strategy
        logging.info(f"Registered fallback strategy for {strategy.dependency_name}")
    
    async def execute_with_fallback(self, dependency_name: str, 
                                  primary_func: Callable, 
                                  *args, **kwargs) -> Dict[str, Any]:
        """Execute function with fallback on dependency failure"""
        try:
            # Check dependency health
            status = self.health_checker.get_dependency_status(dependency_name)
            
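            if status is None or status.is_healthy:
                # Dependency is healthy, or no health check has completed yet
                # (health_status stays empty until the first monitoring pass).
                # Execute the primary function; failures fall through to the
                # except block below.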
                result = await primary_func(*args, **kwargs)
                
                # Cache successful result for future fallback
                await self._cache_result(dependency_name, result)
                
                return {
                    'success': True,
                    'data': result,
                    'source': 'primary',
                    'degradation_level': 'none'
                }
            else:
                # Dependency is unhealthy, use fallback
                return await self._execute_fallback(dependency_name, *args, **kwargs)
                
        except Exception as e:
            logging.error(f"Primary function failed for {dependency_name}: {str(e)}")
            return await self._execute_fallback(dependency_name, *args, **kwargs)
    
    async def _execute_fallback(self, dependency_name: str, *args, **kwargs) -> Dict[str, Any]:
        """Execute fallback strategy for failed dependency"""
        try:
            strategy = self.fallback_strategies.get(dependency_name)
            
            if not strategy or not strategy.enabled:
                return {
                    'success': False,
                    'error': f'No fallback strategy available for {dependency_name}',
                    'source': 'none',
                    'degradation_level': 'critical'
                }
            
            if strategy.fallback_type == 'cached_response':
                return await self._get_cached_fallback(dependency_name)
            elif strategy.fallback_type == 'static_response':
                return await self._get_static_fallback(strategy)
            elif strategy.fallback_type == 'alternative_service':
                return await self._get_alternative_service_fallback(strategy, *args, **kwargs)
            elif strategy.fallback_type == 'degraded_functionality':
                return await self._get_degraded_functionality_fallback(strategy, *args, **kwargs)
            else:
                return {
                    'success': False,
                    'error': f'Unknown fallback type: {strategy.fallback_type}',
                    'source': 'fallback',
                    'degradation_level': 'critical'
                }
                
        except Exception as e:
            logging.error(f"Fallback execution failed for {dependency_name}: {str(e)}")
            return {
                'success': False,
                'error': str(e),
                'source': 'fallback',
                'degradation_level': 'critical'
            }
    
    async def _get_cached_fallback(self, dependency_name: str) -> Dict[str, Any]:
        """Get cached response as fallback"""
        try:
            response = self.cache_table.get_item(
                Key={'dependency_name': dependency_name}
            )
            
            cached_data = response.get('Item')
            
            # DynamoDB removes expired items lazily, so also check the TTL here
            if cached_data and time.time() <= float(cached_data.get('ttl', 0)):
                cache_age = time.time() - float(cached_data.get('timestamp', 0))
                
                return {
                    'success': True,
                    'data': json.loads(cached_data.get('data', '{}')),
                    'source': 'cache',
                    'degradation_level': 'reduced',
                    'cache_age_seconds': cache_age
                }
            else:
                return {
                    'success': False,
                    'error': 'No fresh cached data available',
                    'source': 'cache',
                    'degradation_level': 'critical'
                }
                
        except Exception as e:
            logging.error(f"Cache fallback failed: {str(e)}")
            return {
                'success': False,
                'error': str(e),
                'source': 'cache',
                'degradation_level': 'critical'
            }
    
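    async def _get_static_fallback(self, strategy: FallbackStrategy) -> Dict[str, Any]:
        """Get static response as fallback"""
        return {
            'success': True,
            'data': strategy.fallback_data,
            'source': 'static',
            'degradation_level': 'minimal'
        }
    
    async def _get_alternative_service_fallback(self, strategy: FallbackStrategy,
                                                *args, **kwargs) -> Dict[str, Any]:
        """Placeholder: the original example does not define this fallback"""
        return {
            'success': False,
            'error': 'alternative_service fallback is not implemented',
            'source': 'fallback',
            'degradation_level': 'critical'
        }
    
    async def _get_degraded_functionality_fallback(self, strategy: FallbackStrategy,
                                                   *args, **kwargs) -> Dict[str, Any]:
        """Placeholder: the original example does not define this fallback"""
        return {
            'success': False,
            'error': 'degraded_functionality fallback is not implemented',
            'source': 'fallback',
            'degradation_level': 'critical'
        }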
    
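    async def _cache_result(self, dependency_name: str, result: Any):
        """Cache successful result for future fallback use"""
        try:
            # Use the TTL from the registered strategy; default to 1 hour
            strategy = self.fallback_strategies.get(dependency_name)
            ttl_seconds = strategy.cache_ttl_seconds if strategy else 3600
            now = time.time()
            
            # Note: the cache is keyed by dependency name only, so results
            # for different call arguments overwrite each other
            self.cache_table.put_item(
                Item={
                    'dependency_name': dependency_name,
                    'data': json.dumps(result, default=str),
                    'timestamp': str(now),
                    'ttl': int(now + ttl_seconds)  # epoch seconds for DynamoDB TTL
                }
            )
        except Exception as e:
            logging.error(f"Failed to cache result for {dependency_name}: {str(e)}")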

class FeatureToggleManager:
    """Manager for feature toggles during degradation"""
    
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.ssm = boto3.client('ssm')
        self.feature_flags = {}
        self.parameter_prefix = config.get('parameter_prefix', '/app/features/')
    
    async def load_feature_flags(self):
        """Load feature flags from AWS Systems Manager Parameter Store"""
        try:
            # get_parameters_by_path returns results in pages, so walk every page
            paginator = self.ssm.get_paginator('get_parameters_by_path')
            
            for page in paginator.paginate(Path=self.parameter_prefix, Recursive=True):
                for parameter in page['Parameters']:
                    # Strip only the leading prefix from the parameter name
                    feature_name = parameter['Name'].removeprefix(self.parameter_prefix)
                    self.feature_flags[feature_name] = parameter['Value'].lower() == 'true'
            
            logging.info(f"Loaded {len(self.feature_flags)} feature flags")
            
        except Exception as e:
            logging.error(f"Failed to load feature flags: {str(e)}")
    
    def is_feature_enabled(self, feature_name: str) -> bool:
        """Check if a feature is enabled"""
        return self.feature_flags.get(feature_name, True)  # Default to enabled
    
    async def disable_feature(self, feature_name: str):
        """Disable a feature during degradation"""
        try:
            parameter_name = f"{self.parameter_prefix}{feature_name}"
            
            self.ssm.put_parameter(
                Name=parameter_name,
                Value='false',
                Type='String',
                Overwrite=True
            )
            
            self.feature_flags[feature_name] = False
            logging.info(f"Disabled feature: {feature_name}")
            
        except Exception as e:
            logging.error(f"Failed to disable feature {feature_name}: {str(e)}")

# Usage example
async def main():
    config = {
        'health_checker': {
            'check_interval_seconds': 30,
            'failure_threshold': 3
        },
        'cache_table_name': 'degradation-cache'
    }
    
    # Initialize graceful degradation manager
    degradation_manager = GracefulDegradationManager(config)
    
    # Register dependencies
    async def check_user_service_health():
        # Implement actual health check
        return True
    
    degradation_manager.health_checker.register_dependency(
        'user_service',
        DependencyType.IMPORTANT,
        check_user_service_health
    )
    
    # Register fallback strategy
    fallback_strategy = FallbackStrategy(
        dependency_name='user_service',
        fallback_type='cached_response',
        fallback_data={},
        cache_ttl_seconds=3600
    )
    degradation_manager.register_fallback_strategy(fallback_strategy)
    
    # Start health monitoring
    await degradation_manager.health_checker.start_health_monitoring()

if __name__ == "__main__":
    asyncio.run(main())

AWS Services Used

  • AWS Systems Manager Parameter Store: Feature flag management and configuration storage
  • Amazon DynamoDB: Caching layer for fallback responses and dependency status
  • Amazon S3: Static content delivery for degraded functionality
  • Amazon CloudFront: CDN for serving cached and static content during degradation
  • AWS Lambda: Serverless functions for health checks and fallback processing
  • Amazon API Gateway: API management with built-in throttling and fallback responses
  • Amazon ElastiCache: High-performance caching for frequently accessed fallback data
  • Amazon CloudWatch: Monitoring and alerting for degradation events and recovery
  • AWS Step Functions: Workflow orchestration with fallback and retry logic
  • Amazon SQS: Message queuing for asynchronous fallback processing
  • Amazon SNS: Notifications for degradation events and system status changes
  • AWS X-Ray: Distributed tracing for monitoring degradation patterns and performance
  • Amazon Route 53: DNS-based failover and health checking for service endpoints
  • Elastic Load Balancing: Load balancing with health checks and automatic failover
  • AWS Config: Configuration compliance monitoring for degradation policies
  • AWS Secrets Manager: Secure storage of fallback service credentials and API keys

Benefits

  • Improved System Resilience: Core functionality continues even when dependencies fail
  • Better User Experience: Users can still access essential features during outages
  • Reduced Blast Radius: Dependency failures don’t cause complete system outages
  • Faster Recovery: Systems can operate in degraded mode while issues are resolved
  • Cost Optimization: Shedding non-essential features during degraded operation can reduce load on the remaining infrastructure
  • Enhanced Availability: Higher overall system availability through graceful degradation
  • Simplified Incident Response: Clear degradation levels help prioritize recovery efforts
  • Business Continuity: Critical business processes can continue with reduced functionality
  • Improved Testing: Degradation scenarios can be tested and validated regularly
  • Better Monitoring: Clear visibility into system health and degradation levels