REL05-BP05 - Set client timeouts

Overview

Configure appropriate client timeouts for all network operations to prevent indefinite blocking and resource exhaustion. Proper timeout configuration ensures that clients can detect failures quickly, free up resources, and implement appropriate fallback strategies when services become unresponsive.

Implementation Steps

1. Configure Connection Timeouts

Set connection establishment timeouts for all network calls
Configure different timeouts for different service types and criticality levels
Implement timeout values based on network latency and service SLAs
Design timeout escalation for retry scenarios
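
The points above can be sketched with the Python standard library alone. The tier names, the timeout values, and the helper names connect_timeout_for and open_connection are assumptions made for this illustration, and doubling the timeout on each retry is only one possible reading of "timeout escalation".

```python
import socket

# Hypothetical per-tier connection timeouts in seconds.
# Real values should come from measured network latency and service SLAs.
CONNECT_TIMEOUTS_S = {"critical": 0.5, "standard": 2.0, "batch": 5.0}


def connect_timeout_for(tier: str, attempt: int = 0, cap_s: float = 10.0) -> float:
    """Return the connection timeout for a tier, doubled per retry attempt up to a cap."""
    base = CONNECT_TIMEOUTS_S.get(tier, CONNECT_TIMEOUTS_S["standard"])
    return min(base * (2 ** attempt), cap_s)


def open_connection(host: str, port: int, tier: str = "standard",
                    attempt: int = 0) -> socket.socket:
    """Open a TCP connection, failing fast if it cannot be established in time.

    socket.create_connection also leaves this timeout set on the returned
    socket, so callers should set a separate read/write timeout afterwards.
    """
    return socket.create_connection(
        (host, port), timeout=connect_timeout_for(tier, attempt)
    )
```

Keeping the escalation capped stops a long retry sequence from growing the connection timeout without bound.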

2. Establish Read and Write Timeouts

Configure read timeouts for data retrieval operations
Set write timeouts for data submission operations
Implement different timeouts for streaming vs batch operations
Design timeout handling for long-running operations
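
As a minimal sketch of separate read and write timeouts, the following uses plain sockets from the standard library. The helper names recv_with_timeout and send_with_timeout are invented for this example; higher-level HTTP or database clients usually expose equivalent settings of their own.

```python
import socket


def recv_with_timeout(sock: socket.socket, nbytes: int, read_timeout_s: float) -> bytes:
    """Receive up to nbytes, raising TimeoutError if no data arrives in time."""
    sock.settimeout(read_timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout as exc:
        raise TimeoutError(f"read timed out after {read_timeout_s}s") from exc


def send_with_timeout(sock: socket.socket, payload: bytes, write_timeout_s: float) -> None:
    """Send the whole payload, raising TimeoutError if it cannot be written in time."""
    sock.settimeout(write_timeout_s)
    try:
        sock.sendall(payload)
    except socket.timeout as exc:
        raise TimeoutError(f"write timed out after {write_timeout_s}s") from exc
```

A socket read timeout applies to each recv call, so it bounds the gap between chunks of data rather than the duration of the whole response. Streaming and long-running operations therefore still need an overall deadline on top of it.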

3. Implement Request-Level Timeouts

Set end-to-end request timeouts including all retry attempts
Configure per-operation timeouts based on expected processing time
Implement timeout propagation across service boundaries
Design timeout budgets for complex workflows
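
One way to picture an end-to-end budget that covers all retry attempts is a deadline object shared by every attempt. The sketch below is an assumption-laden illustration: Deadline and call_with_budget are hypothetical names, and the operation is assumed to accept its remaining time so that it can forward it to downstream services.

```python
import asyncio
import time


class Deadline:
    """Tracks how much of an end-to-end time budget remains."""

    def __init__(self, budget_s: float):
        self._expires_at = time.monotonic() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - time.monotonic())


async def call_with_budget(operation, deadline: Deadline,
                           per_attempt_s: float, max_attempts: int = 3):
    """Retry operation(timeout_s) without letting the attempts exceed the deadline."""
    for _ in range(max_attempts):
        remaining = deadline.remaining()
        if remaining <= 0:
            break
        # Each attempt gets the smaller of its own limit and what is left overall.
        attempt_timeout = min(per_attempt_s, remaining)
        try:
            # The operation receives its timeout so it can propagate it downstream.
            return await asyncio.wait_for(operation(attempt_timeout),
                                          timeout=attempt_timeout)
        except asyncio.TimeoutError:
            continue
    raise TimeoutError("request budget exhausted")
```

Passing the remaining time to the operation lets a downstream service stop working on a request the caller has already abandoned.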

4. Configure Service-Specific Timeouts

Set database connection and query timeouts
Configure cache operation timeouts
Implement API call timeouts with appropriate values
Design timeout strategies for third-party service integrations
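
To make the database item concrete, here is a small sketch that uses sqlite3 from the standard library as a stand-in for a real database. The helper name query_with_timeout is an assumption; production databases and drivers normally provide their own query timeout settings (for example, PostgreSQL's statement_timeout), which should be preferred where available.

```python
import sqlite3
import time


def query_with_timeout(conn: sqlite3.Connection, sql: str, params=(),
                       timeout_s: float = 1.0):
    """Run a query and abort it if it runs longer than timeout_s."""
    deadline = time.monotonic() + timeout_s
    # SQLite calls the handler every 1000 VM instructions;
    # a non-zero return value aborts the running query.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise TimeoutError(f"query exceeded {timeout_s}s") from exc
        raise
    finally:
        # Remove the handler so later queries are not affected.
        conn.set_progress_handler(None, 0)
```

The timeout argument of sqlite3.connect is a different control: it limits how long a connection waits for a lock, not how long a query may run, so both are worth setting.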

5. Implement Timeout Monitoring and Alerting

Track timeout occurrences and patterns
Monitor timeout effectiveness and false positives
Implement automated timeout tuning based on performance data
Create dashboards for timeout metrics and analysis
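
A simple starting point for tuning is to derive a suggested timeout from observed latencies. The sketch below is hypothetical: the TimeoutTuner class, the 99th-percentile choice, and the 1.5x headroom factor are all assumptions, and the result is meant as a suggestion to review rather than a value to apply automatically.

```python
import statistics
from collections import deque
from typing import Optional


class TimeoutTuner:
    """Tracks call outcomes and suggests a timeout from recent latencies."""

    def __init__(self, window: int = 1000, headroom: float = 1.5):
        self.latencies_ms = deque(maxlen=window)  # successful calls only
        self.headroom = headroom
        self.calls = 0
        self.timeouts = 0

    def record(self, latency_ms: float, timed_out: bool) -> None:
        self.calls += 1
        if timed_out:
            self.timeouts += 1
        else:
            self.latencies_ms.append(latency_ms)

    def timeout_rate(self) -> float:
        """Fraction of recorded calls that timed out."""
        return self.timeouts / self.calls if self.calls else 0.0

    def suggested_timeout_ms(self, min_samples: int = 100) -> Optional[float]:
        """Return p99 latency times headroom, or None with too little data."""
        if len(self.latencies_ms) < min_samples:
            return None
        p99 = statistics.quantiles(self.latencies_ms, n=100)[98]
        return p99 * self.headroom
```

Because calls that timed out are excluded from the latency window, the estimate is biased low whenever the current timeout is already too tight. A rising timeout rate is the signal to check for that case.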

6. Design Timeout Error Handling

Implement graceful timeout error handling
Design fallback strategies when timeouts occur
Create informative timeout error messages
Establish timeout retry policies and backoff strategies
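
The error-handling items above might be combined as follows. This is a sketch under stated assumptions: ServiceTimeoutError and call_with_fallback are hypothetical names, the backoff uses exponential delays with full jitter, and the fallback is assumed to be something cheap such as a cached response.

```python
import asyncio
import random


class ServiceTimeoutError(TimeoutError):
    """Timeout error that says which call failed and under what limits."""

    def __init__(self, service: str, operation: str, timeout_s: float, attempts: int):
        super().__init__(
            f"{service}.{operation} timed out after {attempts} attempt(s) "
            f"of {timeout_s}s each"
        )
        self.service = service
        self.operation = operation


async def call_with_fallback(primary, fallback, *, service: str, operation: str,
                             timeout_s: float = 1.0, max_attempts: int = 3,
                             base_delay_s: float = 0.1, max_delay_s: float = 2.0):
    """Try primary() with a timeout and backoff, then fall back or raise."""
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(primary(), timeout=timeout_s)
        except asyncio.TimeoutError:
            if attempt < max_attempts - 1:
                # Exponential backoff with full jitter to avoid synchronized retries.
                ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
                await asyncio.sleep(random.uniform(0, ceiling))
    if fallback is not None:
        return await fallback()
    raise ServiceTimeoutError(service, operation, timeout_s, max_attempts)
```

Retrying after a timeout is only safe for idempotent operations, because the original request may still have been processed by the service.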

Implementation Examples

Example 1: Comprehensive Client Timeout Management

import asyncio
import aiohttp
import time
import logging
from typing import Dict, Optional, Any
from dataclasses import dataclass
from enum import Enum
import boto3
from contextlib import asynccontextmanager

class TimeoutType(Enum):
    CONNECTION = "connection"
    READ = "read"
    WRITE = "write"
    TOTAL = "total"

@dataclass
class TimeoutConfig:
    connection_timeout_ms: int = 5000
    read_timeout_ms: int = 30000
    write_timeout_ms: int = 30000
    total_timeout_ms: int = 60000
    retry_timeout_ms: int = 120000

class TimeoutManager:
    """Centralized timeout configuration management"""
    
    def __init__(self, config: Dict[str, Any]):
        self.default_timeouts = TimeoutConfig(**config.get('default_timeouts', {}))
        self.service_timeouts = {}
        
        # Load service-specific timeouts
        for service, timeout_config in config.get('service_timeouts', {}).items():
            self.service_timeouts[service] = TimeoutConfig(**timeout_config)
    
    def get_timeout_config(self, service_name: str = "default") -> TimeoutConfig:
        """Get timeout configuration for a service"""
        return self.service_timeouts.get(service_name, self.default_timeouts)
    
    def update_timeout_config(self, service_name: str, timeout_config: TimeoutConfig):
        """Update timeout configuration for a service"""
        self.service_timeouts[service_name] = timeout_config
        logging.info(f"Updated timeout config for {service_name}")

class HTTPClientWithTimeouts:
    """HTTP client with comprehensive timeout handling"""
    
    def __init__(self, timeout_manager: TimeoutManager):
        self.timeout_manager = timeout_manager
        self.session = None
        
    async def __aenter__(self):
        """Async context manager entry"""
        self.session = aiohttp.ClientSession()
        return self
    
    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Async context manager exit"""
        if self.session:
            await self.session.close()
    
    async def get(self, url: str, service_name: str = "default", **kwargs) -> Dict[str, Any]:
        """HTTP GET with timeout handling"""
        return await self._make_request('GET', url, service_name, **kwargs)
    
    async def post(self, url: str, service_name: str = "default", **kwargs) -> Dict[str, Any]:
        """HTTP POST with timeout handling"""
        return await self._make_request('POST', url, service_name, **kwargs)
    
    async def _make_request(self, method: str, url: str, service_name: str, **kwargs) -> Dict[str, Any]:
        """Make HTTP request with comprehensive timeout handling"""
        timeout_config = self.timeout_manager.get_timeout_config(service_name)
        
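        # Create aiohttp timeout configuration.
        # `connect` covers acquiring a connection from the pool, including
        # establishing a new one. aiohttp has no separate write timeout, so
        # write_timeout_ms is not applied here; `total` bounds the whole request.
        timeout = aiohttp.ClientTimeout(
            total=timeout_config.total_timeout_ms / 1000,
            connect=timeout_config.connection_timeout_ms / 1000,
            sock_read=timeout_config.read_timeout_ms / 1000
        )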
        
        start_time = time.time()
        
        try:
            async with self.session.request(method, url, timeout=timeout, **kwargs) as response:
                response_time = (time.time() - start_time) * 1000
                
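                # Read and parse the body; the session timeouts still apply here
                try:
                    data = await response.json()
                except asyncio.TimeoutError:
                    # Re-raise as asyncio.TimeoutError so the outer handler
                    # classifies it as a read timeout on every Python version
                    raise asyncio.TimeoutError(
                        f"read timeout after {timeout_config.read_timeout_ms}ms"
                    ) from None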
                
                return {
                    'success': True,
                    'status_code': response.status,
                    'data': data,
                    'response_time_ms': response_time,
                    'service_name': service_name
                }
                
        except asyncio.TimeoutError as e:
            response_time = (time.time() - start_time) * 1000
            timeout_type = self._classify_timeout_error(str(e))
            
            logging.warning(f"Timeout for {service_name} {method} {url}: {timeout_type} after {response_time:.2f}ms")
            
            return {
                'success': False,
                'error': f'{timeout_type} timeout',
                'timeout_type': timeout_type,
                'response_time_ms': response_time,
                'service_name': service_name
            }
        
        except Exception as e:
            response_time = (time.time() - start_time) * 1000
            logging.error(f"Request failed for {service_name}: {str(e)}")
            
            return {
                'success': False,
                'error': str(e),
                'response_time_ms': response_time,
                'service_name': service_name
            }
    
    def _classify_timeout_error(self, error_message: str) -> str:
        """Classify timeout error type"""
        error_lower = error_message.lower()
        
        if 'connect' in error_lower:
            return 'connection'
        elif 'read' in error_lower:
            return 'read'
        elif 'write' in error_lower:
            return 'write'
        else:
            return 'total'

class DatabaseClientWithTimeouts:
    """Database client with timeout configuration"""
    
    def __init__(self, timeout_manager: TimeoutManager):
        self.timeout_manager = timeout_manager
        self.connection_pool = None
    
    async def execute_query(self, query: str, params: Optional[Dict] = None, 
                          service_name: str = "database") -> Dict[str, Any]:
        """Execute database query with timeout"""
        timeout_config = self.timeout_manager.get_timeout_config(service_name)
        
        start_time = time.time()
        
        try:
            # Simulate database query with timeout
            result = await asyncio.wait_for(
                self._execute_query_impl(query, params),
                timeout=timeout_config.total_timeout_ms / 1000
            )
            
            response_time = (time.time() - start_time) * 1000
            
            return {
                'success': True,
                'data': result,
                'response_time_ms': response_time,
                'service_name': service_name
            }
            
        except asyncio.TimeoutError:
            response_time = (time.time() - start_time) * 1000
            logging.warning(f"Database query timeout after {response_time:.2f}ms")
            
            return {
                'success': False,
                'error': 'Query timeout',
                'response_time_ms': response_time,
                'service_name': service_name
            }
        
        except Exception as e:
            response_time = (time.time() - start_time) * 1000
            logging.error(f"Database query failed: {str(e)}")
            
            return {
                'success': False,
                'error': str(e),
                'response_time_ms': response_time,
                'service_name': service_name
            }
    
    async def _execute_query_impl(self, query: str, params: Optional[Dict] = None):
        """Simulate database query execution"""
        # Simulate query processing time
        await asyncio.sleep(0.1)
        return {'rows': [{'id': 1, 'name': 'test'}]}

class AWSClientWithTimeouts:
    """AWS service client with timeout configuration"""
    
    def __init__(self, timeout_manager: TimeoutManager):
        self.timeout_manager = timeout_manager
        self.clients = {}
    
    def get_client(self, service_name: str, aws_service: str):
        """Get AWS client with timeout configuration"""
        timeout_config = self.timeout_manager.get_timeout_config(service_name)
        
        # Cache per (service_name, aws_service) so that one logical service
        # using several AWS APIs does not get the wrong client back
        cache_key = (service_name, aws_service)
        if cache_key not in self.clients:
            from botocore.config import Config
            
            # Configure boto3 client with timeouts
            config = Config(
                connect_timeout=timeout_config.connection_timeout_ms / 1000,
                read_timeout=timeout_config.read_timeout_ms / 1000,
                retries={'max_attempts': 0}  # Handle retries separately
            )
            
            self.clients[cache_key] = boto3.client(aws_service, config=config)
        
        return self.clients[cache_key]
    
    async def call_aws_service(self, service_name: str, aws_service: str, 
                             operation: str, **kwargs) -> Dict[str, Any]:
        """Call AWS service with timeout handling"""
        timeout_config = self.timeout_manager.get_timeout_config(service_name)
        client = self.get_client(service_name, aws_service)
        
        start_time = time.time()
        
        try:
            # Execute AWS operation with total timeout.
            # Note: wait_for only stops waiting; the worker thread keeps running
            # until boto3's own connect/read timeouts fire.
            operation_func = getattr(client, operation)
            
            result = await asyncio.wait_for(
                asyncio.get_running_loop().run_in_executor(
                    None, lambda: operation_func(**kwargs)
                ),
                timeout=timeout_config.total_timeout_ms / 1000
            )
            
            response_time = (time.time() - start_time) * 1000
            
            return {
                'success': True,
                'data': result,
                'response_time_ms': response_time,
                'service_name': service_name
            }
            
        except asyncio.TimeoutError:
            response_time = (time.time() - start_time) * 1000
            logging.warning(f"AWS {aws_service} {operation} timeout after {response_time:.2f}ms")
            
            return {
                'success': False,
                'error': f'AWS {operation} timeout',
                'response_time_ms': response_time,
                'service_name': service_name
            }
        
        except Exception as e:
            response_time = (time.time() - start_time) * 1000
            logging.error(f"AWS {aws_service} {operation} failed: {str(e)}")
            
            return {
                'success': False,
                'error': str(e),
                'response_time_ms': response_time,
                'service_name': service_name
            }

class TimeoutMetricsCollector:
    """Collect and analyze timeout metrics"""
    
    def __init__(self):
        self.timeout_events = []
        self.service_metrics = {}
    
    def record_timeout_event(self, service_name: str, timeout_type: str, 
                           response_time_ms: float):
        """Record timeout event for analysis"""
        event = {
            'service_name': service_name,
            'timeout_type': timeout_type,
            'response_time_ms': response_time_ms,
            'timestamp': time.time()
        }
        
        self.timeout_events.append(event)
        
        # Update service metrics
        if service_name not in self.service_metrics:
            self.service_metrics[service_name] = {
                'total_timeouts': 0,
                'timeout_types': {},
                'avg_response_time': 0
            }
        
        metrics = self.service_metrics[service_name]
        metrics['total_timeouts'] += 1
        
        if timeout_type not in metrics['timeout_types']:
            metrics['timeout_types'][timeout_type] = 0
        metrics['timeout_types'][timeout_type] += 1
        
        # Update average response time
        metrics['avg_response_time'] = (
            (metrics['avg_response_time'] * (metrics['total_timeouts'] - 1) + response_time_ms) /
            metrics['total_timeouts']
        )
    
    def get_timeout_analysis(self, service_name: Optional[str] = None) -> Dict[str, Any]:
        """Get timeout analysis for a service or all services"""
        if service_name:
            return self.service_metrics.get(service_name, {})
        else:
            return {
                'total_events': len(self.timeout_events),
                'service_metrics': self.service_metrics,
                'recent_events': self.timeout_events[-10:]  # Last 10 events
            }

# Usage example
async def main():
    # Configure timeouts
    timeout_config = {
        'default_timeouts': {
            'connection_timeout_ms': 5000,
            'read_timeout_ms': 30000,
            'total_timeout_ms': 60000
        },
        'service_timeouts': {
            'user_service': {
                'connection_timeout_ms': 2000,
                'read_timeout_ms': 10000,
                'total_timeout_ms': 15000
            },
            'payment_service': {
                'connection_timeout_ms': 3000,
                'read_timeout_ms': 20000,
                'total_timeout_ms': 30000
            }
        }
    }
    
    timeout_manager = TimeoutManager(timeout_config)
    metrics_collector = TimeoutMetricsCollector()
    
    # Test HTTP client with timeouts
    async with HTTPClientWithTimeouts(timeout_manager) as http_client:
        # Make requests to different services
        result1 = await http_client.get('https://httpbin.org/delay/1', 'user_service')
        print(f"User service result: {result1}")
        
        result2 = await http_client.get('https://httpbin.org/delay/5', 'payment_service')
        print(f"Payment service result: {result2}")
        
        # Record timeout events if they occurred
        if not result1['success'] and 'timeout' in result1.get('error', ''):
            metrics_collector.record_timeout_event(
                'user_service', 
                result1.get('timeout_type', 'unknown'),
                result1['response_time_ms']
            )
        
        if not result2['success'] and 'timeout' in result2.get('error', ''):
            metrics_collector.record_timeout_event(
                'payment_service',
                result2.get('timeout_type', 'unknown'), 
                result2['response_time_ms']
            )
    
    # Test database client
    db_client = DatabaseClientWithTimeouts(timeout_manager)
    db_result = await db_client.execute_query("SELECT * FROM users", service_name="database")
    print(f"Database result: {db_result}")
    
    # Get timeout analysis
    analysis = metrics_collector.get_timeout_analysis()
    print(f"Timeout analysis: {analysis}")

if __name__ == "__main__":
    asyncio.run(main())

AWS Services Used

AWS SDK for Python (Boto3): Built-in connect and read timeout configuration for all AWS service calls
Amazon API Gateway: Request timeout configuration and client timeout handling
AWS Lambda: Function timeout settings and client invocation timeouts
Amazon RDS: Database connection and query timeout configuration
Amazon DynamoDB: Request timeout and connection timeout settings
Amazon ElastiCache: Connection timeout and operation timeout configuration
Amazon S3: Upload/download timeout configuration for large objects
Amazon SQS: Receive wait time (long polling) and visibility timeout settings
AWS Step Functions: State timeout and heartbeat timeout configuration
Amazon Kinesis: Stream read/write timeout configuration
AWS Systems Manager: Parameter Store timeout configuration
Amazon CloudWatch: Timeout metrics monitoring and alerting
AWS X-Ray: Timeout pattern analysis and distributed tracing
Amazon Route 53: Health check timeout configuration
Elastic Load Balancing: Backend timeout and connection timeout settings

Benefits

Improved System Responsiveness: Prevents indefinite blocking and resource exhaustion
Better Error Detection: Quick identification of unresponsive services and network issues
Resource Management: Prevents connection pool exhaustion and the buildup of blocked threads or tasks
Enhanced User Experience: Faster error feedback and fallback activation
System Stability: Prevents cascading failures due to hanging connections
Better Monitoring: Clear visibility into service response times and timeout patterns
Cost Optimization: Reduced resource consumption through proper timeout handling
Improved Debugging: Easier identification of performance bottlenecks
SLA Compliance: Predictable response times through proper timeout configuration
Operational Efficiency: Automated timeout handling reduces manual intervention
