OPS08 - How do you understand the health of your workload?
Best Practices
This question includes the following best practices:
Key Concepts
Strategy and Governance
Health indicators: Signals such as latency, error rate, throughput, and saturation that show whether the workload is operating as expected. Give each indicator a measurable target and a clear owner.
Customer-impact metrics: Measures of what users actually experience, such as completed transactions or page load time, rather than internal resource counters. Review them regularly against expected business outcomes.
Dependency visibility: Knowledge of which upstream and downstream services the workload relies on and how healthy each one is, so that a failure can be traced to its source.
Operational Execution
SLI/SLO management: A service level indicator (SLI) measures one aspect of service quality, and a service level objective (SLO) is the target for that measurement. Assign an owner to each SLO and review attainment on a regular schedule.
Anomaly detection: Identifying metric behavior that deviates from an established baseline, which can surface problems that fixed thresholds miss.
Capacity awareness: Tracking utilization against known limits and quotas so that teams can add headroom before saturation affects users.
Implementation Approach
1. Define workload health model
- Identify critical user journeys and associated SLIs
- Set SLO targets and error budgets per service
- Map critical dependencies and failure domains
- Define dashboard views for executives and operators
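As an illustration of how an SLI, an SLO target, and an error budget relate, here is a minimal Python sketch. The `Slo` class, the `checkout` service name, and the request counts are all hypothetical, and it assumes a simple request-based availability SLI; a real health model may use latency or other indicators.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """An availability SLO for one service (illustrative, not an AWS type)."""
    service: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI = fraction of events that were good; no traffic counts as healthy."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Hypothetical numbers: 600 failed requests out of 1,000,000 in the SLO window.
checkout = Slo(service="checkout", target=0.999)
sli = availability_sli(good_events=999_400, total_events=1_000_000)
remaining = error_budget_remaining(checkout, 999_400, 1_000_000)
print(f"{checkout.service}: SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

With a 99.9% target, one million requests allow about 1,000 failures, so 600 failures leave roughly 40% of the budget unspent.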
2. Implement telemetry and dashboards
- Collect metrics, logs, and traces for critical components
- Create service and workload-level health dashboards
- Configure alarms with customer-impact severity levels
- Run synthetic checks against key endpoints
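One way to attach a customer-impact severity to an alarm is to route each severity level to a different notification target. The sketch below only builds the keyword arguments for the CloudWatch `put_metric_alarm` call; the severity names, SNS topic ARNs, thresholds, and load balancer identifier are hypothetical, and the choice of the Application Load Balancer 5XX metric is an assumption about the workload.

```python
from typing import Any

# Hypothetical severity levels mapped to hypothetical SNS topic ARNs.
SEVERITY_TOPICS = {
    "sev1": "arn:aws:sns:us-east-1:123456789012:page-oncall",
    "sev3": "arn:aws:sns:us-east-1:123456789012:ops-ticket-queue",
}


def error_rate_alarm(
    journey: str, load_balancer: str, severity: str, threshold: int
) -> dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{severity}-{journey}-5xx",
        "AlarmDescription": f"{severity}: target 5XX errors on the {journey} journey",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,              # one-minute datapoints
        "EvaluationPeriods": 5,    # look at the last five datapoints
        "DatapointsToAlarm": 3,    # alarm when three of the five breach
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no errors reported means healthy
        "AlarmActions": [SEVERITY_TOPICS[severity]],
    }


def create_alarm(params: dict[str, Any]) -> None:
    """Create the alarm; needs boto3 and AWS credentials, so it is not called here."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical load balancer identifier, for illustration only.
params = error_rate_alarm("checkout", "app/example-alb/0123456789abcdef", "sev1", 50)
print(params["AlarmName"], "->", params["AlarmActions"][0])
```

Keeping the parameter construction separate from the API call makes the severity mapping easy to review and unit test without touching an AWS account.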
3. Operationalize health management
- Review SLO performance in recurring ops meetings
- Track incidents against error budget policies
- Automate notifications and incident creation on threshold breach
- Correlate business KPIs with technical health signals
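Tracking incidents against an error budget policy is often done with burn rates: how fast the budget is being consumed relative to the rate that would exactly exhaust it over the SLO window. The following sketch assumes a two-window check and uses thresholds of 14.4 (page) and 6 (ticket), which are one commonly cited convention for a 30-day SLO rather than anything this guidance prescribes. The function names and the simplification of using one window pair for both thresholds are assumptions.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def classify_burn(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    page_at: float = 14.4,   # assumed paging threshold
    ticket_at: float = 6.0,  # assumed ticketing threshold
) -> str:
    """Return 'page', 'ticket', or 'none'.

    Both windows must exceed a threshold: the long window shows the burn is
    significant, the short window shows it is still happening.
    """
    long_burn = burn_rate(long_window_error_ratio, slo_target)
    short_burn = burn_rate(short_window_error_ratio, slo_target)
    sustained = min(long_burn, short_burn)
    if sustained >= page_at:
        return "page"
    if sustained >= ticket_at:
        return "ticket"
    return "none"


# Hypothetical error ratios for a 99.9% SLO.
print(classify_burn(0.02, 0.03, slo_target=0.999))
print(classify_burn(0.008, 0.02, slo_target=0.999))
print(classify_burn(0.0005, 0.05, slo_target=0.999))
```

The returned action could then drive the notification or incident-creation step, for example by choosing which alarm action or ticket queue to use.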
4. Refine continuously
- Analyze false positives and missed detections
- Adjust SLIs and thresholds as workload evolves
- Expand telemetry for new dependencies
- Use trends to drive preventive improvements
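False positives and missed detections can be summarized as alert precision and incident recall. This is a minimal sketch that assumes each alert has been labeled during review with the incident it corresponded to, or with nothing if it was noise; the record shape, the `alert_quality` function, and the sample IDs are hypothetical.

```python
from typing import Optional


def alert_quality(
    alerts: list[tuple[str, Optional[str]]], incidents: list[str]
) -> dict:
    """Summarize alert noise and detection gaps for a review period.

    alerts: (alert_id, linked_incident_id or None if the alert was noise)
    incidents: IDs of all real incidents in the same period
    """
    linked = [incident for _, incident in alerts if incident is not None]
    false_positives = [alert_id for alert_id, incident in alerts if incident is None]
    detected = set(linked) & set(incidents)
    missed = sorted(set(incidents) - detected)
    return {
        # Share of alerts that pointed at a real incident.
        "precision": len(linked) / len(alerts) if alerts else 1.0,
        # Share of real incidents that at least one alert caught.
        "recall": len(detected) / len(incidents) if incidents else 1.0,
        "false_positives": false_positives,
        "missed_incidents": missed,
    }


# Hypothetical review data: four alerts, three incidents.
alerts = [("a1", "INC-1"), ("a2", None), ("a3", "INC-1"), ("a4", "INC-2")]
incidents = ["INC-1", "INC-2", "INC-3"]
report = alert_quality(alerts, incidents)
print(report)
```

Low precision suggests thresholds are too sensitive, while low recall points to missing telemetry or SLIs that do not cover the failure mode.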
AWS Services to Consider
Amazon CloudWatch
Collects metrics and logs, and provides alarms and dashboards, so teams can detect issues early and track operational outcomes.
AWS X-Ray
Traces distributed requests to identify latency bottlenecks and dependency failures across microservices.
AWS Health Dashboard
Provides service-wide and account-specific health events so operators can respond quickly when AWS service events affect their resources.
Amazon Route 53
Provides DNS routing policies and health checks for latency and availability optimization.
Amazon EventBridge
Routes events between services and triggers automated responses for operational events.
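As a sketch of how these services can fit together, CloudWatch alarm state changes can be matched by an EventBridge rule and routed to an automated response. The event pattern below uses the `aws.cloudwatch` source and the `CloudWatch Alarm State Change` detail type; the `matches` helper is a hypothetical, heavily simplified local matcher (exact values only) meant for checking the pattern against a sample event, not a reimplementation of EventBridge matching. The sample alarm name is also hypothetical.

```python
import json

# Event pattern for alarms that have entered the ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified check: every pattern field must be present in the event,
    and the event value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Trimmed sample event; real events carry more fields.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "sev1-checkout-5xx", "state": {"value": "ALARM"}},
}

print(matches(ALARM_PATTERN, sample_event))
# The pattern is passed to EventBridge as a JSON string.
print(json.dumps(ALARM_PATTERN))
```

The JSON string form is what would be supplied as the event pattern when creating a rule, with a target such as an SNS topic or a Lambda function handling the response.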
Common Challenges and Solutions
Challenge: Technical metrics not tied to customer experience
Solution: Prioritize SLIs that represent user-visible outcomes and map alerts to business impact.
Challenge: Siloed observability data
Solution: Unify dashboards and incident workflows across logs, traces, and metrics.
Challenge: Unclear ownership of SLO breaches
Solution: Assign explicit owners per SLO and define response actions for budget burn rates.