OPS08 - How do you understand the health of your workload?

Best Practices

This question includes the following best practices:

Key Concepts

Strategy and Governance

Health indicators: Signals such as availability, latency, and error rate that show whether the workload is meeting its objectives. Define a measurable target and an owner for each indicator, and review results regularly against expected business outcomes.

Customer-impact metrics: Measurements of what users actually experience, such as failed requests or slow responses on critical journeys. Prioritize these over internal resource metrics when setting targets and alert severity.

Dependency visibility: Knowledge of which upstream and downstream services the workload relies on and how their failures propagate. Map critical dependencies and failure domains so that degradation can be traced to its source.

Operational Execution

SLI/SLO management: A service level indicator (SLI) measures one aspect of service quality; a service level objective (SLO) sets the target for it. Assign an owner to each SLO and track performance against its error budget.

Anomaly detection: Identification of metric behavior that deviates from an expected baseline. Tune thresholds by reviewing false positives and missed detections.

Capacity awareness: Understanding of resource utilization and remaining headroom relative to demand. Use trends to act before capacity limits affect users.

Implementation Approach

1. Define workload health model

  • Identify critical user journeys and associated SLIs
  • Set SLO targets and error budgets per service
  • Map critical dependencies and failure domains
  • Define dashboard views for executives and operators
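
The relationship between an SLO target and its error budget can be made concrete with a little arithmetic. The following is a minimal sketch, not a prescribed implementation: the `Slo` class, its field names, and the "checkout" journey are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical SLO definition for one critical user journey."""
    journey: str      # e.g. "checkout"
    sli: str          # what is measured
    target: float     # required fraction of good events, e.g. 0.999
    window_days: int  # length of the rolling evaluation window

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget at all.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_fraction(slo: Slo) -> float:
    """Fraction of events that may fail without breaching the SLO."""
    return 1.0 - slo.target


def allowed_downtime_minutes(slo: Slo) -> float:
    """Error budget expressed as minutes of full outage per window."""
    return error_budget_fraction(slo) * slo.window_days * 24 * 60


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_events == 0:
        return 1.0
    allowed_bad = error_budget_fraction(slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


checkout = Slo("checkout", "HTTP responses without a 5xx status", 0.999, 30)
print(round(allowed_downtime_minutes(checkout), 1))
print(round(budget_remaining(checkout, 999_500, 1_000_000), 2))
```

Under these assumptions, a 99.9% target over 30 days leaves about 43 minutes of budget, and 500 failures in one million requests consume half of it.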

2. Implement telemetry and dashboards

  • Collect metrics, logs, and traces for critical components
  • Create service and workload-level health dashboards
  • Configure alarms with customer-impact severity levels
  • Run synthetic checks against key endpoints
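
One way to attach customer-impact severity to an alarm is to derive the alarm's name and notification target from the severity level. The sketch below only builds the keyword arguments for the CloudWatch `put_metric_alarm` call and does not contact AWS. The severity names, SNS topic ARNs, account ID, load balancer identifier, and threshold are all placeholder assumptions.

```python
from typing import Any

# Hypothetical mapping from customer-impact severity to SNS topic ARNs.
# The account ID, region, and topic names are placeholders.
SEVERITY_TOPICS = {
    "sev1": "arn:aws:sns:us-east-1:123456789012:page-oncall",
    "sev2": "arn:aws:sns:us-east-1:123456789012:notify-team",
}


def build_5xx_alarm(name: str, load_balancer: str, severity: str,
                    threshold: float) -> dict[str, Any]:
    """Return keyword arguments for CloudWatch put_metric_alarm."""
    if severity not in SEVERITY_TOPICS:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "AlarmName": f"{severity}-{name}",
        "AlarmDescription": f"{severity}: target 5XX count above {threshold} per minute",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,              # one-minute datapoints
        "EvaluationPeriods": 3,    # breach must persist for three periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not an outage
        "AlarmActions": [SEVERITY_TOPICS[severity]],
    }


params = build_5xx_alarm("checkout-5xx", "app/checkout/0123456789abcdef", "sev1", 50)
# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain function makes the severity mapping easy to review and unit test before anything is deployed.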

3. Operationalize health management

  • Review SLO performance in recurring ops meetings
  • Track incidents against error budget policies
  • Automate notifications and incident creation on threshold breach
  • Correlate business KPIs with technical health signals
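
Tracking incidents against an error budget policy usually comes down to a burn rate: how fast the budget is being spent relative to a pace that would exhaust it exactly at the end of the SLO window. The sketch below assumes a two-window rule and a threshold of 14.4, which corresponds to spending 2% of a 30-day budget in one hour. Both the rule and the threshold are illustrative assumptions, not values taken from this guidance.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    higher values exhaust it proportionally sooner.
    """
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / budget


def should_open_incident(long_window_rate: float, short_window_rate: float,
                         threshold: float = 14.4) -> bool:
    """Trigger only when both windows burn fast.

    The long window shows the burn is significant; the short window
    shows it is still happening, which suppresses stale alerts.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


# Hypothetical counts for a 99.9% SLO: last hour, then last five minutes.
hour = burn_rate(bad_events=200, total_events=10_000, slo_target=0.999)
five_min = burn_rate(bad_events=18, total_events=900, slo_target=0.999)
print(should_open_incident(hour, five_min))
```

In a real setup, the boolean result would feed whatever mechanism creates notifications and incidents on threshold breach.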

4. Refine continuously

  • Analyze false positives and missed detections
  • Adjust SLIs and thresholds as workload evolves
  • Expand telemetry for new dependencies
  • Use trends to drive preventive improvements
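
False positives and missed detections can be summarized as precision and recall over a review period, which gives the refinement loop two numbers to track. This is a minimal sketch; the `AlertReview` record and the sample counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertReview:
    """Hypothetical tallies from one review period."""
    alerts_fired: int        # every alert that notified someone
    alerts_actionable: int   # alerts that pointed at a real problem
    incidents_total: int     # real incidents, however they were found
    incidents_detected: int  # incidents first caught by an alert


def precision(review: AlertReview) -> float:
    """Share of alerts that were real; low values mean alert fatigue."""
    if review.alerts_fired == 0:
        return 1.0
    return review.alerts_actionable / review.alerts_fired


def recall(review: AlertReview) -> float:
    """Share of incidents that alerting caught; low values mean blind spots."""
    if review.incidents_total == 0:
        return 1.0
    return review.incidents_detected / review.incidents_total


review = AlertReview(alerts_fired=40, alerts_actionable=30,
                     incidents_total=12, incidents_detected=9)
false_positives = review.alerts_fired - review.alerts_actionable
missed = review.incidents_total - review.incidents_detected
print(precision(review), recall(review), false_positives, missed)
```

Low precision suggests thresholds are too sensitive, while low recall suggests missing telemetry or SLIs that do not reflect the failure modes being seen.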

AWS Services to Consider

Amazon CloudWatch

Collects metrics and logs, and provides alarms and dashboards, so teams can detect issues early and track operational outcomes.

AWS X-Ray

Traces distributed requests to identify latency bottlenecks and dependency failures across microservices.

AWS Health Dashboard

Provides service-wide and account-specific health events so operators can respond quickly to AWS events that affect their resources.

Amazon Route 53

Provides DNS routing policies and health checks for latency and availability optimization.

Amazon EventBridge

Routes events between services and triggers automated responses for operational events.
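
As an illustration of event routing, the pattern below selects CloudWatch alarm state changes that enter the ALARM state. The `matches` function is a simplified local stand-in written for this example so the pattern can be tried without an AWS account; it is not the EventBridge matching engine and supports only exact-value lists and nested objects. The alarm name in the sample event is hypothetical.

```python
import json
from typing import Any

# Event pattern for alarms transitioning into the ALARM state.
ALARM_PATTERN: dict[str, Any] = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Simplified matcher: every pattern key must be present in the event,
    and the event value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "sev1-checkout-5xx", "state": {"value": "ALARM"}},
}

print(matches(ALARM_PATTERN, sample_event))
# A rule expects the pattern as a JSON string:
event_pattern_json = json.dumps(ALARM_PATTERN)
```

A rule using this pattern could target a notification topic or an automation that opens an incident.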

Common Challenges and Solutions

Challenge: Technical metrics not tied to customer experience

Solution: Prioritize SLIs that represent user-visible outcomes and map alerts to business impact.

Challenge: Siloed observability data

Solution: Unify dashboards and incident workflows across logs, traces, and metrics.

Challenge: Unclear ownership of SLO breaches

Solution: Assign explicit owners per SLO and define response actions for budget burn rates.