OPS08 - How do you understand the health of your workload?

Best Practices

This question includes the following best practices:

OPS08-BP01 - Analyze workload metrics
OPS08-BP02 - Analyze workload logs
OPS08-BP03 - Analyze workload traces
OPS08-BP04 - Create actionable alerts
OPS08-BP05 - Create dashboards

Key Concepts

Strategy and Governance

Health indicators: The small set of measurable signals, such as availability, latency, error rate, and saturation, that show whether the workload is meeting its expected outcomes. Define a target for each indicator, assign a clear owner, and review results regularly against expected business outcomes.

Customer-impact metrics: Measurements taken from the customer's point of view, such as completed transactions or page load time, rather than from individual resources. Use them to rank alerts and incidents by business impact.

Dependency visibility: Telemetry that covers the services, data stores, and external APIs the workload calls, so that a failure in a dependency can be told apart from a failure in the workload itself.

Operational Execution

SLI/SLO management: A service level indicator (SLI) measures one aspect of service behavior, and a service level objective (SLO) sets the target for it. Define both per service, assign a clear owner, and review attainment regularly.

Anomaly detection: Flags metric behavior that deviates from an established baseline, which catches problems that fixed thresholds miss. Tune sensitivity so that detections stay actionable.

Capacity awareness: Tracks utilization and saturation against known limits and quotas so that teams can act before demand exhausts capacity.
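As one illustration of anomaly detection, the sketch below compares new data points with a baseline using a z-score. It is a minimal example, not a description of any AWS feature: the function name `find_anomalies`, the sample latencies, and the threshold of 3 standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(values: list[float], baseline_size: int,
                   z_threshold: float = 3.0) -> list[int]:
    """Return indexes of points after the baseline that deviate strongly from it."""
    baseline = values[:baseline_size]
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two points")
    mu = mean(baseline)
    sigma = stdev(baseline)
    anomalies = []
    for index in range(baseline_size, len(values)):
        deviation = abs(values[index] - mu)
        if sigma == 0:
            # A perfectly flat baseline: any change at all is unusual.
            is_anomaly = deviation > 0
        else:
            is_anomaly = deviation / sigma > z_threshold
        if is_anomaly:
            anomalies.append(index)
    return anomalies


# Hypothetical p50 latencies in milliseconds; the first seven form the baseline.
latencies_ms = [120, 118, 125, 122, 119, 121, 123, 480, 124]
anomalies = find_anomalies(latencies_ms, baseline_size=7)
print("anomalous indexes:", anomalies)
```

A fixed baseline keeps the example short. A production detector would usually use a rolling window and account for daily or weekly seasonality.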

Implementation Approach

1. Define workload health model

Identify critical user journeys and associated SLIs
Set SLO targets and error budgets per service
Map critical dependencies and failure domains
Define dashboard views for executives and operators
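The relationship between an SLI, an SLO target, and an error budget can be sketched in a few lines. This is a minimal example that assumes a request-based availability SLI; the `Slo` class, the `checkout` service, and the request counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """An availability objective for one service (illustrative only)."""
    service: str
    target: float  # 0.999 means 99.9% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; a window with no traffic counts as healthy."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(slo: Slo, good_requests: int, total_requests: int) -> float:
    """Share of the error budget still unspent: 1.0 is untouched, below 0 is overspent."""
    allowed_failures = (1.0 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # No failures are permitted (no traffic, or a 100% target).
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


checkout = Slo(service="checkout", target=0.999)
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(checkout, 999_500, 1_000_000)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

With a 99.9% target over one million requests, 1,000 failures are allowed, so 500 failures leave roughly half of the budget.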

2. Implement telemetry and dashboards

Collect metrics, logs, and traces for critical components
Create service and workload-level health dashboards
Configure alarms with customer-impact severity levels
Run synthetic checks against key endpoints
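One way to attach customer-impact severity to an alarm is to let the severity decide who gets notified. The sketch below only builds the arguments for a CloudWatch alarm on Application Load Balancer target 5xx responses and makes no AWS call. The severity names, SNS topic ARNs, account ID, load balancer identifier, and threshold are placeholders assumed for the example.

```python
from typing import Any

# Hypothetical mapping from customer-impact severity to an SNS topic ARN.
# The account ID, region, and topic names are placeholders.
SEVERITY_TOPICS = {
    "sev1": "arn:aws:sns:us-east-1:123456789012:page-on-call",
    "sev2": "arn:aws:sns:us-east-1:123456789012:team-notifications",
}


def build_5xx_alarm(service: str, load_balancer: str, severity: str,
                    threshold: float) -> dict[str, Any]:
    """Build keyword arguments for a CloudWatch alarm on target 5xx responses."""
    if severity not in SEVERITY_TOPICS:
        raise ValueError(f"unknown severity: {severity!r}")
    return {
        "AlarmName": f"{severity}-{service}-target-5xx",
        "AlarmDescription": f"{service}: elevated 5xx responses ({severity})",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,                # seconds per datapoint
        "EvaluationPeriods": 5,      # look at the last 5 datapoints
        "DatapointsToAlarm": 3,      # alarm when 3 of those 5 breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [SEVERITY_TOPICS[severity]],
    }


alarm = build_5xx_alarm("checkout", "app/checkout-alb/0123456789abcdef", "sev1", 50)
print(alarm["AlarmName"], "->", alarm["AlarmActions"][0])
```

With credentials configured, a dictionary like this could be passed to boto3 as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Requiring 3 breaching datapoints out of 5 is one way to reduce alarms caused by a single noisy minute.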

3. Operationalize health management

Review SLO performance in recurring ops meetings
Track incidents against error budget policies
Automate notifications and incident creation on threshold breach
Correlate business KPIs with technical health signals
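Tracking incidents against an error budget policy usually means watching how fast the budget is being spent. The sketch below computes a burn rate and applies a two-window paging rule. It is an illustration under stated assumptions: the function names are hypothetical, and the default threshold of 14.4 is an example value (sustained for one hour, it spends 2% of a 30-day budget, since 14.4 / 720 hours = 0.02).

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def should_page(long_window_error_rate: float, short_window_error_rate: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn fast.

    The long window shows the burn is significant; the short window shows
    it is still happening, so a recovered incident does not keep paging.
    """
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)


# Hypothetical error rates for a service with a 99.9% SLO.
ongoing = should_page(long_window_error_rate=0.02, short_window_error_rate=0.03,
                      slo_target=0.999)
recovered = should_page(long_window_error_rate=0.02, short_window_error_rate=0.001,
                        slo_target=0.999)
print("page for ongoing incident:", ongoing)
print("page for recovered incident:", recovered)
```

A burn rate of 1.0 means the budget would be used up exactly at the end of the SLO period. Values above 1.0 exhaust it early.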

4. Refine continuously

Analyze false positives and missed detections
Adjust SLIs and thresholds as workload evolves
Expand telemetry for new dependencies
Use trends to drive preventive improvements
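False positives and missed detections can be summarized as alert precision and recall over a review period. The sketch below assumes that each alert and incident has already been classified by hand; the `AlertReview` class and the counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertReview:
    """Counts from one review period (illustrative only)."""
    true_positives: int    # alerts that corresponded to a real incident
    false_positives: int   # alerts with no real incident behind them
    missed_incidents: int  # real incidents that no alert caught


def precision(review: AlertReview) -> float:
    """Share of fired alerts that were real; low values mean noisy alerting."""
    fired = review.true_positives + review.false_positives
    # With no alerts fired, none were noisy, so report 1.0.
    return review.true_positives / fired if fired else 1.0


def recall(review: AlertReview) -> float:
    """Share of real incidents that were detected; low values mean blind spots."""
    incidents = review.true_positives + review.missed_incidents
    # With no incidents at all, nothing was missed, so report 1.0.
    return review.true_positives / incidents if incidents else 1.0


last_quarter = AlertReview(true_positives=18, false_positives=6, missed_incidents=2)
print(f"precision={precision(last_quarter):.2f}, recall={recall(last_quarter):.2f}")
```

Low precision suggests raising thresholds or requiring more breaching datapoints. Low recall suggests adding SLIs or telemetry for the failure modes that went undetected.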

AWS Services to Consider

Amazon CloudWatch

Collects metrics, logs, alarms, and dashboards so teams can detect issues early and track operational outcomes.

AWS X-Ray

Traces distributed requests to identify latency bottlenecks and dependency failures across microservices.

AWS Health Dashboard

Provides service-level and account-specific health events so operators can respond quickly to AWS incidents that affect their resources.

Amazon Route 53

Provides DNS routing policies and health checks for latency and availability optimization.

Amazon EventBridge

Routes events between services and triggers automated responses for operational events.
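As an example of event routing, an EventBridge rule can select CloudWatch alarm state changes and forward them to a target such as an incident-creation function. The sketch below shows an event pattern for alarms entering the ALARM state. The `matches` helper is not part of any AWS SDK: it is a simplified local stand-in, assumed here only to show how the pattern selects events, and it supports exact-value lists and nested objects only. The sample events are hypothetical.

```python
import json

# Event pattern selecting CloudWatch alarms that transition into ALARM.
ALARM_STATE_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must exist in the event,
    lists mean 'any of these values', and dicts are matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical events, trimmed to the fields the pattern inspects.
alarm_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "checkout-5xx", "state": {"value": "ALARM"}},
}
ok_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "checkout-5xx", "state": {"value": "OK"}},
}

print(json.dumps(ALARM_STATE_PATTERN, indent=2))
print("alarm event matches:", matches(ALARM_STATE_PATTERN, alarm_event))
print("ok event matches:", matches(ALARM_STATE_PATTERN, ok_event))
```

The JSON printed first is the part that would be supplied as the rule's event pattern. Filtering on the state value keeps recoveries (OK) from opening new incidents.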

Common Challenges and Solutions

Challenge: Technical metrics not tied to customer experience

Solution: Prioritize SLIs that represent user-visible outcomes and map alerts to business impact.

Challenge: Siloed observability data

Solution: Unify dashboards and incident workflows across logs, traces, and metrics.

Challenge: Unclear ownership of SLO breaches

Solution: Assign explicit owners per SLO and define response actions for budget burn rates.

Related Resources