OPS09 - How do you understand the health of your operations?

Best Practices

This question includes the following best practices:

OPS09-BP01 - Measure operations goals and KPIs with metrics
OPS09-BP02 - Communicate status and trends to ensure visibility into operation
OPS09-BP03 - Review operations metrics and prioritize improvement

Key Concepts

Strategy and Governance

Operational KPIs: Quantitative measures of how well operations perform, such as MTTR, change failure rate, and toil ratio. Define measurable targets, assign clear ownership, and review results regularly against expected business outcomes.

Incident management quality: How effectively incidents are detected, escalated, and resolved, judged from incident timelines, response data, and follow-through on post-incident actions.

Process efficiency: How quickly and reliably operational workflows complete, measured against process SLAs and used to locate bottlenecks.

Operational Execution

Team performance signals: Indicators of team load and responsiveness, such as alert volume and acknowledgement times.

Automation coverage: The share of remediation actions and runbook steps that run automatically rather than manually.

Continuous improvement metrics: Trends over time, such as reductions in toil and failure rates, that show whether improvement work is paying off.

Implementation Approach

1. Define operational performance model

Select KPIs such as MTTR, change failure rate, and toil ratio
Set targets for incident response and remediation quality
Define process SLAs for operational workflows
Map metrics to business-critical services
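
The KPIs named in this step can be computed from simple records. The following is a minimal sketch, not a prescribed implementation: the `Incident` record, its field names, and the choice to measure MTTR from detection to resolution are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; adapt to your incident tooling.
    detected_at: datetime
    resolved_at: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recovery: average of (resolved - detected)."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta(0))
    return total / len(incidents)


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Fraction of deployed changes that led to a failure."""
    return failed_changes / total_changes if total_changes else 0.0


def toil_ratio(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on repetitive manual work."""
    return toil_hours / total_hours if total_hours else 0.0
```

Returning zero for empty inputs keeps dashboards from failing in quiet periods, though some teams may prefer to report "no data" instead.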

2. Instrument operational workflows

Capture incident timeline and response data
Measure alert load and acknowledgement times
Track manual versus automated remediation actions
Create dashboards for operational leadership
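
Alert acknowledgement times and the manual-versus-automated split can be summarized from exported alert and remediation records. This sketch assumes a hypothetical alert shape (`fired_at` and `acked_at` as epoch seconds, with `acked_at` set to `None` when nobody acknowledged) and remediation actions tagged with the strings `"automated"` or `"manual"`.

```python
from statistics import median


def ack_time_summary(alerts: list[dict]) -> dict:
    """Summarize alert load and acknowledgement delay in seconds."""
    delays = [
        a["acked_at"] - a["fired_at"]
        for a in alerts
        if a.get("acked_at") is not None
    ]
    return {
        "alert_count": len(alerts),
        "unacknowledged": len(alerts) - len(delays),
        "median_ack_seconds": median(delays) if delays else None,
    }


def automation_coverage(actions: list[str]) -> float:
    """Share of remediation actions tagged 'automated'."""
    if not actions:
        return 0.0
    return actions.count("automated") / len(actions)
```

The median is used here because a few very slow acknowledgements can distort a mean; a percentile such as p90 would be a reasonable alternative.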

3. Improve operational execution

Identify repeated failure patterns and process bottlenecks
Automate high-volume repetitive runbook steps
Refine escalation policies by severity and ownership
Integrate post-incident actions into sprint planning
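
One simple way to surface repeated failure patterns is to count incidents by service and cause. The sketch below assumes each incident is a dictionary with hypothetical `service` and `cause` keys; real incident data usually needs cause categories to be normalized first.

```python
from collections import Counter


def repeated_patterns(incidents: list[dict], min_count: int = 2) -> list[tuple]:
    """Return (service, cause) pairs seen at least min_count times, most frequent first."""
    counts = Counter((i["service"], i["cause"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Pairs that recur are natural candidates for runbook automation or for post-incident actions carried into sprint planning.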

4. Review and govern

Run monthly operations health reviews
Benchmark teams against standardized KPIs
Reward measurable reduction in toil and failure rates
Update operating procedures based on trend analysis
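
A monthly review can compare each KPI's latest value with both its target and the previous month. This is an illustrative sketch: the function name, the result keys, and the `lower_is_better` flag are assumptions, not part of any AWS tooling.

```python
def review_kpi(name: str, monthly_values: list[float], target: float,
               lower_is_better: bool = True) -> dict:
    """Compare the latest monthly value with the target and the prior month."""
    latest = monthly_values[-1]
    previous = monthly_values[-2] if len(monthly_values) > 1 else latest
    change = latest - previous
    improving = change < 0 if lower_is_better else change > 0
    on_target = latest <= target if lower_is_better else latest >= target
    return {
        "kpi": name,
        "latest": latest,
        "change": change,
        "improving": improving,
        "on_target": on_target,
    }
```

For example, `review_kpi("mttr_minutes", [95, 80, 70], target=60)` reports a KPI that is improving but not yet on target, which is a useful distinction when deciding where to focus.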

AWS Services to Consider

Amazon CloudWatch

Collects metrics, logs, alarms, and dashboards so teams can detect issues early and track operational outcomes.
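
Operational KPIs can be published to CloudWatch as custom metrics. The sketch below builds a payload for the boto3 `put_metric_data` call; the namespace `Ops/Health`, the metric names, and the `Service` dimension are hypothetical choices for illustration, and the publish step assumes boto3 is installed and AWS credentials are configured.

```python
def build_metric_data(service: str, mttr_seconds: float,
                      change_failure_pct: float) -> list[dict]:
    """Build a MetricData list for CloudWatch put_metric_data."""
    dimensions = [{"Name": "Service", "Value": service}]
    return [
        {"MetricName": "MTTR", "Dimensions": dimensions,
         "Value": mttr_seconds, "Unit": "Seconds"},
        {"MetricName": "ChangeFailureRate", "Dimensions": dimensions,
         "Value": change_failure_pct, "Unit": "Percent"},
    ]


def publish(metric_data: list[dict], namespace: str = "Ops/Health") -> None:
    """Send the payload to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the payload builder works without AWS access

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )
```

Keeping payload construction separate from the API call makes the metric definitions easy to test without touching AWS.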

AWS Systems Manager Incident Manager

Helps prepare response plans, escalation paths, and timeline tracking during incidents.

Amazon EventBridge

Routes events between services and triggers automated responses for operational events.

AWS Lambda

Runs event-driven code without managing servers, which suits automation and on-demand operational workflows.
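
As a rough illustration of event-driven automation, the sketch below shows an EventBridge event pattern that matches CloudWatch alarms entering the ALARM state, and a Lambda handler that extracts the fields an operator would need. The `action` values (`start_runbook`, `ignore`) are hypothetical placeholders for whatever remediation step a team wires in.

```python
# Event pattern for an EventBridge rule: CloudWatch alarms entering ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def handler(event: dict, context) -> dict:
    """Lambda entry point: summarize an alarm state change event."""
    detail = event.get("detail", {})
    state = detail.get("state", {})
    record = {
        "alarm": detail.get("alarmName"),
        "state": state.get("value"),
        "reason": state.get("reason"),
        "at": state.get("timestamp"),
    }
    # Hypothetical decision: only alarms in ALARM state trigger a runbook.
    record["action"] = "start_runbook" if record["state"] == "ALARM" else "ignore"
    return record
```

The handler re-checks the state even though the rule filters on it, so it behaves safely if it is later attached to a broader rule.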

AWS Well-Architected Tool

Captures workload reviews, risks, and improvement plans so teams can continuously track architecture quality.

Common Challenges and Solutions

Challenge: Measuring only system uptime

Solution: Include process and team effectiveness metrics to get a complete operational health view.

Challenge: Inconsistent incident handling

Solution: Standardize incident command structures and post-incident templates across teams.

Challenge: No visibility into operational toil

Solution: Track repetitive manual effort and prioritize automation with the highest impact.
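
To decide which manual tasks to automate first, a team can rank them by the time they consume each month. This sketch assumes a hypothetical task record with `name`, `runs_per_month`, and `minutes_per_run`; it ignores automation cost, which a fuller model would weigh against the savings.

```python
def monthly_minutes(task: dict) -> float:
    """Total manual minutes a task consumes per month."""
    return task["runs_per_month"] * task["minutes_per_run"]


def rank_toil(tasks: list[dict]) -> list[dict]:
    """Order tasks by monthly manual effort, highest first."""
    return sorted(tasks, key=monthly_minutes, reverse=True)
```

A frequent five-minute task can outrank a rare hour-long one, which is often the insight that makes toil visible.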
