OPS09 - How do you understand the health of your operations?

Best Practices

This question includes the following best practices:

Key Concepts

Strategy and Governance

Operational KPIs: Quantitative measures of how well operations run, such as MTTR, change failure rate, and toil ratio. Give each KPI a measurable target and a clear owner, and review results regularly against expected business outcomes.

Incident management quality: How quickly and consistently incidents are acknowledged, escalated, and remediated. Set targets for response and remediation quality, assign ownership of the incident process, and review performance after each incident and at regular intervals.

Process efficiency: How well operational workflows perform against their process SLAs, and where bottlenecks slow them down. Define an SLA for each workflow, assign an owner, and review results regularly against expected business outcomes.

Operational Execution

Team performance signals: Indicators of team load and responsiveness, such as alert volume and acknowledgement times. Define measurable targets, assign clear ownership, and review the signals regularly so that overload is caught early.

Automation coverage: The share of remediation actions and runbook steps that run automatically rather than manually. Set a target for coverage, assign an owner, and review progress regularly against expected business outcomes.

Continuous improvement metrics: Trends that show whether operations are improving over time, such as reductions in toil and failure rates and the completion of post-incident actions. Define targets, assign ownership, and review the trends on a regular schedule.

Implementation Approach

1. Define operational performance model

  • Select KPIs such as MTTR, change failure rate, and toil ratio
  • Set targets for incident response and remediation quality
  • Define process SLAs for operational workflows
  • Map metrics to business-critical services
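The three KPIs named in this step can be sketched as simple calculations. This is a minimal illustration, not a prescribed formula set: it assumes MTTR means mean time to resolve, and the `Incident` record and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; real incident tools expose more fields.
    opened_at: datetime
    resolved_at: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    """Average of (resolved_at - opened_at); zero if there are no incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.opened_at for i in incidents), timedelta(0))
    return total / len(incidents)


def change_failure_rate(total_changes: int, failed_changes: int) -> float:
    """Fraction of deployed changes that led to a failure."""
    return failed_changes / total_changes if total_changes else 0.0


def toil_ratio(toil_hours: float, total_hours: float) -> float:
    """Fraction of operational time spent on manual, repetitive work."""
    return toil_hours / total_hours if total_hours else 0.0
```

Comparing each result with its target (for example, an MTTR target per business-critical service) turns these numbers into the performance model described above.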

2. Instrument operational workflows

  • Capture incident timeline and response data
  • Measure alert load and acknowledgement times
  • Track manual versus automated remediation actions
  • Create dashboards for operational leadership
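As a rough sketch of the instrumentation above, acknowledgement times and the manual versus automated split can be derived from incident timeline events. The event tuples and the labels `"triggered"`, `"acknowledged"`, and `"automated"` are assumptions for illustration, and events are assumed to arrive in chronological order.

```python
from datetime import datetime
from statistics import median


def ack_seconds(events):
    """Return {incident_id: seconds from first 'triggered' to first 'acknowledged'}."""
    triggered, acknowledged = {}, {}
    for incident_id, event_type, timestamp in events:
        if event_type == "triggered":
            triggered.setdefault(incident_id, timestamp)
        elif event_type == "acknowledged":
            acknowledged.setdefault(incident_id, timestamp)
    # Incidents that were never acknowledged are left out of the result.
    return {
        incident_id: (acknowledged[incident_id] - start).total_seconds()
        for incident_id, start in triggered.items()
        if incident_id in acknowledged
    }


def automated_share(remediation_actions):
    """Fraction of remediation actions labelled 'automated'."""
    if not remediation_actions:
        return 0.0
    automated = sum(1 for kind in remediation_actions if kind == "automated")
    return automated / len(remediation_actions)


def median_ack_seconds(events):
    """Median acknowledgement time, or None when nothing was acknowledged."""
    values = list(ack_seconds(events).values())
    return median(values) if values else None
```

The median is used here instead of the mean so that a single very slow acknowledgement does not dominate a dashboard figure; either choice is defensible.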

3. Improve operational execution

  • Identify repeated failure patterns and process bottlenecks
  • Automate high-volume repetitive runbook steps
  • Refine escalation policies by severity and ownership
  • Integrate post-incident actions into sprint planning
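One simple way to surface repeated failure patterns is to count incidents by service and cause. The sketch below assumes each incident has been tagged with hypothetical `service` and `cause` fields during post-incident review.

```python
from collections import Counter


def repeated_patterns(incidents, min_count=2):
    """Return ((service, cause), count) pairs seen at least min_count times,
    most frequent first."""
    counts = Counter((i["service"], i["cause"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Patterns at the top of this list are natural candidates for runbook automation or for post-incident action items in sprint planning.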

4. Review and govern

  • Run monthly operations health reviews
  • Benchmark teams against standardized KPIs
  • Reward measurable reduction in toil and failure rates
  • Update operating procedures based on trend analysis
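A monthly review can be supported by a small check that compares each KPI with its target and with the previous month. This is only a sketch: the function name, the result fields, and the `lower_is_better` flag are illustrative assumptions.

```python
def review_kpi(name, monthly_values, target, lower_is_better=True):
    """Summarize the latest value of a KPI against its target and prior month."""
    if not monthly_values:
        raise ValueError("monthly_values must contain at least one value")
    latest = monthly_values[-1]
    previous = monthly_values[-2] if len(monthly_values) > 1 else latest

    # Normalize so that a negative change always means improvement.
    change = latest - previous
    if not lower_is_better:
        change = -change

    if change < 0:
        trend = "improving"
    elif change > 0:
        trend = "worsening"
    else:
        trend = "flat"

    meets_target = latest <= target if lower_is_better else latest >= target
    return {"kpi": name, "latest": latest, "meets_target": meets_target, "trend": trend}
```

For example, an MTTR series of 62, 55, and 48 minutes against a 45-minute target would be reported as improving but not yet meeting the target.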

AWS Services to Consider

Amazon CloudWatch

Collects metrics, logs, alarms, and dashboards so teams can detect issues early and track operational outcomes.
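Operational KPIs can be sent to CloudWatch as custom metrics. The sketch below builds entries in the shape accepted by the `put_metric_data` API; the namespace `Custom/OperationsHealth`, the `Service` dimension, and the helper names are assumptions for illustration. The publish step needs boto3 and AWS credentials, so it is kept separate from the builder.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit, service, timestamp=None):
    """Build one MetricData entry for a KPI, tagged with a Service dimension."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "Service", "Value": service}],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,  # e.g. "Seconds", "Percent", "Count"
    }


def publish(datums, namespace="Custom/OperationsHealth"):
    """Send the entries to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=datums)
```

Once published, the metrics can be graphed on dashboards and used in alarms like any other CloudWatch metric.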

AWS Systems Manager Incident Manager

Helps prepare response plans, escalation paths, and timeline tracking during incidents.

Amazon EventBridge

Routes events between services and triggers automated responses for operational events.

AWS Lambda

Runs event-driven code without servers to manage, which suits automated remediation and on-demand operational workflows.
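A Lambda function invoked by an EventBridge rule might route CloudWatch alarm state changes to a runbook, as in this minimal sketch. The `RUNBOOKS` mapping, the alarm name prefixes, and the returned fields are hypothetical; the handler only decides what to do and makes no AWS calls.

```python
# Hypothetical mapping from alarm name prefix to a runbook identifier.
RUNBOOKS = {
    "disk-": "expand-volume",
    "queue-": "scale-consumers",
}


def handler(event, context):
    """Decide how to respond to a CloudWatch alarm event from EventBridge."""
    detail = event.get("detail", {})
    alarm_name = detail.get("alarmName", "")
    state = detail.get("state", {}).get("value")

    # Only act on alarms that have just entered the ALARM state.
    if event.get("detail-type") != "CloudWatch Alarm State Change" or state != "ALARM":
        return {"action": "ignore", "alarm": alarm_name}

    for prefix, runbook in RUNBOOKS.items():
        if alarm_name.startswith(prefix):
            return {"action": "automate", "runbook": runbook, "alarm": alarm_name}

    # No known runbook: hand the alarm to a human responder.
    return {"action": "escalate", "alarm": alarm_name}
```

Logging the returned `action` for each event also gives a direct count of automated versus escalated responses.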

AWS Well-Architected Tool

Captures workload reviews, risks, and improvement plans so teams can continuously track architecture quality.

Common Challenges and Solutions

Challenge: Measuring only system uptime

Solution: Include process and team effectiveness metrics to get a complete operational health view.

Challenge: Inconsistent incident handling

Solution: Standardize incident command structures and post-incident templates across teams.

Challenge: No visibility into operational toil

Solution: Track repetitive manual effort and prioritize automation with the highest impact.
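To prioritize automation by impact, one option is to rank manual tasks by the time they consume each month. The field names `runs_per_month` and `minutes_per_run` below are assumptions about how toil is logged.

```python
def monthly_minutes(task):
    """Estimated manual effort per month for one task."""
    return task["runs_per_month"] * task["minutes_per_run"]


def rank_toil(tasks):
    """Return tasks sorted by estimated monthly manual minutes, highest first."""
    return sorted(tasks, key=monthly_minutes, reverse=True)
```

Time saved is only one measure of impact; teams may also weight tasks by error risk or by how often they interrupt on-call engineers.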