OPS07 - How do you know that you are ready to support a workload?

Best Practices

This question includes the following best practices:

Key Concepts

Strategy and Governance

Operational readiness: The degree to which a team is prepared to support a workload in production. Document minimum readiness criteria, assign a clear owner to each, and review results regularly against expected business outcomes.

Runbook completeness: Runbooks should cover both normal and failure operations. Define runbook quality standards, assign an owner to each runbook, and validate runbooks through drills before launch.

Support model design: Define the on-call model, escalation path, and ownership, including what tier-1 and tier-2 responders are expected to handle. Review the model regularly against expected business outcomes.

Operational Execution

Game days and drills: Run tabletop exercises for key incident scenarios and test backup, restore, and failover procedures. Assign an owner to each finding and track it to closure.

Escalation readiness: Confirm that escalation paths, contacts, access controls, and break-glass procedures work before an incident depends on them. Review them regularly and retire obsolete contacts.

Service launch criteria: Use launch checklists and require support handoff signoff from engineering and operations before a workload goes to production. Define the criteria in measurable terms so the launch decision is unambiguous.

Implementation Approach

1. Define readiness standards

  • Document minimum readiness criteria for production support
  • Ensure runbooks cover normal and failure operations
  • Define on-call model, escalation path, and ownership
  • Set service-level objectives and alerting expectations
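
One way to make the readiness standards above measurable is to record them as data, so unmet criteria can be listed rather than discussed from memory. The following is a minimal sketch, not a prescribed format: the class names, the workload name, and the owner names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReadinessCriterion:
    """One minimum requirement a workload must meet before production support."""
    name: str
    owner: str          # person or team accountable for the criterion
    met: bool = False


@dataclass
class ReadinessStandard:
    workload: str
    criteria: list[ReadinessCriterion] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Return the names of criteria that are not yet met."""
        return [c.name for c in self.criteria if not c.met]

    def is_ready(self) -> bool:
        # Ready only when at least one criterion exists and none are unmet.
        return bool(self.criteria) and not self.gaps()


# Hypothetical workload and owners, for illustration only.
standard = ReadinessStandard(
    workload="orders-api",
    criteria=[
        ReadinessCriterion("Runbooks cover normal and failure operations", "ops-team", met=True),
        ReadinessCriterion("On-call model and escalation path defined", "ops-team", met=True),
        ReadinessCriterion("Service-level objectives and alerting set", "eng-team", met=False),
    ],
)

print(standard.is_ready())
print(standard.gaps())
```

Keeping an owner on every criterion means each gap reported by `gaps()` already has someone accountable for closing it.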

2. Validate operational artifacts

  • Run tabletop exercises for key incident scenarios
  • Test backup, restore, and failover procedures
  • Confirm dashboards and alarms are complete and actionable
  • Verify access controls and break-glass procedures
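
Part of confirming that alarms are actionable can be checked mechanically: an alarm with no action attached, or with actions disabled, notifies no one. The sketch below assumes alarm records shaped like the `MetricAlarms` entries returned by the Amazon CloudWatch `DescribeAlarms` API (`AlarmName`, `AlarmActions`, `ActionsEnabled`). The function name and the sample data are hypothetical, and the check says nothing about whether the alarm threshold itself is meaningful.

```python
def non_actionable_alarms(alarms: list[dict]) -> list[str]:
    """Return names of alarms that would notify no one when they fire."""
    flagged = []
    for alarm in alarms:
        has_actions = bool(alarm.get("AlarmActions"))
        enabled = alarm.get("ActionsEnabled", False)
        if not (has_actions and enabled):
            flagged.append(alarm["AlarmName"])
    return flagged


# Hypothetical sample data; in practice these records would come from an
# inventory of the workload's alarms.
sample_alarms = [
    {"AlarmName": "orders-api-5xx", "AlarmActions": ["arn:aws:sns:example"], "ActionsEnabled": True},
    {"AlarmName": "orders-api-latency", "AlarmActions": [], "ActionsEnabled": True},
    {"AlarmName": "orders-api-queue-depth", "AlarmActions": ["arn:aws:sns:example"], "ActionsEnabled": False},
]

print(non_actionable_alarms(sample_alarms))
```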

3. Launch with guardrails

  • Use launch checklists before production changes
  • Require support handoff signoff from engineering and operations
  • Ensure knowledge transfer for tier-1 and tier-2 responders
  • Run post-launch hypercare for critical workloads
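
The launch guardrails above can be expressed as a simple gate that reports what still blocks a launch. This is an illustrative sketch only: the function name, the role labels, and the checklist items are assumptions, not part of any AWS service.

```python
# Roles whose signoff is required before launch, per the handoff step above.
REQUIRED_SIGNOFFS = {"engineering", "operations"}


def launch_blockers(checklist: dict[str, bool], signoffs: set[str]) -> list[str]:
    """Return human-readable reasons the launch cannot proceed (empty if none)."""
    blockers = [
        f"checklist item incomplete: {item}"
        for item, done in checklist.items()
        if not done
    ]
    blockers += [
        f"missing signoff: {role}"
        for role in sorted(REQUIRED_SIGNOFFS - signoffs)
    ]
    return blockers


# Hypothetical checklist for illustration.
checklist = {
    "Runbooks reviewed": True,
    "Tier-1 and tier-2 knowledge transfer complete": False,
    "Hypercare rota staffed": True,
}

for reason in launch_blockers(checklist, signoffs={"engineering"}):
    print(reason)
```

Returning the list of reasons, rather than a bare yes or no, gives the team an immediate view of what remains before signoff.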

4. Continuously assess readiness

  • Audit readiness quarterly or after major architecture changes
  • Track unresolved readiness gaps as backlog items
  • Use incident findings to update support standards
  • Retire obsolete runbooks and contacts
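
The audit trigger described above (quarterly, or after a major architecture change) reduces to a small rule. The sketch below approximates a quarter as 90 days, which is an assumption; the function and parameter names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
AUDIT_INTERVAL = timedelta(days=90)


def audit_due(last_audit: date, today: date, major_change_since_audit: bool) -> bool:
    """A readiness audit is due after a major change or once the interval has elapsed."""
    return major_change_since_audit or (today - last_audit) >= AUDIT_INTERVAL


print(audit_due(date(2024, 1, 1), date(2024, 2, 1), major_change_since_audit=False))
print(audit_due(date(2024, 1, 1), date(2024, 2, 1), major_change_since_audit=True))
```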

AWS Services to Consider

AWS Systems Manager

Provides operational automation, inventory, and runbooks to reduce manual effort and improve day-2 operations.

AWS Systems Manager Incident Manager

Helps prepare response plans, escalation paths, and timeline tracking during incidents.

Amazon CloudWatch

Collects metrics, logs, alarms, and dashboards so teams can detect issues early and track operational outcomes.

AWS Well-Architected Tool

Captures workload reviews, risks, and improvement plans so teams can continuously track architecture quality.

AWS Config

Tracks resource configuration changes and evaluates compliance against operational policies.

Common Challenges and Solutions

Challenge: Incomplete runbooks

Solution: Define runbook quality standards and require validation through drills before launch.

Challenge: On-call overload

Solution: Improve alert quality and automate repetitive actions to reduce unnecessary pager volume.
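
To decide which alerts to improve first, one option is to measure how often each alarm's pages led to real action. The sketch below assumes responders record, for each page, whether it was actionable. The function name, the thresholds, and the sample data are hypothetical and should be tuned to the team's own paging history.

```python
from collections import Counter


def noisy_alarms(
    pages: list[tuple[str, bool]],
    min_pages: int = 3,
    max_actionable_ratio: float = 0.5,
) -> list[str]:
    """Return alarms that paged at least min_pages times and were mostly not actionable."""
    total = Counter()
    actionable = Counter()
    for alarm_name, was_actionable in pages:
        total[alarm_name] += 1
        if was_actionable:
            actionable[alarm_name] += 1
    return sorted(
        name
        for name, count in total.items()
        if count >= min_pages and actionable[name] / count < max_actionable_ratio
    )


# Hypothetical paging history: (alarm name, whether the page required action).
pages = (
    [("disk-usage", False)] * 4
    + [("disk-usage", True)]
    + [("api-5xx", True)] * 3
    + [("cpu-high", False)] * 2
)

print(noisy_alarms(pages))
```

Alarms flagged this way are candidates for threshold tuning or for automating the repetitive response instead of paging a person.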

Challenge: Gaps after major changes

Solution: Make readiness re-assessment mandatory after architecture or dependency changes.