OPS07 - How do you know that you are ready to support a workload?
Best Practices
This question includes the following best practices:
- OPS07-BP01: Ensure personnel capability
- OPS07-BP02: Ensure a consistent review of operational readiness
- OPS07-BP03: Use runbooks to perform procedures
- OPS07-BP04: Use playbooks to investigate issues
- OPS07-BP05: Make informed decisions to deploy systems and changes
- OPS07-BP06: Create support plans for production workloads
Key Concepts
Strategy and Governance
Operational readiness: The state in which a workload meets documented minimum criteria for production support. Express the criteria as measurable targets, assign an owner to each, and review them regularly against expected business outcomes.
Runbook completeness: Runbooks cover both normal and failure operations and have been validated through drills before launch. Track coverage per procedure and assign each runbook an owner who keeps it current.
Support model design: The on-call model, escalation path, and ownership for the workload, including which responders handle tier-1 and tier-2 issues. Review the model whenever team structure or workload dependencies change.
Operational Execution
Game days and drills: Tabletop exercises and live tests of backup, restore, and failover procedures for key incident scenarios. Record the outcome of each exercise and feed the gaps it exposes back into runbooks and support standards.
Escalation readiness: Escalation paths are defined, contacts are current, and access controls and break-glass procedures have been verified before they are needed in an incident.
Service launch criteria: The launch checklist and support handoff signoff from engineering and operations that a workload or change must pass before it reaches production.
Implementation Approach
1. Define readiness standards
- Document minimum readiness criteria for production support
- Ensure runbooks cover normal and failure operations
- Define on-call model, escalation path, and ownership
- Set service-level objectives and alerting expectations
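The readiness standards above can be recorded as data so that gaps are easy to list. The following is a minimal sketch, not a prescribed format: the `ReadinessCriterion` type, the criterion keys, and the owner names are all hypothetical and chosen only to mirror the four bullets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadinessCriterion:
    """One minimum requirement a workload must meet before production support."""
    key: str          # hypothetical identifier used in status reports
    description: str  # what must be true
    owner: str        # team accountable for closing the gap (assumed names)


# Hypothetical criteria mirroring the four bullets above.
CRITERIA = [
    ReadinessCriterion("readiness_doc", "Minimum readiness criteria documented", "operations"),
    ReadinessCriterion("runbooks", "Runbooks cover normal and failure operations", "engineering"),
    ReadinessCriterion("on_call", "On-call model, escalation path, and ownership defined", "operations"),
    ReadinessCriterion("slo_alerting", "SLOs and alerting expectations set", "engineering"),
]


def unmet_criteria(status: dict[str, bool]) -> list[ReadinessCriterion]:
    """Return criteria not marked True; a missing key counts as unmet."""
    return [c for c in CRITERIA if not status.get(c.key, False)]


# Example: one criterion explicitly failed, one never reported.
example_status = {"readiness_doc": True, "runbooks": True, "on_call": False}
for gap in unmet_criteria(example_status):
    print(f"GAP [{gap.owner}] {gap.description}")
```

Treating a missing key as unmet is a deliberate choice here: a criterion nobody has reported on should block readiness rather than pass silently.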
2. Validate operational artifacts
- Run tabletop exercises for key incident scenarios
- Test backup, restore, and failover procedures
- Confirm dashboards and alarms are complete and actionable
- Verify access controls and break-glass procedures
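One way to check that alarms are "complete and actionable" is to flag any alarm that would notify nobody or that gives the responder no guidance. The sketch below works on plain dictionaries shaped like the alarm records Amazon CloudWatch returns (`AlarmName`, `ActionsEnabled`, `AlarmActions`, `AlarmDescription`). The `runbook:` marker in the description is an assumed team convention, not a CloudWatch feature, and the sample ARN and URL are placeholders.

```python
def non_actionable_alarms(alarms: list[dict]) -> list[str]:
    """Return names of alarms that would not notify anyone or lack guidance.

    An alarm is flagged when actions are disabled, no alarm action is
    configured, or its description carries no "runbook:" marker (an assumed
    convention for linking a runbook, not something CloudWatch enforces).
    """
    flagged = []
    for alarm in alarms:
        has_action = bool(alarm.get("ActionsEnabled")) and bool(alarm.get("AlarmActions"))
        description = (alarm.get("AlarmDescription") or "").lower()
        has_runbook = "runbook:" in description
        if not (has_action and has_runbook):
            flagged.append(alarm["AlarmName"])
    return flagged


# Placeholder records; in practice these could come from an alarm export.
sample_alarms = [
    {
        "AlarmName": "api-5xx-rate",
        "ActionsEnabled": True,
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-page"],
        "AlarmDescription": "Runbook: https://wiki.example.com/runbooks/api-5xx",
    },
    {
        "AlarmName": "disk-usage",
        "ActionsEnabled": True,
        "AlarmActions": [],  # nobody is notified
        "AlarmDescription": "Runbook: https://wiki.example.com/runbooks/disk",
    },
    {
        "AlarmName": "queue-depth",
        "ActionsEnabled": True,
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-page"],
        "AlarmDescription": None,  # no guidance for the responder
    },
]

print(non_actionable_alarms(sample_alarms))
```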
3. Launch with guardrails
- Use launch checklists before production changes
- Require support handoff signoff from engineering and operations
- Ensure knowledge transfer for tier-1 and tier-2 responders
- Run post-launch hypercare for critical workloads
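The launch guardrails above amount to a gate: a change proceeds only when every checklist item is complete and both engineering and operations have signed off. A minimal sketch of such a gate follows. The `LaunchRequest` type and the checklist item names are hypothetical; only the two required signoff roles come from the text.

```python
from dataclasses import dataclass, field

# The two parties the text requires signoff from.
REQUIRED_SIGNOFFS = {"engineering", "operations"}


@dataclass
class LaunchRequest:
    """A proposed production launch (hypothetical structure)."""
    workload: str
    checklist: dict[str, bool]                      # item name -> completed?
    signoffs: set[str] = field(default_factory=set)  # teams that have approved


def launch_blockers(request: LaunchRequest) -> list[str]:
    """List everything that still prevents launch; an empty list means go."""
    blockers = [
        f"checklist item incomplete: {item}"
        for item, done in sorted(request.checklist.items())
        if not done
    ]
    blockers += [
        f"missing signoff: {team}"
        for team in sorted(REQUIRED_SIGNOFFS - request.signoffs)
    ]
    return blockers


request = LaunchRequest(
    workload="orders-api",  # placeholder name
    checklist={"knowledge transfer done": True, "hypercare rota staffed": False},
    signoffs={"engineering"},
)
for blocker in launch_blockers(request):
    print(blocker)
```

Returning the list of blockers rather than a single boolean gives the team a concrete to-do list when a launch is held back.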
4. Continuously assess readiness
- Audit readiness quarterly or after major architecture changes
- Track unresolved readiness gaps as backlog items
- Use incident findings to update support standards
- Retire obsolete runbooks and contacts
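The reassessment rule in this step (quarterly, or after a major architecture change) and the retirement of obsolete runbooks can both be reduced to date comparisons. The sketch below approximates a quarter as 90 days, which is an assumption, and uses hypothetical function names and sample runbook names.

```python
from datetime import date, timedelta
from typing import Optional

# "Quarterly" approximated as 90 days (an assumption, adjust as needed).
REVIEW_INTERVAL = timedelta(days=90)


def readiness_review_due(
    last_review: date,
    today: date,
    last_major_change: Optional[date] = None,
) -> bool:
    """True if a review is overdue or a major change postdates the last review."""
    if last_major_change is not None and last_major_change > last_review:
        return True
    return today - last_review >= REVIEW_INTERVAL


def stale_runbooks(last_verified: dict[str, date], today: date) -> list[str]:
    """Names of runbooks not verified within the review interval, sorted."""
    return sorted(
        name
        for name, verified in last_verified.items()
        if today - verified >= REVIEW_INTERVAL
    )


today = date(2024, 2, 1)
print(readiness_review_due(date(2024, 1, 10), today))
print(readiness_review_due(date(2024, 1, 10), today, last_major_change=date(2024, 1, 20)))
print(stale_runbooks({"failover-db": date(2023, 6, 1), "restart-api": date(2024, 1, 15)}, today))
```

A stale runbook is a candidate for review first, then for update or retirement; the function only surfaces candidates and does not decide which applies.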
AWS Services to Consider
AWS Systems Manager
Provides operational automation, inventory, and runbooks to reduce manual effort and improve day-2 operations.
AWS Systems Manager Incident Manager
Helps prepare response plans, escalation paths, and timeline tracking during incidents.
Amazon CloudWatch
Collects metrics, logs, alarms, and dashboards so teams can detect issues early and track operational outcomes.
AWS Well-Architected Tool
Captures workload reviews, risks, and improvement plans so teams can continuously track architecture quality.
AWS Config
Tracks resource configuration changes and evaluates compliance against operational policies.
Common Challenges and Solutions
Challenge: Incomplete runbooks
Solution: Define runbook quality standards and require validation through drills before launch.
Challenge: On-call overload
Solution: Improve alert quality and automate repetitive actions to reduce unnecessary pager volume.
Challenge: Gaps after major changes
Solution: Make readiness re-assessment mandatory after architecture or dependency changes.
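For the on-call overload challenge above, "alert quality" can be made measurable by tracking, per alarm, how often a page actually required action. The following sketch assumes each page is logged as an `(alarm_name, was_actionable)` pair. That log format, the function name, and both thresholds are illustrative assumptions, not part of any AWS service.

```python
from collections import Counter


def noisy_alarms(
    pages: list[tuple[str, bool]],
    min_pages: int = 5,
    max_actionable_ratio: float = 0.5,
) -> list[str]:
    """Return alarms that paged often but rarely needed action.

    An alarm is reported when it paged at least `min_pages` times and the
    share of actionable pages is below `max_actionable_ratio`.
    """
    total: Counter = Counter()
    actionable: Counter = Counter()
    for alarm_name, was_actionable in pages:
        total[alarm_name] += 1
        if was_actionable:
            actionable[alarm_name] += 1
    return sorted(
        name
        for name, count in total.items()
        if count >= min_pages and actionable[name] / count < max_actionable_ratio
    )


# Placeholder page log for one review period.
page_log = (
    [("disk-80pct", False)] * 6   # frequent, never actionable
    + [("api-5xx", True)] * 5     # frequent, always actionable
    + [("cpu-high", False)] * 2   # too few pages to judge
)
print(noisy_alarms(page_log))
```

Alarms this surfaces are candidates for retuning, demotion to a non-paging notification, or automation of the repetitive response.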