OPS10-BP01 - Use a process for event, incident, and problem management
Implementation Guidance
Treat event, incident, and problem management as a standard operating capability with explicit scope, controls, and validation checkpoints, and embed it into day-to-day engineering and operations workflows. In common usage, an event is any observed change of state, an incident is an event that disrupts or degrades service and needs a response, and a problem is the underlying cause of one or more incidents.
To answer the question “How do you manage workload and operations events?”, define measurable outcomes, assign owners, and review execution regularly. Integrate the process into delivery and operations so that improvements persist as workloads and requirements evolve.
Key Steps
- Define implementation scope and outcomes:
  - Set explicit success criteria for the event, incident, and problem management process
  - Identify dependencies, prerequisites, and sequencing constraints
  - Assign accountable owners for execution and maintenance
- Implement with standards and validation:
  - Use reusable templates and runbooks for consistent execution
  - Validate implementation with tests, checks, or controlled rollouts
  - Capture telemetry to confirm adoption and effectiveness
- Operate and iterate:
  - Review outcomes against KPIs on a recurring schedule
  - Fix recurring failure modes and process bottlenecks
  - Update implementation guidance based on operational learning
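The recurring KPI review in the last step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the `IncidentRecord` shape, the `review_kpis` helper, the sample incidents, and the 5-minute and 120-minute targets are all hypothetical and should be replaced with whatever your ticketing tool exports and whatever targets your team has agreed on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    """Hypothetical record shape; adapt to your ticketing tool's export."""
    incident_id: str
    owner: str
    opened_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def minutes(delta: timedelta) -> float:
    """Convert a time difference to minutes."""
    return delta.total_seconds() / 60


def review_kpis(records, max_ack_minutes, max_resolve_minutes):
    """Compute mean time to acknowledge/resolve and compare against targets."""
    if not records:
        raise ValueError("no incident records to review")
    mtta = mean(minutes(r.acknowledged_at - r.opened_at) for r in records)
    mttr = mean(minutes(r.resolved_at - r.opened_at) for r in records)
    return {
        "mtta_minutes": mtta,
        "mttr_minutes": mttr,
        "ack_target_met": mtta <= max_ack_minutes,
        "resolve_target_met": mttr <= max_resolve_minutes,
    }


# Illustrative data only: two made-up incidents opened at the same time.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    IncidentRecord("INC-1", "team-a", start,
                   start + timedelta(minutes=4), start + timedelta(minutes=50)),
    IncidentRecord("INC-2", "team-b", start,
                   start + timedelta(minutes=10), start + timedelta(minutes=90)),
]
report = review_kpis(sample, max_ack_minutes=5, max_resolve_minutes=120)
print(report)
```

With this sample data the mean time to acknowledge is 7 minutes, which misses the assumed 5-minute target, while the mean time to resolve of 70 minutes is within the assumed 120-minute target. A missed target is a prompt to look for the recurring failure mode or bottleneck behind it.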
Risk / Impact
Level of risk if not implemented: Medium
Impact: Without a defined process, events are handled ad hoc: response is inconsistent, ownership is unclear, and recurring issues go undiagnosed, which increases failure probability over time. These gaps often surface during traffic spikes, major releases, or dependency failures.
Benefits of implementation:
- More predictable operational and engineering outcomes
- Better alignment between architecture decisions and business goals
- Continuous improvement through measurable feedback loops
AWS Services to Consider
Amazon EventBridge
Routes events and triggers automation workflows for rapid operational response.
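As an illustration of event routing, the sketch below defines an EventBridge event pattern that selects CloudWatch alarm state changes entering the ALARM state. The pattern fields follow the published CloudWatch alarm event format; the alarm name and the `matches` helper are assumptions for this example. The helper is a simplified local check that handles only exact-value lists and nested objects, and it is not the EventBridge matching engine.

```python
import json

# Event pattern selecting CloudWatch alarms that transition into ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern, event):
    """Simplified offline matcher (assumption: exact values and nesting only).

    It ignores prefix, numeric, and other content filters that the real
    service supports, and exists only to sanity-check a pattern locally.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical sample events; the alarm name is made up.
alarm_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "example-high-latency", "state": {"value": "ALARM"}},
}
recovered_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "example-high-latency", "state": {"value": "OK"}},
}

# Rules take the pattern as a JSON string.
pattern_json = json.dumps(ALARM_PATTERN)

print(matches(ALARM_PATTERN, alarm_event))
print(matches(ALARM_PATTERN, recovered_event))
```

The JSON string in `pattern_json` is the form a rule definition expects, for example as the `EventPattern` argument when creating a rule. Checking sample events locally first is one way to validate a rule before a controlled rollout.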
AWS Systems Manager Incident Manager
Coordinates incident response with predefined plans, contacts, and timelines.
Amazon SNS
Sends notifications to people and systems for incidents and operational events.
AWS Lambda
Runs event-driven automation without requiring you to manage servers, which makes it well suited to remediation workflows.
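A triage function is one possible shape for such automation. The sketch below assumes the function is invoked with a CloudWatch alarm state change event delivered by EventBridge. The `handler(event, context)` signature is the standard Python Lambda entry point, but the alarm-name prefixes, the `classify` helper, and the decision fields are hypothetical conventions chosen for illustration.

```python
import json

# Hypothetical naming convention: alarm-name prefix decides the classification.
CLASSIFICATION_BY_PREFIX = {"critical-": "incident", "warn-": "event"}


def classify(alarm_name):
    """Map an alarm name to 'incident' or 'event' using the assumed prefixes."""
    for prefix, kind in CLASSIFICATION_BY_PREFIX.items():
        if alarm_name.startswith(prefix):
            return kind
    return "event"  # default: record it, but do not page anyone


def handler(event, context):
    """Lambda entry point: triage an alarm event and return a decision."""
    detail = event.get("detail", {})
    alarm_name = detail.get("alarmName", "unknown")
    state = detail.get("state", {}).get("value", "UNKNOWN")
    # Only alarms currently in ALARM can be escalated to an incident.
    kind = classify(alarm_name) if state == "ALARM" else "event"
    decision = {
        "alarm": alarm_name,
        "state": state,
        "classification": kind,
        "page_on_call": kind == "incident",
    }
    print(json.dumps(decision))  # Lambda sends stdout to its log stream
    return decision


# Local invocation with a made-up event; no AWS access is needed.
sample = {"detail": {"alarmName": "critical-db-connections",
                     "state": {"value": "ALARM"}}}
result = handler(sample, None)
```

Keeping the decision logic in a pure function like `classify` lets it be tested without deploying anything. In a real workflow, the returned decision would feed a notification or an incident response step.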