PERF07 - How do you monitor your resources to ensure they are performing?
Best Practices
This question includes the following best practices:
- PERF07-BP01: Establish key performance indicators (KPIs) to measure workload health and performance
- PERF07-BP02: Use monitoring solutions to understand where performance is most critical
- PERF07-BP03: Define a process to improve workload performance
- PERF07-BP04: Load test your workload
- PERF07-BP05: Use automation to proactively remediate performance-related issues
- PERF07-BP06: Keep your workload and services up-to-date
- PERF07-BP07: Review metrics at regular intervals
Key Concepts
Performance Architecture Fundamentals
Performance telemetry: The metrics, logs, and traces that resources and applications emit. Collect them at a granularity fine enough to expose short spikes, and retain them long enough to compare current behavior against past baselines.
Resource-level KPIs: Measurable indicators for each resource type, such as CPU utilization for compute, latency and IOPS for storage, query latency for databases, and throughput or packet loss for the network. Give each KPI a target value and a named owner.
Alert design: Alarm only on conditions that call for an operator action. Set severity by user impact, and require a breach to persist across several evaluation periods so that a single noisy datapoint does not page anyone.
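As a rough illustration of alert design, the sketch below requires several breaching datapoints within a recent window before it raises an alarm, similar in spirit to the "M out of N" datapoints setting on CloudWatch alarms. It is a local simulation only, not the CloudWatch evaluation algorithm; the function name, parameter names, and default values are assumptions chosen for the example.

```python
def should_alarm(datapoints, threshold, evaluation_periods=5, datapoints_to_alarm=3):
    """Return True when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are above `threshold`."""
    recent = list(datapoints)[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Assumption for this sketch: stay quiet until a full window exists.
        return False
    breaches = sum(1 for value in recent if value > threshold)
    return breaches >= datapoints_to_alarm


# One spike in five periods is ignored; three breaches in five are not.
single_spike = [50, 95, 60, 55, 52]
sustained = [50, 95, 96, 55, 97]
print(should_alarm(single_spike, threshold=90))
print(should_alarm(sustained, threshold=90))
```

Raising `datapoints_to_alarm` trades detection speed for fewer false positives, so the right value depends on how quickly the KPI must be acted on.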
Optimization and Operations
Capacity forecasting: Project future resource demand from historical utilization so that capacity is added before saturation occurs, not after.
Trend analysis: Compare metrics across days, weeks, and releases to spot gradual degradation that point-in-time thresholds miss.
Automated response: Connect well-understood alarms to tested remediation actions, such as scaling out or restarting an unhealthy process, and escalate to a person when the action fails or the cause is unknown.
Implementation Approach
1. Define monitoring strategy
- Select KPIs for compute, storage, database, and network layers
- Set service-level thresholds and alert severities
- Define dashboard standards for each ownership team
- Establish retention and granularity requirements
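The monitoring strategy in step 1 can be captured as data so that thresholds, severities, and ownership are reviewable in one place. The following is a minimal sketch; the `Kpi` class, the KPI names, the threshold values, and the team names are all hypothetical and assume that a higher value is always worse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    name: str
    layer: str       # compute, storage, database, or network
    unit: str
    warning: float   # value at which a warning is raised
    critical: float  # value at which a critical alert is raised
    owner: str       # team responsible for the dashboard and the alert

    def severity(self, value: float) -> str:
        """Classify a measurement, assuming higher values are worse."""
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"


# Illustrative values only; real targets come from the workload's own SLOs.
KPIS = [
    Kpi("cpu_utilization", "compute", "Percent", 70.0, 90.0, "platform-team"),
    Kpi("volume_read_latency", "storage", "Milliseconds", 10.0, 25.0, "platform-team"),
    Kpi("query_latency_p99", "database", "Milliseconds", 200.0, 500.0, "data-team"),
    Kpi("packet_loss", "network", "Percent", 0.5, 2.0, "network-team"),
]

for kpi in KPIS:
    print(kpi.layer, kpi.name, kpi.owner)
```

Keeping the definitions in code or configuration makes threshold changes visible in review, which helps when tuning alerts later.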
2. Instrument resources and services
- Enable native service metrics and logs
- Collect custom application metrics where needed
- Configure tracing for critical workflows
- Capture dependency and downstream performance signals
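For custom application metrics, one option is to write structured log lines in the CloudWatch embedded metric format, which CloudWatch can turn into metrics once the lines reach CloudWatch Logs. The sketch below only builds the JSON record and makes no AWS calls. The namespace, metric name, and dimension are hypothetical, and the field layout reflects my reading of the format, so check it against the current specification before relying on it.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions, timestamp_ms=None):
    """Build one embedded metric format log record as a JSON string."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # milliseconds since the epoch
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set that lists every dimension key.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value is a top-level member keyed by metric name.
        metric_name: value,
    }
    # Dimension values are also top-level members.
    record.update(dimensions)
    return json.dumps(record)


# Hypothetical checkout latency measurement.
print(emf_record("Shop/Checkout", "OrderLatency", 182.5, "Milliseconds",
                 {"Service": "checkout"}))
```

Emitting the value through the log stream keeps the request path free of a synchronous metrics API call, at the cost of depending on log delivery.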
3. Automate detection and response
- Use alarms and event rules for threshold breaches
- Trigger automated remediation for known issues
- Escalate severe events to incident workflows
- Track alert quality and response times
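Step 3 can be sketched as a small Lambda-style handler that receives a CloudWatch alarm state change event from EventBridge and decides whether to remediate, escalate, or ignore it. The alarm names and runbook names are hypothetical, and the handler only returns a decision; starting the runbook (for example through Systems Manager) is left out.

```python
# Hypothetical mapping from alarm name to a known remediation.
RUNBOOKS = {
    "checkout-cpu-high": "scale_out_checkout",
    "orders-db-connections-high": "recycle_connection_pool",
}


def route_alarm(event):
    """Decide what to do with one alarm state change event."""
    detail = event.get("detail", {})
    alarm = detail.get("alarmName", "")
    state = detail.get("state", {}).get("value")
    is_alarm_event = event.get("detail-type") == "CloudWatch Alarm State Change"
    if not is_alarm_event or state != "ALARM":
        return {"action": "ignore", "alarm": alarm}
    runbook = RUNBOOKS.get(alarm)
    if runbook is None:
        # No known fix: hand the event to the incident workflow.
        return {"action": "escalate", "alarm": alarm}
    return {"action": "remediate", "alarm": alarm, "runbook": runbook}


def handler(event, context):
    """Lambda entry point; logs the decision so response times can be tracked."""
    decision = route_alarm(event)
    print(decision)
    return decision


sample = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "checkout-cpu-high", "state": {"value": "ALARM"}},
}
handler(sample, None)
```

Separating the routing decision from the entry point keeps the logic testable without deploying anything.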
4. Review and optimize
- Run periodic capacity and trend reviews
- Tune thresholds to minimize false positives
- Identify recurring hotspots and optimize proactively
- Expand monitoring scope for newly launched components
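A periodic trend review can start with something as simple as comparing the most recent window of a metric with the window before it. The sketch below assumes evenly spaced samples (for example, one p95 latency value per day) and a non-zero baseline; the function name and the sample data are made up for illustration.

```python
from statistics import mean


def degradation_pct(samples, window):
    """Percent change of the latest window's mean versus the previous window's mean."""
    if len(samples) < 2 * window:
        raise ValueError("need at least two full windows of samples")
    baseline = mean(samples[-2 * window:-window])
    recent = mean(samples[-window:])
    # Assumes baseline is non-zero, which holds for latency-style metrics.
    return (recent - baseline) / baseline * 100.0


# Two weeks of hypothetical daily p95 latency in milliseconds.
daily_p95_ms = [100, 100, 100, 100, 100, 100, 100,
                120, 120, 120, 120, 120, 120, 120]
print(degradation_pct(daily_p95_ms, window=7))
```

A positive result that recurs across reviews for the same component is a reasonable signal that it is a hotspot worth optimizing before it breaches a threshold.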
AWS Services to Consider
Amazon CloudWatch
Collects metrics and logs, and provides alarms and dashboards so teams can detect issues early and track operational outcomes.
AWS X-Ray
Traces distributed requests to identify latency bottlenecks and dependency failures across microservices.
Amazon EventBridge
Routes events between services and triggers automated responses for operational events.
AWS Lambda
Runs event-driven code without managing servers, ideal for automation and on-demand operational workflows.
AWS Systems Manager
Provides operational automation, inventory, and runbooks to reduce manual effort and improve day-2 operations.
Common Challenges and Solutions
Challenge: Too many low-value alarms
Solution: Refine thresholds using historical data and remove alerts without clear operator actions.
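One way to refine thresholds from historical data is to place the threshold just above a high percentile of normal behavior, and to keep only alarms that map to an operator action. The following sketch uses a nearest-rank percentile; the 99th percentile, the 10% headroom, and the `runbook` field are assumptions for illustration, not recommended values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty collection of numbers."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def suggest_threshold(history, pct=99.0, headroom=1.1):
    """Threshold set slightly above the chosen percentile of past values."""
    return percentile(history, pct) * headroom


def actionable(alarms):
    """Keep only alarms that name an operator action (hypothetical `runbook` key)."""
    return [alarm for alarm in alarms if alarm.get("runbook")]


# Hypothetical history: CPU percentages observed during normal operation.
history = [35, 40, 42, 38, 55, 61, 47, 44, 39, 58]
print(suggest_threshold(history))
print(actionable([{"name": "cpu-high", "runbook": "scale_out"},
                  {"name": "disk-io-info", "runbook": None}]))
```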
Challenge: Missing application-level metrics
Solution: Instrument business-critical paths beyond infrastructure metrics to catch user-impact issues.
Challenge: Reactive capacity planning
Solution: Use trend analysis and forecast windows to scale before resource saturation occurs.
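A simple form of this forecast fits a straight line to recent utilization and estimates how many periods remain before a limit is reached. The sketch below uses ordinary least squares over evenly spaced samples; the sample values and the 80% limit are invented, and a linear fit is only one possible model, so seasonal or bursty workloads would need something richer.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def periods_until(ys, limit):
    """Periods after the last sample until the fitted line reaches `limit`.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(ys) - 1))


# Hypothetical weekly storage utilization in percent, with an 80% limit.
weekly_utilization = [50, 52, 54, 56, 58]
print(periods_until(weekly_utilization, limit=80))
```

If the estimate falls inside the lead time needed to add capacity, that is the cue to scale now instead of waiting for a saturation alarm.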