PERF07 - How do you monitor your resources to ensure they are performing?

Best Practices

This question includes the following best practices:

PERF07-BP01 - Establish key performance indicators (KPIs) to measure workload health and performance
PERF07-BP02 - Use monitoring solutions to understand where performance is most critical
PERF07-BP03 - Define a process to improve workload performance
PERF07-BP04 - Load test your workload
PERF07-BP05 - Use automation to proactively remediate performance-related issues
PERF07-BP06 - Keep your workload and services up-to-date
PERF07-BP07 - Review metrics at regular intervals

Key Concepts

Performance Architecture Fundamentals

Performance telemetry: The metrics, logs, and traces a workload emits about its latency, throughput, errors, and resource use. Decide which signals each component must emit, and assign an owner who reviews them against expected business outcomes.

Resource-level KPIs: Per-resource measures such as CPU utilization, storage latency, database connections, and network throughput. Give each KPI a measurable target and a clear owner, and review results regularly.

Alert design: The choice of thresholds, evaluation windows, severities, and routing that turns a metric into a notification. Each alert should map to a defined operator action and a team that owns it.

Optimization and Operations

Capacity forecasting: Projecting future resource demand from historical usage so that capacity can be added before a resource saturates. Set a measurable forecast window and review projections against actual usage.

Trend analysis: Comparing metrics over weeks or months to find gradual degradation that point-in-time thresholds miss. Review trends on a regular schedule with the owning team.

Automated response: Remediation that runs without operator intervention when a known condition is detected. Define which conditions qualify, who owns each automation, and how its results are reviewed.

Implementation Approach

1. Define monitoring strategy

Select KPIs for compute, storage, database, and network layers
Set service-level thresholds and alert severities
Define dashboard standards for each ownership team
Establish retention and granularity requirements
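
The strategy items above can be captured as a small, reviewable KPI catalog. The following is a minimal sketch, not a prescribed format: the `Kpi` class, the threshold values, the team names, and the retention figures are all illustrative assumptions, and it assumes that higher readings are worse for every KPI listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Kpi:
    """One KPI with warning and critical thresholds (all values illustrative)."""
    layer: str            # compute, storage, database, or network
    name: str             # metric name as shown on dashboards
    unit: str
    warning: float        # reaching this opens a low-severity alert
    critical: float       # reaching this pages the owning team
    owner: str            # hypothetical team name
    retention_days: int   # how long datapoints are kept
    period_seconds: int   # metric granularity


def classify(kpi: Kpi, value: float) -> Optional[str]:
    """Return the alert severity for a reading, or None when it is healthy."""
    if value >= kpi.critical:
        return "critical"
    if value >= kpi.warning:
        return "warning"
    return None


# Example catalog; names and numbers are assumptions for illustration only.
CATALOG = [
    Kpi("compute", "CPUUtilization", "Percent", 70.0, 90.0, "platform", 90, 60),
    Kpi("database", "ReadLatency", "Milliseconds", 20.0, 50.0, "data", 90, 60),
    Kpi("network", "TargetResponseTime", "Seconds", 0.5, 1.0, "edge", 30, 60),
]
```

Keeping thresholds, owners, and retention in one structure makes it easier to review them together and to generate alarms and dashboards from the same source.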

2. Instrument resources and services

Enable native service metrics and logs
Collect custom application metrics where needed
Configure tracing for critical workflows
Capture dependency and downstream performance signals
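
One way to collect custom application metrics is the CloudWatch embedded metric format (EMF), where a structured JSON log line carries the metric definition and its value. The sketch below builds such a record with the standard library only. The namespace, service, and metric names are hypothetical, and it assumes the log line reaches CloudWatch Logs (for example from a Lambda function's standard output or through the CloudWatch agent).

```python
import json
import time


def emf_record(namespace, service, metric, value, unit="Milliseconds", now=None):
    """Build one embedded metric format (EMF) record as a dict."""
    # EMF expects the timestamp in milliseconds since the Unix epoch.
    timestamp_ms = int((now if now is not None else time.time()) * 1000)
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": metric, "Unit": unit}],
                }
            ],
        },
        "Service": service,  # value of the "Service" dimension
        metric: value,       # value of the metric declared above
    }


def emit(record):
    """Write the record as a single JSON line to standard output."""
    print(json.dumps(record))


if __name__ == "__main__":
    # Hypothetical checkout latency measurement.
    emit(emf_record("Shop/Checkout", "checkout", "CheckoutLatency", 182.5))
```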

3. Automate detection and response

Use alarms and event rules for threshold breaches
Trigger automated remediation for known issues
Escalate severe events to incident workflows
Track alert quality and response times
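
The detection and response flow above could be wired as an EventBridge rule that forwards CloudWatch alarm state changes to a Lambda function. The sketch below shows only the routing decision: known alarms map to a remediation action and unknown ones are escalated. The alarm names, action names, and the `RUNBOOKS` table are hypothetical, and the actual remediation and paging calls are deliberately left out.

```python
# Hypothetical mapping from alarm name to a remediation action.
RUNBOOKS = {
    "checkout-high-cpu": "scale_out",
    "orders-queue-backlog": "add_consumers",
}

# Event pattern that an EventBridge rule could use so that only
# transitions into the ALARM state reach the function.
EVENT_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def handler(event, context=None):
    """Decide what to do with a CloudWatch alarm state change event."""
    detail = event.get("detail", {})
    alarm = detail.get("alarmName", "")
    state = detail.get("state", {}).get("value")

    # Defensive check in case the rule is broader than EVENT_PATTERN.
    if state != "ALARM":
        return {"alarm": alarm, "action": "ignore"}

    action = RUNBOOKS.get(alarm)
    if action is None:
        # No known automated fix: hand over to the incident workflow.
        return {"alarm": alarm, "action": "escalate"}
    return {"alarm": alarm, "action": action}
```

Returning the decision instead of acting on it directly keeps the routing logic easy to test; a real function would call the chosen remediation and record the outcome.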

4. Review and optimize

Run periodic capacity and trend reviews
Tune thresholds to minimize false positives
Identify recurring hotspots and optimize proactively
Expand monitoring scope for newly launched components

AWS Services to Consider

Amazon CloudWatch

Collects metrics, logs, alarms, and dashboards so teams can detect issues early and track operational outcomes.
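
As an illustration of a CloudWatch alarm definition, the sketch below builds the keyword arguments that boto3's `put_metric_alarm` accepts. It is an assumption-laden example: the helper name `alarm_params`, the default periods, and the instance ID are hypothetical, and it requires 3 breaching datapoints out of 5 to reduce noise from short spikes.

```python
def alarm_params(name, namespace, metric, threshold, dimensions,
                 period=60, evaluation_periods=5, datapoints_to_alarm=3,
                 action_arns=()):
    """Build keyword arguments for CloudWatch's PutMetricAlarm operation."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [
            {"Name": key, "Value": value}
            for key, value in sorted(dimensions.items())
        ],
        "Statistic": "Average",
        "Period": period,                          # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,   # datapoints considered
        "DatapointsToAlarm": datapoints_to_alarm,  # breaches needed to alarm
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": list(action_arns),
    }


# Hypothetical instance ID and threshold, for illustration only.
params = alarm_params(
    "web-high-cpu", "AWS/EC2", "CPUUtilization", 90.0,
    {"InstanceId": "i-0123456789abcdef0"},
)

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```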

AWS X-Ray

Traces distributed requests to identify latency bottlenecks and dependency failures across microservices.

Amazon EventBridge

Routes events between services and triggers automated responses for operational events.

AWS Lambda

Runs event-driven code without managing servers, ideal for automation and on-demand operational workflows.

AWS Systems Manager

Provides operational automation, inventory, and runbooks to reduce manual effort and improve day-2 operations.

Common Challenges and Solutions

Challenge: Too many low-value alarms

Solution: Refine thresholds using historical data and remove alerts without clear operator actions.
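
One hedged way to refine a threshold from historical data is to place it slightly above a high percentile of past readings, then count how often the old and new thresholds would have fired. The function names, the 99th percentile, and the 10% headroom below are assumptions chosen for illustration, not recommended values.

```python
import statistics


def suggest_threshold(history, percentile=99, headroom=1.1):
    """Suggest a threshold at the given percentile of history, plus headroom."""
    if len(history) < 2:
        raise ValueError("need at least two historical datapoints")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(history, n=100, method="inclusive")
    return cuts[percentile - 1] * headroom


def breaches(history, threshold):
    """Count datapoints that would have reached the threshold."""
    return sum(1 for value in history if value >= threshold)
```

Comparing `breaches(history, old_threshold)` with `breaches(history, suggest_threshold(history))` gives a rough estimate of how many notifications a change would remove before it is rolled out.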

Challenge: Missing application-level metrics

Solution: Instrument business-critical paths beyond infrastructure metrics to catch user-impact issues.

Challenge: Reactive capacity planning

Solution: Use trend analysis and forecast windows to scale before resource saturation occurs.
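
A simple form of trend-based forecasting fits a straight line to recent daily usage and estimates when it would reach capacity. The sketch below uses an ordinary least-squares fit with evenly spaced daily samples. The function name is hypothetical, and a linear model is an assumption that will not suit seasonal or bursty workloads.

```python
def days_until_saturation(daily_usage, capacity):
    """Estimate days from the last sample until usage reaches capacity.

    Returns None when usage is flat or falling, and 0.0 when the fitted
    line has already reached capacity.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two daily samples")

    # Least-squares fit of usage against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    # Day index where the fitted line meets capacity, relative to the last day.
    return max(0.0, (capacity - intercept) / slope - (n - 1))
```

For example, usage of 50, 52, 54, 56, and 58 units on consecutive days grows by 2 units per day, so a capacity of 80 units would be reached about 11 days after the last sample.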

Related Resources