PERF07 - How do you monitor your resources to ensure they are performing?

Best Practices

This question includes the following best practices:

Key Concepts

Performance Architecture Fundamentals

Performance telemetry: The metrics, logs, and traces that resources and applications emit. Collect them at a granularity fine enough to explain a performance change, and assign an owner to each signal.

Resource-level KPIs: A small set of indicators per resource type, such as utilization, latency, throughput, and error rate. Give each one a measurable target and review it regularly against expected business outcomes.

Alert design: The rules that turn a KPI breach into a notification. Each alert should carry a severity, a clear owner, and a defined operator action.

Optimization and Operations

Capacity forecasting: Projecting future resource demand from historical usage so that capacity can be added before saturation. Define the forecast horizon and the utilization target, and assign an owner for each forecast.

Trend analysis: Reviewing metrics over weeks or months to spot gradual growth or degradation that point-in-time thresholds miss. Review the results regularly against expected business outcomes.

Automated response: Predefined actions, such as scaling or running a runbook, that alarms or events trigger for known issues. Define which conditions are safe to automate and when to escalate to a person.

Implementation Approach

1. Define monitoring strategy

  • Select KPIs for compute, storage, database, and network layers
  • Set service-level thresholds and alert severities
  • Define dashboard standards for each ownership team
  • Establish retention and granularity requirements
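The strategy items above can be captured as data so that thresholds, severities, owners, and retention live in one reviewable place. The following is a minimal sketch: the `Kpi` class, the KPI names, the team names, and every numeric value are hypothetical placeholders, not recommended targets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Kpi:
    """One monitored indicator and its alert thresholds (all values hypothetical)."""
    layer: str            # compute, storage, database, or network
    name: str
    unit: str
    warning: float        # at or above this value, raise a warning
    critical: float       # at or above this value, raise a critical alert
    owner: str            # team that owns the dashboard and alerts
    retention_days: int   # how long datapoints are kept
    period_seconds: int   # collection granularity


KPIS = [
    Kpi("compute", "cpu_utilization", "Percent", 70.0, 90.0, "platform", 90, 60),
    Kpi("storage", "volume_queue_length", "Count", 8.0, 32.0, "platform", 90, 60),
    Kpi("database", "read_latency_ms", "Milliseconds", 20.0, 100.0, "data", 180, 60),
    Kpi("network", "packet_loss", "Percent", 0.5, 2.0, "network", 30, 300),
]


def severity(kpi: Kpi, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a single observation."""
    if value >= kpi.critical:
        return "critical"
    if value >= kpi.warning:
        return "warning"
    return None


if __name__ == "__main__":
    cpu = KPIS[0]
    for sample in (42.0, 75.5, 93.0):
        print(cpu.name, sample, severity(cpu, sample))
```

Keeping the catalog as plain data makes it straightforward to generate dashboards and alarms from the same definitions, although a real catalog would likely live in configuration rather than code.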

2. Instrument resources and services

  • Enable native service metrics and logs
  • Collect custom application metrics where needed
  • Configure tracing for critical workflows
  • Capture dependency and downstream performance signals
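As one way to collect a custom application metric, the sketch below builds a datapoint in the shape expected by the CloudWatch PutMetricData API and publishes it with boto3. The metric name, namespace, and dimension values are hypothetical, and `publish` assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_datum(name: str, value: float, unit: str,
                dimensions: Dict[str, str]) -> Dict[str, Any]:
    """Build one entry for the MetricData list of CloudWatch PutMetricData."""
    return {
        "MetricName": name,
        # Sort dimensions so the same inputs always yield the same payload.
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(namespace: str, data: List[Dict[str, Any]]) -> None:
    """Send datapoints to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_datum works without boto3 installed

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


if __name__ == "__main__":
    datum = build_datum("CheckoutLatency", 182.4, "Milliseconds",
                        {"Service": "checkout", "Stage": "prod"})
    print(datum)
    # publish("ExampleApp/Performance", [datum])  # hypothetical namespace
```

Separating payload construction from the network call keeps the instrumentation testable without AWS access.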

3. Automate detection and response

  • Use alarms and event rules for threshold breaches
  • Trigger automated remediation for known issues
  • Escalate severe events to incident workflows
  • Track alert quality and response times
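A threshold alarm of the kind described above might be defined as follows. This sketch only builds the keyword arguments for the CloudWatch PutMetricAlarm API. The alarm name, instance ID, and SNS topic ARN are hypothetical, and the "3 of 5 datapoints" rule is one possible way to damp one-off spikes rather than a prescribed setting.

```python
from typing import Any, Dict, List


def alarm_kwargs(name: str, namespace: str, metric: str, threshold: float,
                 dimensions: Dict[str, str], actions: List[str],
                 period: int = 300, evaluation_periods: int = 5,
                 datapoints_to_alarm: int = 3) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch PutMetricAlarm.

    The alarm fires when `datapoints_to_alarm` of the last
    `evaluation_periods` period averages exceed `threshold`.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": datapoints_to_alarm,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": list(actions),
    }


if __name__ == "__main__":
    kwargs = alarm_kwargs(
        name="cpu-high-web-1",                                     # hypothetical
        namespace="AWS/EC2",
        metric="CPUUtilization",
        threshold=90,
        dimensions={"InstanceId": "i-0123456789abcdef0"},          # hypothetical
        actions=["arn:aws:sns:us-east-1:123456789012:perf-alerts"],  # hypothetical
    )
    print(kwargs)
    # With boto3 and credentials available:
    # import boto3; boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```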

4. Review and optimize

  • Run periodic capacity and trend reviews
  • Tune thresholds to minimize false positives
  • Identify recurring hotspots and optimize proactively
  • Expand monitoring scope for newly launched components
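A capacity and trend review can start from something as simple as a straight-line fit over recent utilization. The sketch below assumes evenly spaced daily samples and linear growth, which is a deliberate simplification; the function names are illustrative, and seasonal workloads would need a richer model.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(values: Sequence[float], limit: float) -> Optional[float]:
    """Days after the last sample until the fitted trend reaches `limit`.

    Returns None when the trend is flat or falling; otherwise returns 0.0
    if the fitted line has already reached the limit.
    """
    slope, intercept = linear_fit(values)
    if slope <= 0:
        return None
    last_x = len(values) - 1
    return max(0.0, (limit - intercept) / slope - last_x)


if __name__ == "__main__":
    daily_disk_used_pct = [50, 52, 54, 56, 58]  # hypothetical samples
    print(days_until(daily_disk_used_pct, limit=80))
```

For the sample series, which grows by two points per day from 58 percent, the fitted trend reaches 80 percent eleven days after the last sample.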

AWS Services to Consider

Amazon CloudWatch

Collects metrics, logs, alarms, and dashboards so teams can detect issues early and track operational outcomes.

AWS X-Ray

Traces distributed requests to identify latency bottlenecks and dependency failures across microservices.

Amazon EventBridge

Routes events between services and triggers automated responses for operational events.

AWS Lambda

Runs event-driven code without managing servers, which suits automation and on-demand operational workflows.

AWS Systems Manager

Provides operational automation, inventory, and runbooks to reduce manual effort and improve day-2 operations.
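To illustrate how these services can fit together, the sketch below shows an EventBridge event pattern that matches CloudWatch alarms entering the ALARM state, and a Lambda handler that chooses between remediation and escalation. The runbook names and alarm-name prefixes are hypothetical, and the handler only returns a decision instead of calling Systems Manager.

```python
from typing import Any, Dict

# EventBridge rule pattern matching alarms that enter the ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Hypothetical mapping from alarm-name prefix to a runbook name.
RUNBOOKS = {
    "disk-full-": "ExampleCleanTempFiles",
    "queue-backlog-": "ExampleScaleOutWorkers",
}


def handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Lambda entry point: pick a runbook for a known alarm, otherwise escalate."""
    detail = event.get("detail", {})
    alarm = detail.get("alarmName", "")
    state = detail.get("state", {}).get("value", "")
    if state != "ALARM":
        return {"alarm": alarm, "action": "ignore"}
    for prefix, runbook in RUNBOOKS.items():
        if alarm.startswith(prefix):
            # A real handler would start the runbook here, for example through
            # the Systems Manager StartAutomationExecution API.
            return {"alarm": alarm, "action": "remediate", "runbook": runbook}
    return {"alarm": alarm, "action": "escalate"}


if __name__ == "__main__":
    sample = {
        "source": "aws.cloudwatch",
        "detail-type": "CloudWatch Alarm State Change",
        "detail": {"alarmName": "disk-full-web-1", "state": {"value": "ALARM"}},
    }
    print(handler(sample, None))
```

Routing unknown alarms to escalation by default keeps automation limited to issues with a known, tested fix.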

Common Challenges and Solutions

Challenge: Too many low-value alarms

Solution: Refine thresholds using historical data and remove alerts without clear operator actions.
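One way to refine a threshold from historical data is to place it just above a high percentile of past values, so that normal peaks stop triggering alarms. The sketch below uses the nearest-rank percentile; the 99th percentile and the 10 percent margin are illustrative assumptions, not recommended values.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def suggest_threshold(history: Sequence[float], pct: float = 99.0,
                      margin: float = 1.1) -> float:
    """Suggest an alarm threshold slightly above the historical percentile."""
    return percentile(history, pct) * margin


def breaches(history: Sequence[float], threshold: float) -> int:
    """Count how many historical samples would have exceeded the threshold."""
    return sum(1 for v in history if v > threshold)


if __name__ == "__main__":
    latency_ms = list(range(1, 101))  # hypothetical history of 100 samples
    candidate = suggest_threshold(latency_ms)
    print(candidate, breaches(latency_ms, candidate))
```

Replaying a candidate threshold against history with `breaches` gives a rough estimate of how often the alarm would have fired before it is deployed.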

Challenge: Missing application-level metrics

Solution: Instrument business-critical paths beyond infrastructure metrics to catch user-impact issues.

Challenge: Reactive capacity planning

Solution: Use trend analysis and forecast windows to scale before resource saturation occurs.