AWS CloudWatch Fundamentals

CloudWatch is AWS’s observability service — metrics, logs, alarms, dashboards, and traces. It’s not one product but a family of loosely related services that share the CloudWatch brand. Nearly every AWS service publishes to it by default, so CloudWatch is where you look first when something’s wrong.

What CloudWatch is (the big picture)

  • CloudWatch Metrics — numeric time-series (CPU, requests, latency, custom values)
  • CloudWatch Logs — log aggregation from apps, services, agents
  • CloudWatch Alarms — threshold-based notifications on metrics
  • CloudWatch Dashboards — visualisations combining metrics & logs
  • CloudWatch Events / EventBridge — event bus for AWS and app events (EventBridge is the modern name)
  • CloudWatch Logs Insights — SQL-ish query language over logs
  • CloudWatch Synthetics — scripted “canary” checks probing endpoints
  • CloudWatch RUM — real-user monitoring for browser apps
  • CloudWatch Application Signals / ServiceLens — service-level overview
  • X-Ray — distributed tracing (related but technically separate service)
  • Container Insights / Lambda Insights — pre-built dashboards for ECS/EKS/Lambda

Pricing is per sub-service and can add up — logs ingestion and high-resolution custom metrics are the common bill surprises.

Metrics — the numeric spine

A metric is a time series identified by three things:

  • Namespace — e.g. AWS/EC2, AWS/Lambda, Custom/myapp
  • Metric name — CPUUtilization, Invocations, OrderCount
  • Dimensions — key/value pairs further scoping: {InstanceId: i-123}, {FunctionName: my-fn}

Every unique combination of namespace × name × dimension set is a distinct metric (and a distinct bill line for custom metrics).

Resolution and retention

  • Standard resolution: 1-minute granularity
  • High resolution: 1-second granularity (costs ~3× more; used for latency-sensitive alarms)
  • Retention: 15 months automatic (aggregated at coarser intervals over time)
    • 60-second data: retained for 15 days
    • 5-min aggregates: 63 days
    • 1-hour aggregates: 455 days (15 months)

AWS-native vs custom metrics

  • Native — EC2, RDS, ALB, Lambda, S3, etc. publish metrics by default. Basic monitoring = 5-min; detailed = 1-min (paid).
  • Custom — your app publishes via PutMetricData or Embedded Metric Format (EMF) (JSON in logs that CloudWatch auto-parses into metrics — the efficient path for Lambda/containers).
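A minimal sketch of the EMF path: an EMF record is just a JSON log line with an `_aws` envelope that CloudWatch Logs auto-extracts into a custom metric. The names here (`Custom/myapp`, `OrderCount`, the `Service` dimension) are illustrative, not prescribed.

```python
import json
import time

def emf_record(namespace, metric_name, value, dimensions, unit="Count"):
    """Build one Embedded Metric Format log line: plain JSON that
    CloudWatch Logs auto-extracts into a custom metric."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # list of dimension *key sets*; values live at the top level
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,  # the metric value is a top-level member
    }
    record.update(dimensions)  # dimension values are top-level members too
    return json.dumps(record)

# Printed to stdout in Lambda, this line lands in CloudWatch Logs and is
# extracted into an OrderCount metric under the Custom/myapp namespace.
print(emf_record("Custom/myapp", "OrderCount", 3, {"Service": "checkout"}))
```

No PutMetricData API call is needed — the log line itself is the metric, which is why EMF is the cheap, high-throughput path.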

Statistics and math

You query metrics by statistic: Sum, Average, Minimum, Maximum, SampleCount, percentiles (p50, p95, p99).

Metric Math lets you compose expressions: m1 / m2 * 100, ANOMALY_DETECTION_BAND(m1, 2), etc. Useful for derived KPIs in dashboards and alarms.
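As a sketch of how a Metric Math expression is wired up in practice: the `MetricDataQueries` payload below is the shape boto3’s `get_metric_data` expects — two raw metrics referenced by id (`m1`, `m2`) and a derived expression. The Lambda error-rate example and function name are assumptions for illustration.

```python
# Build a MetricDataQueries list: raw inputs with ReturnData=False so only
# the derived error-rate series comes back from get_metric_data.
def error_rate_queries(function_name, period=60):
    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # hide the raw input series
        }
    return [
        stat("m1", "Errors"),
        stat("m2", "Invocations"),
        {"Id": "e1", "Expression": "m1 / m2 * 100", "Label": "Error rate (%)"},
    ]

# cw = boto3.client("cloudwatch")
# cw.get_metric_data(MetricDataQueries=error_rate_queries("my-fn"),
#                    StartTime=..., EndTime=...)
```

The same query structure drives metric-math alarms and dashboard widgets.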

Logs — the string-shaped half

CloudWatch Logs organises log data into:

Log Group   → a logical container (e.g. /aws/lambda/my-fn, /var/log/app)
  Log Stream → a source within the group (e.g. per container, per file)
    Events   → timestamp + message

How logs get in:

  • AWS services publish directly (Lambda, API Gateway, VPC Flow Logs, CloudTrail, etc.)
  • CloudWatch Agent on EC2/on-prem — sends OS and app logs
  • Container log drivers — awslogs driver in Docker/ECS; sidecar/Fluent Bit in EKS
  • Direct API — PutLogEvents

Retention and storage classes

  • Default retention: “never expire” — a classic cost trap. Set a retention policy on every log group you care about.
  • Infrequent Access (IA) class — cheaper ingestion for logs you’ll query rarely, with a reduced feature set
  • Log Group subscriptions — stream logs to Lambda, Kinesis, Firehose, OpenSearch for downstream processing

Logs Insights — the query language

A purpose-built query language for CloudWatch Logs:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)
| sort @timestamp desc
| limit 100

Fast for ad-hoc investigation. Not as powerful as OpenSearch / Splunk for complex correlation, but good enough for most incidents.

Metric filters

A metric filter scans log events and emits a metric when matched. Classic pattern: “count of ERROR messages → custom metric → alarm.”

Filter pattern: ERROR
Value: 1
Namespace: MyApp
Name: ErrorCount

This is how you turn unstructured logs into alertable data.
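The classic pattern above maps directly onto the arguments boto3’s `put_metric_filter` takes; the log group and filter names here are placeholders.

```python
# Metric filter: every log event containing the term ERROR increments
# the MyApp/ErrorCount custom metric by 1.
metric_filter = {
    "logGroupName": "/aws/lambda/my-fn",   # placeholder log group
    "filterName": "error-count",
    "filterPattern": "ERROR",              # term match, not a regex
    "metricTransformations": [{
        "metricName": "ErrorCount",
        "metricNamespace": "MyApp",
        "metricValue": "1",                # each match contributes 1
        "defaultValue": 0,                 # emit 0 when nothing matches, keeping alarms fed
    }],
}
# logs = boto3.client("logs")
# logs.put_metric_filter(**metric_filter)
```

Setting `defaultValue` matters: without it, quiet periods produce *missing* data rather than zeros, which interacts with alarm missing-data handling.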

Alarms

An alarm watches one metric (or one metric-math expression) and transitions through states:

  • INSUFFICIENT_DATA — not enough data points yet
  • OK — within threshold
  • ALARM — threshold breached

Triggers: SNS topic, Auto Scaling action, EC2 action (reboot/stop/terminate/recover), Systems Manager OpsItem, Lambda via EventBridge.

Key knobs

  • Threshold + comparison — > 80, <= 5, etc.
  • Evaluation periods / Datapoints to alarm — “3 out of 5 periods > 80” reduces flappiness
  • Treat missing data — notBreaching / breaching / ignore / missing — critical for alarms that can go quiet intentionally
  • Anomaly detection — ML-based baseline; breach = “outside the expected band” rather than fixed threshold
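The knobs above combine into one `put_metric_alarm` call; the dict below sketches “3 out of 5 one-minute periods above 80, with missing data treated as breaching.” The alarm name, instance id, and SNS ARN are placeholders.

```python
# Alarm definition mirroring boto3's put_metric_alarm keyword arguments.
alarm = {
    "AlarmName": "cpu-high-my-service",      # placeholder name
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-123"}],
    "Statistic": "Average",
    "Period": 60,                            # one-minute periods
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 3,                  # 3 of 5 periods must breach
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "breaching",         # silence counts as a breach
    "AlarmActions": ["arn:aws:sns:..."],     # placeholder SNS topic ARN
}
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```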

Composite alarms

Combine multiple alarms with boolean logic:

(CPUHigh AND MemHigh) OR DiskFull

Useful to reduce noise — only page if multiple symptoms coincide.
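In the API, composite alarms are expressed as an AlarmRule string that wraps child alarm names in ALARM()/OK()/INSUFFICIENT_DATA(); a sketch using the example alarm names:

```python
# Composite alarm combining the CPUHigh / MemHigh / DiskFull child alarms;
# keys mirror boto3's put_composite_alarm arguments.
composite = {
    "AlarmName": "service-degraded",  # placeholder name
    "AlarmRule": "(ALARM(CPUHigh) AND ALARM(MemHigh)) OR ALARM(DiskFull)",
    "AlarmActions": ["arn:aws:sns:..."],  # placeholder SNS topic ARN
}
# boto3.client("cloudwatch").put_composite_alarm(**composite)
```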

CloudWatch Agent — the universal collector

The CloudWatch Agent is a cross-platform daemon for EC2 / on-prem that collects:

  • System metrics not in the default EC2 set (memory, disk, swap, custom procstat)
  • Log files
  • StatsD / collectd metrics

Configured via JSON, installed via SSM or user-data. On modern instances it replaces the older CloudWatch Logs Agent + custom memory-scripts dance.

Why it matters: EC2’s default metrics don’t include memory or disk-space-used — surprising, because those are among the most common alarm targets. Install the agent or live with the blind spots.
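A minimal agent config covering exactly those blind spots plus one log file — paths and the log group name are assumptions; the key names follow the agent’s JSON config schema:

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent"] },
      "disk": { "measurement": ["used_percent"], "resources": ["/"] }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/app.log",
            "log_group_name": "/var/log/app",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```

Stored in SSM Parameter Store, one config can be fanned out to a whole fleet.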

Dashboards

JSON-defined widgets combining metrics, logs, text, and alarms. Share across accounts via cross-account observability. Also programmable via IaC.

Dashboard tip: one dashboard per service/team, not one mega-dashboard. People don’t scroll; they skim.

EventBridge (the CloudWatch Events evolution)

EventBridge is technically a separate service now, but the lineage is “CloudWatch Events + schema registry + SaaS partner buses.” Pattern:

Source → event bus → rule (pattern matching) → target

Typical AWS sources: any AWS service state change, CloudTrail API calls, scheduled rules (cron). Targets: Lambda, SQS, SNS, Step Functions, another event bus.

Canonical uses:

  • “When an EC2 instance stops, do X” — without writing a poller
  • Scheduled jobs (cron replacements)
  • Cross-account event routing via bus-to-bus
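The first canonical use sketched concretely: an EventBridge rule matches on an event pattern, where each listed value is an OR-match. The rule name and target wiring below are illustrative.

```python
import json

# Event pattern matching "an EC2 instance entered the stopped state".
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

# events = boto3.client("events")
# events.put_rule(Name="ec2-stopped", EventPattern=json.dumps(pattern))
# events.put_targets(Rule="ec2-stopped",
#                    Targets=[{"Id": "1", "Arn": "arn:aws:lambda:..."}])
```

No polling loop: EC2 publishes the state change, the bus matches it, the target fires.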

Cost traps

  1. Custom metrics bill. Each unique namespace × name × dimension set costs ~$0.30/month baseline, plus per-request charges on PutMetricData calls. Apps emitting metrics with high-cardinality dimensions (request ID, user ID) generate thousands of distinct metrics. Use EMF + aggregation, and keep dimensions coarse.
  2. Log retention “never expire” — set retention on every log group. Old logs that nobody reads still cost storage.
  3. High-resolution metrics cost more — don’t use 1-sec resolution unless your alarm needs sub-minute reaction.
  4. Container Insights enhanced mode charges per container — reasonable for prod, expensive across dev clusters.
  5. Cross-region dashboards / metric queries — pull data from each region; noisier billing line.

How everything fits together

   APP / LAMBDA / EC2
   │    │    │   │
   │    │    │   └──→ CloudWatch Agent ──→ Logs + System Metrics
   │    │    └──────→ Service Native ────→ Namespace metrics
   │    └──────────── EMF JSON ──────────→ Auto-parsed custom metrics
   └───────────────── PutMetricData ─────→ Custom metrics

           Logs  ──→  Metric Filter  ──→  Alarm ──→  SNS ──→  PagerDuty
           Metrics ─→  Alarm         ──→  Auto Scaling / SNS / EventBridge
           Metrics ─→  Dashboard
           Events ──→  EventBridge   ──→  Lambda / Target
           Alarms ──→  Composite     ──→  Single noise-reduced page

Common pitfalls

  1. Missing the blind spots. EC2 doesn’t emit memory metrics — install CW Agent.
  2. Alarms on Average when you want Max or p99. Average hides tail spikes.
  3. Alarms going silent during outages. If nothing invokes your Lambda (say an upstream dependency breaks), Invocations = 0 means no data points are published at all, so “Errors > threshold” never fires — there’s nothing to evaluate. Use Treat missing data as breaching.
  4. Infinite log retention. Set it or pay forever.
  5. High-cardinality custom dimensions exploding the metric count. Use EMF with aggregation.
  6. Alarms on the wrong metric. E.g. “CPU > 80%” on an auto-scaling web tier — irrelevant; what matters is request latency or 5xx rate. Align alarms with SLOs.
  7. Dashboards nobody reads. Focus on the handful of metrics that tell the SLO story.

Mental model

  • Metrics = numbers in time. Alarms watch them. Dashboards display them.
  • Logs = strings. Logs Insights queries them; metric filters convert them to numbers.
  • Alarms = eventing layer bridging metrics to humans / actions.
  • EventBridge = AWS + SaaS event bus that uses CloudWatch’s rule engine.
  • X-Ray = traces — spans joining across services.
  • Everything flows here by default — which is why the first place to investigate an AWS incident is CloudWatch, and the first ops lesson in AWS is “set alarms, retention, and the agent.”

See also