Observability: metrics, logs, traces
What each is for and how they fit together.
Observability is the discipline of being able to understand a running system from the outside. The three classic pillars (metrics, logs, traces) are not interchangeable; each answers a different kind of question, costs a different amount, and shines in different parts of an incident. A real production system uses all three, and the senior skill is knowing which one to reach for first when something breaks.
The three pillars at a glance
flowchart TB
subgraph M["Metrics — counters and gauges over time"]
direction LR
M1[("'requests per second',<br/>'p99 latency', 'error rate'")]:::infra
M2[("aggregated, low cardinality<br/>cheap to store forever")]:::strong
M3["best at: dashboards, alerting,<br/>trend analysis"]:::strong
end
subgraph L["Logs — discrete events with context"]
direction LR
L1[("'user 42 tried to delete project 99<br/>and got 403 at 14:23:17'")]:::infra
L2[("high detail per event<br/>expensive at high volume")]:::mid
L3["best at: incident detective work,<br/>'what exactly happened?'"]:::strong
end
subgraph T["Traces — one request across many services"]
direction LR
T1[("'request abc spent 12 ms in API,<br/>800 ms in DB, 200 ms in payments'")]:::infra
T2[("structured causal graph<br/>often sampled to control cost")]:::mid
T3["best at: latency hunts,<br/>cross-service flow questions"]:::strong
end
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef mid fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef strong fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where the time went. They are complementary, not redundant.
Metrics: the dashboard layer
Metrics are numbers sampled over time, aggregated. They compress huge amounts of activity into small, cheap numbers you can chart and alert on. Each one is a time series.
Three useful types:
- Counters. “Total HTTP requests served.” Always go up.
- Gauges. “Current open connections.” Goes up and down.
- Histograms. “Latency distribution.” Lets you compute p50, p99, max.
flowchart LR
APP[("Application emits<br/>requests_total, latency_ms,<br/>queue_depth, ...")]:::server
AG[["Aggregator (Prometheus,<br/>Datadog agent, OTel collector)"]]:::infra
TS[("Time-series database<br/>Prometheus, InfluxDB,<br/>cloud-native TSDB")]:::store
DASH(["Dashboards + alerts"]):::client
APP ==> AG ==> TS ==> DASH
classDef server fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef store fill:#e9d5ff,stroke:#7e22ce,color:#581c87,stroke-width:1.5px
classDef client fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
A million requests become 60 datapoints per minute per metric. Storage is tiny. Alerts can be written cleanly: “fire when p99 latency > 500 ms for 5 minutes.” Metrics are how you find out anything is wrong at all. See Time-series databases.
Logs: the detective layer
A log line is a single event with structured (or semi-structured) fields: timestamp, level, message, context. Logs are detailed but voluminous; every line is a few bytes per request, multiplied by every request.
Once you know something is wrong (from a metric), logs are where you go to find out what specifically happened. “Show me every error from the orders service in the last 10 minutes” is a log query.
Modern practice: structured logs. JSON, with consistent field names. So you can filter by user_id, request_id, tenant_id instead of grepping unstructured text.
flowchart LR
APP[("Application emits<br/>JSON log lines")]:::server
SHIP[["Log shipper<br/>(Fluent Bit, Vector,<br/>Filebeat, OTel)"]]:::infra
STORE[("Log store<br/>Loki, Elasticsearch, CloudWatch,<br/>BigQuery, Datadog")]:::store
UI(["Query UI"]):::client
APP ==> SHIP ==> STORE ==> UI
classDef server fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef store fill:#e9d5ff,stroke:#7e22ce,color:#581c87,stroke-width:1.5px
classDef client fill:#dbeafe,stroke:#1e40af,color:#1e3a8a,stroke-width:1.5px
Logs grow fast. A modest service can produce gigabytes per day. Tiered storage (hot for 14 days, warm for 90, cold for compliance) is the standard answer. See Hot, warm, cold storage tiers.
Traces: the cross-service layer
A trace follows one request across many services. Each hop is a “span” with a start time, duration, and parent. The result is a tree showing where a single user request spent its time.
flowchart TB
subgraph T["A single trace for one request — 1.2 seconds total"]
direction TB
SP1["API gateway<br/>span: 0 - 1200 ms"]:::infra
SP2["auth service<br/>span: 5 - 35 ms (30 ms)"]:::server
SP3["product service<br/>span: 40 - 800 ms (760 ms)"]:::server
SP4["database query<br/>span: 100 - 780 ms (680 ms)"]:::store
SP5["recommender<br/>span: 820 - 1180 ms (360 ms)"]:::server
SP1 --> SP2
SP1 --> SP3
SP3 --> SP4
SP1 --> SP5
end
NOTE["The database query was 680 ms of the 1200 ms request.<br/>That is where the optimisation work belongs."]:::infra
T --> NOTE
classDef infra fill:#fef3c7,stroke:#a16207,color:#713f12,stroke-width:1.5px
classDef server fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1.5px
classDef store fill:#e9d5ff,stroke:#7e22ce,color:#581c87,stroke-width:1.5px
Traces are how you answer “this one request was slow; where did the time go?” without grepping logs across five services and stitching timestamps by hand. They are essential the moment you have more than a couple of services. See Microservices vs monolith.
Because traces store per-request structured data, they get expensive at scale. Sampling is standard: keep 1-5% of traces, plus 100% of error or slow traces. Open standards (OpenTelemetry) handle the wiring.
How they fit together
In a real incident, you traverse them in order:
sequenceDiagram
autonumber
participant ONCALL as On-call engineer
participant MET as Metrics dashboard
participant TR as Traces
participant LG as Logs
Note over ONCALL: alert fires: p99 latency spiked
ONCALL->>MET: which endpoint? which region?
MET-->>ONCALL: /api/orders, eu-west, started 14:20
Note over ONCALL: now find an example
ONCALL->>TR: slow traces for /api/orders since 14:20
TR-->>ONCALL: 12 traces, all stuck in DB span for ~3s
Note over ONCALL: what exactly did the DB say?
ONCALL->>LG: logs for trace_id = abc123
LG-->>ONCALL: "query waiting on lock; pid 4711 holding"
Note over ONCALL: root cause: a long-running migration holding a row lock
Metric → trace → log is the standard incident path. Each layer narrows the question. You cannot do this from any one of them alone; you need the chain.
The fourth pillar most teams forget: events
Beyond the three pillars, the change log of the system (deploys, feature flag flips, infrastructure changes, scaling events) is usually the answer to “what changed at 14:20?” Wire deploys into the same observability stack so a spike on a chart is visually next to “deploy v452 at 14:19.” Half of all incidents are explained the moment you correlate with a deploy.
What this connects to
- Time-series databases. Where metrics live. See Time-series databases.
- Health checks. Liveness, readiness, and startup probes are themselves observability signals. See Health checks: liveness vs readiness vs startup.
- Microservices vs monolith. Tracing is essential the moment you split. See Microservices vs monolith.
- Storage tiers. Logs and traces are classic candidates for tiered retention. See Hot, warm, cold storage tiers.
- Circuit breaker. Breakers should expose state as metrics. See Circuit breaker.
Common mistakes
- Logs as your only observability tool. Searching gigabytes of text to find a p99 spike is slow and expensive. Add metrics.
- No correlation IDs. Without a
request_idortrace_idfield on every log line, you cannot tie events across services. Inject one at the edge; propagate it through every hop. - High-cardinality labels on metrics. One metric per user means millions of series. Costs explode. Reserve cardinality for trace data, not metrics.
- Sampling traces by random chance without keeping errors. Sampling at 1% means most failing traces are gone. Always keep 100% of errors and slow requests.
- Logging secrets. Tokens, passwords, full request bodies. The logs become a liability the moment they leak. Redact at emit time. See Secrets management.
- No alert review. Alerts that fire constantly are ignored constantly. Aggressively tune and delete the ones nobody acts on.
- Three separate vendors for the three pillars with no glue. The correlation step at 3 AM is what saves you; if metric → trace → log requires three logins and re-typing IDs, you have failed.
Quick recap
- Metrics: numbers over time. Cheap, dashboards, alerts. Tells you something is wrong.
- Logs: structured events with context. Detailed, expensive at scale. Tells you what happened.
- Traces: one request across many services. Causal graph. Tells you where the time went.
- Use all three. Correlate via request_id / trace_id on every event.
- Wire deploys and feature flag changes into the same view; half of all incidents trace to “what just changed?”
This concept sits in Stage 4 (Scaling and reliability) of the System Design Roadmap.