Implementing the Three Pillars of Observability

Nov 9, 2021

Everyone talks about observability, but most organisations have monitoring with extra steps. We spent a year building genuine observability and learned what actually matters.

The Problem

We had monitoring. Lots of it. Dashboards for everything. Alert fatigue was constant. When incidents occurred, we could see something was wrong but not why.

Logs lived in one system, metrics in another, traces in a third (when they existed at all). Correlating across them required mental gymnastics and tribal knowledge. Junior engineers couldn’t debug production issues without senior help.

The data existed but didn’t connect. We had visibility without understanding.

Our Solution

Correlation IDs everywhere became the foundation. Every request received a unique trace ID at the edge, propagated through every service, included in every log line. This single change transformed debugging from archaeology to direct lookup.
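As a minimal sketch of that propagation, assuming a hypothetical `X-Trace-Id` header (the post does not name the header or library actually used), a service can reuse the inbound ID when one is present and mint one only at the edge. The function names here are illustrative, and header lookup is case-sensitive for brevity.

```python
import contextvars
import uuid

# Hypothetical header name; the post does not say which one was used.
TRACE_HEADER = "X-Trace-Id"

# Holds the trace ID for the request currently being handled.
_trace_id = contextvars.ContextVar("trace_id", default=None)


def accept_request(headers):
    """Reuse the caller's trace ID, or mint one if this is the edge."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers(extra=None):
    """Headers for a downstream call, carrying the current trace ID."""
    headers = dict(extra or {})
    trace_id = _trace_id.get()
    if trace_id is not None:
        headers[TRACE_HEADER] = trace_id
    return headers


# At the edge there is no inbound ID, so one is generated.
edge_id = accept_request({})
# A downstream service receives the propagated header and reuses the same ID.
downstream_id = accept_request(outgoing_headers({"Accept": "application/json"}))
```

Using a context variable rather than passing the ID through every function signature keeps the ID available to logging and HTTP client code without touching business logic.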

Prometheus and Grafana handled metrics. We standardised on RED metrics (Rate, Errors, Duration) for services and USE metrics (Utilisation, Saturation, Errors) for resources. Consistent naming conventions meant dashboards could be templated.
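To illustrate why consistent naming makes templating possible, here is a rough sketch that generates the three RED queries for any service from its name alone. The metric names `http_requests_total` and `http_request_duration_seconds_bucket` and the `service` and `code` labels are assumptions for the example, not the conventions described in this post.

```python
# Assumed naming convention (hypothetical, not taken from the post):
#   http_requests_total                   counter, labels: service, code
#   http_request_duration_seconds_bucket  histogram buckets, labels: service, le


def red_queries(service, window="5m"):
    """Return PromQL strings for Rate, Errors and Duration of one service."""
    sel = f'service="{service}"'
    return {
        "rate": f"sum(rate(http_requests_total{{{sel}}}[{window}]))",
        "errors": (
            f'sum(rate(http_requests_total{{{sel},code=~"5.."}}[{window}]))'
        ),
        "duration_p99": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{{{sel}}}[{window}])))"
        ),
    }


queries = red_queries("checkout")
for name, query in queries.items():
    print(f"{name}: {query}")
```

In Grafana the same effect usually comes from a dashboard variable substituted into the label selector; the point is that one template serves every service only if every service exposes the same metric names.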

Structured logging replaced freeform text. JSON logs with consistent fields enabled querying without regex. The same trace ID assigned at the edge appeared in every log line, so all logs for a single request could be pulled with one query.
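A minimal sketch of this kind of logging with Python's standard `logging` module might look like the following. The field names (`ts`, `level`, `logger`, `message`, `trace_id`), the service name, and the sample trace ID are assumptions for illustration, not the schema used here.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Supplied via `extra=`; None if the caller did not set one.
            "trace_id": getattr(record, "trace_id", None),
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("checkout")  # hypothetical service name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

log.info("payment authorised", extra={"trace_id": "4bf92f3577b34da6"})

# Each line parses back into fields, so no regex is needed to query it.
entry = json.loads(stream.getvalue())
```

Because every line is a JSON object with the same keys, a log backend can index `trace_id` directly and filter on it as an exact match.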

Distributed tracing with Jaeger showed request flow across services. When latency spiked, traces revealed which service in the chain was slow. No more guessing.
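The reasoning a trace view supports can be sketched in a few lines. This is not Jaeger's API or data model: it is a simplified, hypothetical `Span` record with made-up durations, used to show how subtracting child time from each span's duration points at the service doing the slow work.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


def slowest_by_self_time(spans):
    """Return (service, self_time_ms) for the span that spent the most time
    in its own work, i.e. its duration minus that of its direct children.
    Assumes children run sequentially, not in parallel."""
    child_total = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_ms
    worst = max(spans, key=lambda s: s.duration_ms - child_total[s.span_id])
    return worst.service, worst.duration_ms - child_total[worst.span_id]


# Made-up trace: gateway -> checkout -> (inventory, payments)
trace = [
    Span("a", None, "gateway", 480.0),
    Span("b", "a", "checkout", 470.0),
    Span("c", "b", "inventory", 30.0),
    Span("d", "b", "payments", 410.0),
]
result = slowest_by_self_time(trace)
```

The gateway span is the longest overall, but almost all of its time is spent waiting on children; measured by self time, the hypothetical payments service is the one to investigate.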

Linking between systems completed the picture. Grafana annotations showed deployments on metric graphs. Links on trace spans opened pre-filtered log views. Each pillar referenced the others.
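One way such a span-to-logs link could be built is sketched below. The log viewer address and its `query`, `from` and `to` parameters are hypothetical stand-ins; the real URL format depends on whichever log backend is in use.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical log viewer address and query parameter names.
LOG_VIEWER = "https://logs.example.internal/search"


def log_link(trace_id, span_start, span_duration_ms, padding_s=5):
    """URL for a log search filtered to one trace, around one span."""
    pad = timedelta(seconds=padding_s)
    start = span_start - pad
    end = span_start + timedelta(milliseconds=span_duration_ms) + pad
    params = {
        "query": f'trace_id="{trace_id}"',
        "from": start.isoformat(),
        "to": end.isoformat(),
    }
    return f"{LOG_VIEWER}?{urlencode(params)}"


link = log_link(
    "4bf92f3577b34da6",
    datetime(2021, 11, 9, 12, 0, 0, tzinfo=timezone.utc),
    span_duration_ms=410,
)
```

Padding the time window by a few seconds on each side is a design choice to catch log lines written just before or after the span's recorded boundaries.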

The Benefits

Mean time to diagnosis dropped dramatically. An alert fires, we click through to the affected time range, see the correlated traces, jump to the relevant logs, and understand the issue, all within minutes.

On-call rotations became sustainable. Engineers no longer needed years of context to debug effectively. The systems guide investigation rather than requiring intuition.

Proactive identification improved. Unusual patterns in traces surface before they become incidents. Gradual degradation shows in metrics before users notice.

Observability isn’t about having more data. It’s about having connected data that answers questions you haven’t thought to ask yet.