Observability That Pays Off: OpenTelemetry, SLOs, On-Call
How to wire OpenTelemetry, SLIs/SLOs and error budgets into a sustainable on-call practice — without drowning your team in dashboards.
Most teams don't have an observability problem. They have a signal-to-noise problem. Dashboards multiply, alerts fire at 3 a.m. for symptoms nobody owns, and the on-call rotation slowly erodes engineering morale. The fix isn't more tooling — it's a coherent stack built around three things: OpenTelemetry as the data plane, SLOs as the contract, and error budgets as the decision-making tool.
Here's how senior teams are putting these pieces together in 2025.
OpenTelemetry: stop instrumenting twice
If you're still running a Datadog agent for metrics, Fluent Bit for logs, and a vendor-specific tracer for APM, you're paying the triple instrumentation tax. OpenTelemetry (OTel) is now stable across traces, metrics and logs (logs reached GA in the SDK in 2024), and it's the de facto standard backed by AWS, Google, Microsoft, Datadog, Grafana and Splunk.
The pattern that works:
- Instrument applications once with OTel SDKs (auto-instrumentation for Java, .NET, Python, Node); see the deployment sketch after this list.
- Send everything to an OpenTelemetry Collector (gateway mode).
- Fan out from the Collector to your backends — Tempo/Jaeger for traces, Prometheus/Mimir for metrics, Loki/Elastic for logs.
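For the first step, the per-service wiring on Kubernetes can be a handful of environment variables that every OTel SDK reads. This is a minimal sketch; the `otel-collector.observability` Service name and the version/environment values are assumptions:

```yaml
# Deployment env for an auto-instrumented service, pointing at the Collector gateway
env:
  - name: OTEL_SERVICE_NAME
    value: checkout
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.observability:4317   # gateway Service (assumed name)
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: deployment.environment=production,service.version=1.4.2
```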
A minimal Collector pipeline that adds resource attributes and tail-samples noisy traces:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push

service:
  pipelines:
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlphttp/tempo] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite] }
```
The payoff: vendor lock-in disappears, and you can renegotiate observability contracts without re-instrumenting 200 services.
The three pillars, ranked by ROI
Not all telemetry is equally valuable. Across dozens of platform teams, the practical hierarchy looks like this:
| Signal  | Cost   | Debug value | Best for                         |
|---------|--------|-------------|----------------------------------|
| Metrics | Low    | Medium      | SLOs, capacity, alerting         |
| Traces  | Medium | High        | Latency root cause, dependencies |
| Logs    | High   | High        | Forensics, business events       |
A common anti-pattern: teams ingesting 2 TB/day of logs and 50 GB/day of traces. Invert it: sample logs aggressively (or move to structured events à la Honeycomb), keep high-cardinality traces with tail sampling, and rely on metrics for anything that pages someone.
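A minimal sketch of that inversion at the Collector level, extending the pipeline above: the probabilistic_sampler processor also applies to a logs pipeline. The 20% rate and the Loki OTLP endpoint are assumptions, not recommendations.

```yaml
processors:
  probabilistic_sampler/logs:
    sampling_percentage: 20        # keep ~20% of log records (tune per service)
    # samples on the trace ID by default; see attribute_source for trace-less records

exporters:
  otlphttp/loki:
    endpoint: http://loki:3100/otlp   # assumes Loki 3.x native OTLP ingestion

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [probabilistic_sampler/logs, batch]
      exporters: [otlphttp/loki]
```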
SLIs and SLOs: the contract with reality
An SLI (Service Level Indicator) is a measurement. An SLO (Objective) is a target. The discipline isn't picking 99.9% vs 99.95% — it's picking SLIs that actually correlate with user pain.
Good SLIs share three properties:
- User-centric: measured as close to the user as possible (ingress, mobile RUM), not at the database.
- Symptom, not cause: "checkout p95 latency > 800ms" beats "Redis CPU > 80%".
- Aggregable over a window: typically 28 or 30 rolling days.
A Prometheus recording rule for an availability SLI:
```yaml
- record: sli:checkout_availability:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{service="checkout",code!~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="checkout"}[5m]))
```
Tools like Sloth or Pyrra generate the burn-rate alerts from a YAML SLO definition — don't write them by hand. Multi-window, multi-burn-rate alerts (the Google SRE workbook pattern) are non-negotiable if you want to alert on trend rather than spike.
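As a sketch of what such a definition looks like in Sloth (the service, queries and alert names are assumptions; Sloth expands this into recording rules plus multi-window, multi-burn-rate alerts):

```yaml
version: "prometheus/v1"
service: checkout
slos:
  - name: availability
    objective: 99.9
    description: Proportion of checkout requests served without a 5xx.
    sli:
      events:
        error_query: sum(rate(http_requests_total{service="checkout",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutAvailabilityBurn
      page_alert:
        labels:
          severity: page       # fast burn: wake someone up
      ticket_alert:
        labels:
          severity: ticket     # slow burn: handle during working hours
```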
Error budgets: the only governance tool that works
A 99.9% SLO over 30 days allows about 43 minutes of downtime. That's your error budget. It's not aspirational; it's a political instrument:
- Budget healthy → ship faster, run chaos experiments, deploy on Fridays.
- Budget burned → freeze feature work, prioritize reliability, postpone risky migrations.
This stops the eternal argument between product and SRE. The data decides. In practice, expose burn rate on a shared dashboard and write a simple policy in your engineering handbook (e.g. "two consecutive months of budget exhaustion triggers a reliability sprint").
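For the shared dashboard, a single recording rule is enough to expose burn rate. This sketch assumes a 1-hour variant of the availability SLI above (sli:checkout_availability:ratio_rate1h is hypothetical, defined like the 5m rule):

```yaml
# Burn rate = observed error ratio / error ratio allowed by the SLO.
# 1.0 burns the budget exactly over the window; a sustained 14.4 over 1h
# exhausts a 30-day budget in roughly two days.
- record: slo:checkout_availability:burn_rate1h
  expr: |
    (1 - sli:checkout_availability:ratio_rate1h) / (1 - 0.999)
```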
On-call that doesn't burn people out
Observability fails the moment your on-call engineer can't answer "is this real, and what do I do?" in under 60 seconds. Here is the checklist we apply to every alert before it ships to PagerDuty / Opsgenie / Grafana OnCall (an example rule that passes it follows the list):
- [ ] Tied to an SLO or a hard safety constraint (no "CPU > 80%" alerts).
- [ ] Has a runbook URL in the alert payload.
- [ ] Includes a direct link to the relevant trace/dashboard with time range pre-filled.
- [ ] Has a defined owner team (not a person).
- [ ] Reviewed in the weekly on-call retro; flapping alerts are deleted or fixed.
- [ ] Severity matches business impact (SEV1 wakes people, SEV3 waits for morning).
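A sketch of an alert rule that satisfies the checklist, built on the burn-rate rule above; the URLs, team label and threshold are assumptions (a production setup would pair this fast-burn window with a longer one, as generated by Sloth or Pyrra):

```yaml
- alert: CheckoutAvailabilityFastBurn
  expr: slo:checkout_availability:burn_rate1h > 14.4
  for: 5m
  labels:
    severity: page               # SEV1: this one wakes someone up
    team: payments               # owner is a team, not a person
  annotations:
    summary: "Checkout is burning its 30-day error budget ~14x too fast"
    runbook_url: https://runbooks.example.com/checkout/availability            # hypothetical
    dashboard: https://grafana.example.com/d/checkout-slo?from=now-1h&to=now   # hypothetical
```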
Track two metrics about your on-call itself: alerts per shift (target: < 2) and % actionable (target: > 75%). If either drifts, you have a tuning problem, not a staffing problem.
Key takeaways
- Standardize on OpenTelemetry — single instrumentation, swappable backends, no more vendor tax.
- Pick SLIs that mirror user experience, not infrastructure health; let SLOs and burn-rate alerts drive paging.
- Error budgets are the contract between product velocity and reliability — make them visible and binding.
- Every alert needs an owner, a runbook and an SLO link, or it shouldn't exist.
- Measure your on-call quality (alerts/shift, % actionable) the same way you measure your services.