Observability That Pays Off: OpenTelemetry, SLOs, On-Call
How to wire OpenTelemetry, SLIs/SLOs and error budgets into a sustainable on-call practice — without drowning your team in dashboards.
Most teams don't have an observability problem. They have a signal-to-noise problem. Dashboards multiply, alerts fire at 3 a.m. for symptoms nobody owns, and the on-call rotation slowly erodes engineering morale. The fix isn't more tooling — it's a coherent stack built around three things: OpenTelemetry as the data plane, SLOs as the contract, and error budgets as the decision-making tool.
Here's how senior teams are putting these pieces together in 2025.
OpenTelemetry: stop instrumenting twice
If you're still running a Datadog agent for metrics, Fluent Bit for logs, and a vendor-specific tracer for APM, you're paying the triple instrumentation tax. OpenTelemetry (OTel) is now stable across traces, metrics and logs (logs reached GA in the SDK in 2024), and it's the de facto standard backed by AWS, Google, Microsoft, Datadog, Grafana and Splunk.
The pattern that works:
- Instrument applications once with OTel SDKs (auto-instrumentation for Java, .NET, Python, Node); see the deployment sketch after this list.
- Send everything to an OpenTelemetry Collector (gateway mode).
- Fan out from the Collector to your backends — Tempo/Jaeger for traces, Prometheus/Mimir for metrics, Loki/Elastic for logs.
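For the first step, the per-service wiring on Kubernetes can be a handful of environment variables that every OTel SDK reads. This is a minimal sketch; the `otel-collector.observability` Service name and the version/environment values are assumptions:

```yaml
# Deployment env for an auto-instrumented service, pointing at the Collector gateway
env:
  - name: OTEL_SERVICE_NAME
    value: checkout
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.observability:4317   # gateway Service (assumed name)
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: deployment.environment=production,service.version=1.4.2
```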
A minimal Collector pipeline that adds resource attributes and tail-samples noisy traces:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push

service:
  pipelines:
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlphttp/tempo] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite] }
```
The payoff: vendor lock-in disappears, and you can renegotiate observability contracts without re-instrumenting 200 services.
The three pillars, ranked by ROI
Not all telemetry is equally valuable. Across dozens of platform teams, the practical hierarchy looks like this:
| Signal  | Cost   | Debug value | Best for                         |
|---------|--------|-------------|----------------------------------|
| Metrics | Low    | Medium      | SLOs, capacity, alerting         |
| Traces  | Medium | High        | Latency root cause, dependencies |
| Logs    | High   | High        | Forensics, business events       |
A common anti-pattern: teams ingesting 2 TB/day of logs and 50 GB/day of traces. Invert it: sample logs aggressively (or move to structured events à la Honeycomb), keep high-cardinality traces with tail sampling, and rely on metrics for anything that pages someone.
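A minimal sketch of that inversion at the Collector level, extending the pipeline above: the probabilistic_sampler processor also applies to a logs pipeline. The 20% rate and the Loki OTLP endpoint are assumptions, not recommendations.

```yaml
processors:
  probabilistic_sampler/logs:
    sampling_percentage: 20        # keep ~20% of log records (tune per service)
    # samples on the trace ID by default; see attribute_source for trace-less records

exporters:
  otlphttp/loki:
    endpoint: http://loki:3100/otlp   # assumes Loki 3.x native OTLP ingestion

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [probabilistic_sampler/logs, batch]
      exporters: [otlphttp/loki]
```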
SLIs and SLOs: the contract with reality
An SLI (Service Level Indicator) is a measurement. An SLO (Objective) is a target. The discipline isn't picking 99.9% vs 99.95% — it's picking SLIs that actually correlate with user pain.
Good SLIs share three properties:
- User-centric: measured as close to the user as possible (ingress, mobile RUM), not at the database.
- Symptom, not cause: "checkout p95 latency > 800ms" beats "Redis CPU > 80%".
- Aggregable over a window: typically 28 or 30 rolling days.
A Prometheus recording rule for an availability SLI:
```yaml
- record: sli:checkout_availability:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{service="checkout",code!~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="checkout"}[5m]))
```
Tools like Sloth or Pyrra generate the burn-rate alerts from a YAML SLO definition — don't write them by hand. Multi-window, multi-burn-rate alerts (the Google SRE workbook pattern) are non-negotiable if you want to alert on trend rather than spike.
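As a sketch of what such a definition looks like in Sloth (the service, queries and alert names are assumptions; Sloth expands this into recording rules plus multi-window, multi-burn-rate alerts):

```yaml
version: "prometheus/v1"
service: checkout
slos:
  - name: availability
    objective: 99.9
    description: Proportion of checkout requests served without a 5xx.
    sli:
      events:
        error_query: sum(rate(http_requests_total{service="checkout",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutAvailabilityBurn
      page_alert:
        labels:
          severity: page       # fast burn: wake someone up
      ticket_alert:
        labels:
          severity: ticket     # slow burn: handle during working hours
```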
Error budgets: the only governance tool that works
A 99.9% SLO over 30 days allows about 43 minutes of downtime. That's your error budget. It's not aspirational; it's a political instrument:
- Budget healthy → ship faster, run chaos experiments, deploy on Fridays.
- Budget burned → freeze feature work, prioritize reliability, postpone risky migrations.
This stops the eternal argument between product and SRE. The data decides. In practice, expose burn rate on a shared dashboard and write a simple policy in your engineering handbook (e.g. "two consecutive months of budget exhaustion triggers a reliability sprint").
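For the shared dashboard, a single recording rule is enough to expose burn rate. This sketch assumes a 1-hour variant of the availability SLI above (sli:checkout_availability:ratio_rate1h is hypothetical, defined like the 5m rule):

```yaml
# Burn rate = observed error ratio / error ratio allowed by the SLO.
# 1.0 burns the budget exactly over the window; a sustained 14.4 over 1h
# exhausts a 30-day budget in roughly two days.
- record: slo:checkout_availability:burn_rate1h
  expr: |
    (1 - sli:checkout_availability:ratio_rate1h) / (1 - 0.999)
```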
On-call that doesn't burn people out
Observability fails the moment your on-call engineer can't answer "is this real, and what do I do?" in under 60 seconds. Here is the checklist we apply to every alert before it ships to PagerDuty / Opsgenie / Grafana OnCall (an example rule that passes it follows the list):
- [ ] Tied to an SLO or a hard safety constraint (no "CPU > 80%" alerts).
- [ ] Has a runbook URL in the alert payload.
- [ ] Includes a direct link to the relevant trace/dashboard with time range pre-filled.
- [ ] Has a defined owner team (not a person).
- [ ] Reviewed in the weekly on-call retro; flapping alerts are deleted or fixed.
- [ ] Severity matches business impact (SEV1 wakes people, SEV3 waits for morning).
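A sketch of an alert rule that satisfies the checklist, built on the burn-rate rule above; the URLs, team label and threshold are assumptions (a production setup would pair this fast-burn window with a longer one, as generated by Sloth or Pyrra):

```yaml
- alert: CheckoutAvailabilityFastBurn
  expr: slo:checkout_availability:burn_rate1h > 14.4
  for: 5m
  labels:
    severity: page               # SEV1: this one wakes someone up
    team: payments               # owner is a team, not a person
  annotations:
    summary: "Checkout is burning its 30-day error budget ~14x too fast"
    runbook_url: https://runbooks.example.com/checkout/availability            # hypothetical
    dashboard: https://grafana.example.com/d/checkout-slo?from=now-1h&to=now   # hypothetical
```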
Track two metrics about your on-call itself: alerts per shift (target: < 2) and % actionable (target: > 75%). If either drifts, you have a tuning problem, not a staffing problem.
Key takeaways
- Standardize on OpenTelemetry — single instrumentation, swappable backends, no more vendor tax.
- Pick SLIs that mirror user experience, not infrastructure health; let SLOs and burn-rate alerts drive paging.
- Error budgets are the contract between product velocity and reliability — make them visible and binding.
- Every alert needs an owner, a runbook and an SLO link, or it shouldn't exist.
- Measure your on-call quality (alerts/shift, % actionable) the same way you measure your services.