What SREs Need from Developers to Build Great Monitoring

Maki Dizon · 11/7/2025

When people think of Site Reliability Engineering (SRE), they often imagine dashboards, alerts, and uptime numbers. But the truth is, none of that is possible without good collaboration between developers and SREs.

Monitoring isn’t magic — it’s built on top of metrics that developers expose. If those metrics aren’t well-defined or instrumented, the SRE team can’t build meaningful Service Level Objectives (SLOs), error budgets, or alerts.

So what exactly does an SRE need from the developers or service owners to make this work? Let’s walk through the essentials.


🔹 1. Meaningful, Well-Instrumented Metrics

Everything starts with metrics.

Developers need to instrument their code to expose data that describes how the system behaves — ideally using a standard like Prometheus, OpenTelemetry, or StatsD.

The core categories:

  • Request metrics: requests_total, requests_failed_total
  • Latency metrics: request_duration_seconds_bucket
  • Resource metrics: CPU, memory, queue depth
  • Business metrics: checkout_success_total, payments_failed_total

Why this matters: SREs use these metrics to calculate SLIs (Service Level Indicators) — the mathematical basis for SLOs and error budgets.

💡 Tip: Use consistent naming conventions and labels like region, status, or endpoint. It makes querying and aggregating much easier.
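For example, a minimal sketch in Python using the prometheus_client library might look like this (the metric names, labels, endpoint, and port are illustrative choices, not requirements):

```python
from prometheus_client import Counter, Histogram, start_http_server
import random
import time

# Request counter, labelled so it can be aggregated by endpoint and status.
REQUESTS_TOTAL = Counter(
    "requests_total", "Total HTTP requests", ["endpoint", "status"]
)

# Latency histogram; pick buckets around the latencies you actually care about.
REQUEST_DURATION = Histogram(
    "request_duration_seconds", "Request latency in seconds",
    ["endpoint"], buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)

def handle_request(endpoint: str) -> None:
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.3))                 # simulated work
    status = "200" if random.random() > 0.05 else "500"   # simulated outcome
    REQUEST_DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)
    REQUESTS_TOTAL.labels(endpoint=endpoint, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics on :8000 for Prometheus to scrape
    while True:
        handle_request("/checkout")
```

The histogram above is what produces the `request_duration_seconds_bucket` series mentioned earlier, and the shared `endpoint`/`status` labels are what make aggregation across services straightforward.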


🔹 2. Clear Definitions of Success and Failure

SREs can’t guess what “bad” looks like.

They need a clear signal from the service owners:

  • Is an HTTP 500 a failure? (yes)
  • What about 404? (maybe not)
  • Is a response slower than 1 second a degradation? (depends on the product)

These definitions become the SLIs that SREs monitor.

Without them, you end up with false alarms or misleading dashboards — and no one trusts the data.
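One way to make these decisions explicit is to encode them as a small, reviewable rule. The thresholds below (500s count, 404s don't, a 1-second latency target) are illustrative and would need sign-off from the service owner:

```python
# Example of encoding "what counts as a failure" as an explicit, reviewable rule.
# The specific choices here are assumptions for illustration, not universal rules.

def is_sli_failure(status_code: int, latency_seconds: float) -> bool:
    if status_code >= 500:           # server errors always count against the SLI
        return True
    if status_code == 404:           # "not found" is usually the caller's problem
        return False
    return latency_seconds > 1.0     # slower than the agreed latency target
```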


🔹 3. Health Endpoints

Reliability isn’t just about code metrics. SREs also need endpoints they can probe from the outside, the way a load balancer, orchestrator, or black-box check would.

Every service should expose:

  • /healthz → Is the process running?
  • /readyz → Can it handle traffic?
  • /metrics → Exposes telemetry for Prometheus or other collectors to scrape

This allows monitoring systems to know when the service is truly “up,” not just “alive.”
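A minimal sketch of these endpoints, assuming a Flask app and the prometheus_client exposition helpers (the port and the readiness logic are placeholders):

```python
from flask import Flask, Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)
ready = False  # flip to True once caches are warm, connections are open, etc.

@app.route("/healthz")
def healthz():
    # Liveness: the process is up and able to answer.
    return "ok", 200

@app.route("/readyz")
def readyz():
    # Readiness: only return 200 once the service can actually serve traffic.
    return ("ready", 200) if ready else ("not ready", 503)

@app.route("/metrics")
def metrics():
    # Telemetry endpoint scraped by Prometheus or another collector.
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == "__main__":
    ready = True
    app.run(port=8080)
```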


🔹 4. Structured Logging and Tracing

Metrics tell you what’s happening. Logs and traces tell you why.

SREs rely on:

  • Structured logs (JSON format preferred)
  • Trace IDs in every request log
  • Integration with centralized tools like ELK, Loki, or Jaeger

This lets them trace a single user request through multiple microservices — invaluable during incident debugging.
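Here is a small standard-library sketch of structured JSON logging with a trace ID attached to each record. In a real service the trace ID would come from the incoming request context (for example a W3C traceparent header) rather than being generated locally:

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout-service")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Generated locally only for illustration; normally propagated from the request.
trace_id = uuid.uuid4().hex
log.info("payment accepted", extra={"trace_id": trace_id})
```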


🔹 5. Metadata and Ownership Information

Every alert must have a destination. SREs need to know who owns a service and where to reach them.

A good setup includes:

  • Service name, repo, and deployment environment
  • Team ownership info (Slack channel, email, or on-call rotation)
  • Runbooks for incident response

If the service pages the wrong team (or no one at all), uptime numbers stop mattering.
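One lightweight way to capture this is a machine-readable catalog entry that alerting and incident tooling can read. The field names and URLs below are purely illustrative placeholders:

```python
# Hypothetical service-catalog entry; in practice this usually lives in a
# catalog tool or a metadata file in the repo rather than in application code.
SERVICE_METADATA = {
    "service": "checkout",
    "repo": "https://github.com/example/checkout",        # placeholder URL
    "environment": "production",
    "owner_team": "payments",
    "slack_channel": "#payments-oncall",
    "oncall_rotation": "payments-primary",
    "runbook": "https://runbooks.example.com/checkout",   # placeholder URL
}
```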


🔹 6. Baseline Performance and Expectations

To create meaningful SLOs, SREs need historical data:

  • How often does the service fail now?
  • What’s the 95th percentile latency?
  • How stable is it under load?

That baseline helps propose achievable SLOs (e.g., “let’s move from 99.5% → 99.9% availability”) instead of arbitrary numbers that no one can meet.
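A quick sketch of how that baseline might be computed from historical samples (the numbers are made up for illustration):

```python
import statistics

# Historical latency samples in seconds (fabricated example data).
latency_samples = [0.12, 0.18, 0.22, 0.25, 0.31, 0.45, 0.52, 0.61, 0.74, 1.20]

# statistics.quantiles with n=100 returns 99 cut points; index 94 is the p95.
p95 = statistics.quantiles(latency_samples, n=100)[94]

error_count, total_count = 37, 10_000            # made-up historical counts
availability = 1 - error_count / total_count     # current success ratio

print(f"p95 latency: {p95:.2f}s, availability: {availability:.4%}")
```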


🔹 7. Deployment and Version Metadata

When reliability changes, you need to know what changed.

SREs rely on deployment metadata:

  • Version numbers in metrics (e.g., build_version)
  • Deployment markers in Grafana dashboards
  • CI/CD events published to monitoring (e.g., via webhooks)

That way, when the burn rate spikes, you can immediately see: “Oh, this started right after version 3.4.2 rolled out.”
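For example, with prometheus_client you can expose the running build as an Info metric that dashboards can overlay on deploys. The environment variable names below are assumptions about what your CI/CD pipeline injects:

```python
import os
from prometheus_client import Info

# Exported as build_info{version="...", git_sha="..."} so dashboards can
# correlate reliability changes with specific releases.
build_info = Info("build", "Build and version information for this service")
build_info.info({
    "version": os.environ.get("BUILD_VERSION", "unknown"),  # assumed env var
    "git_sha": os.environ.get("GIT_SHA", "unknown"),        # assumed env var
})
```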


🔹 8. Alert Routing and Incident Hooks

When SLOs are breached, alerts must flow to the right place.

That means SREs need:

  • PagerDuty or Opsgenie integration
  • Slack alert channels
  • Clear escalation paths
  • Incident templates for quick triage

Without this plumbing, even the best monitoring doesn’t help — it just shouts into the void.
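As a sketch of that plumbing, here is a tiny webhook bridge that accepts an Alertmanager-style payload and forwards a summary to a Slack incoming webhook. The URL is a placeholder, and real routing would normally live in Alertmanager, PagerDuty, or Opsgenie configuration rather than custom code:

```python
from flask import Flask, request
import requests

app = Flask(__name__)

# Placeholder: a real deployment would load this from a secret store.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

@app.route("/alerts", methods=["POST"])
def alerts():
    # Alertmanager-style webhook payload: {"status": ..., "alerts": [...]}
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        text = f"{payload.get('status')}: {name} - {summary}"
        requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
    return "ok", 200

if __name__ == "__main__":
    app.run(port=9000)
```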


🔹 9. Capacity and Dependency Information

Modern systems rarely run alone. SREs need to understand dependencies — both upstream and downstream.

  • Which services call yours?
  • Which databases or APIs do you rely on?
  • What are the scaling or throughput limits?

This helps build dependency-aware alerting, so you can tell whether a problem originates in your code or an external service.
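One practical pattern is to label outbound-call metrics by dependency, so dashboards and alerts can separate local failures from upstream ones. The metric and dependency names here are illustrative:

```python
import time
from prometheus_client import Counter, Histogram

# Track outbound calls per dependency so alerts can distinguish "our code is
# failing" from "a service we depend on is failing".
DEPENDENCY_CALLS = Counter(
    "dependency_requests_total", "Outbound calls to dependencies",
    ["dependency", "outcome"],
)
DEPENDENCY_LATENCY = Histogram(
    "dependency_request_duration_seconds", "Outbound call latency",
    ["dependency"],
)

def call_dependency(name: str, fn):
    """Wrap an outbound call and record outcome and latency per dependency."""
    start = time.perf_counter()
    try:
        result = fn()
        DEPENDENCY_CALLS.labels(dependency=name, outcome="success").inc()
        return result
    except Exception:
        DEPENDENCY_CALLS.labels(dependency=name, outcome="error").inc()
        raise
    finally:
        DEPENDENCY_LATENCY.labels(dependency=name).observe(time.perf_counter() - start)
```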


🔹 10. Ongoing Collaboration and Review

Monitoring is not a “set it and forget it” process.

As systems evolve, so must the metrics, dashboards, and SLOs. SREs and developers should review:

  • Are SLOs still relevant to user experience?
  • Are alerts actionable?
  • Are we burning too much or too little error budget?

A quarterly or monthly SLO review keeps reliability aligned with product goals.
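The error-budget part of that review is simple arithmetic. A sketch with made-up numbers for a 99.9% availability objective over a 30-day window:

```python
# Error-budget arithmetic for a 30-day window, using fabricated numbers.
slo_target = 0.999                  # 99.9% availability objective
total_requests = 2_000_000
failed_requests = 1_400

error_budget = (1 - slo_target) * total_requests    # allowed failures: 2000
budget_consumed = failed_requests / error_budget    # fraction of budget spent

print(f"Error budget consumed so far: {budget_consumed:.0%}")   # 70%
```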


🔹 The Takeaway

The SRE’s job is to turn telemetry into reliability — but they can only do that if the telemetry exists and makes sense.

| Developers Provide | SREs Build |
| --- | --- |
| Metrics & health endpoints | SLIs, SLOs, error budgets |
| Success/failure definitions | Alerts and dashboards |
| Ownership metadata | Incident routing |
| Logs and traces | Root-cause analysis tools |

When both sides collaborate, the result is powerful:

  • Metrics tell you what is happening.
  • SLOs tell you how good it needs to be.
  • Error budgets tell you when to act.

That’s the foundation of reliability at scale.

