SLOs, SLIs, and Error Budgets: The Core of SRE

In modern software operations, "100% uptime" is a myth. Every system eventually fails. Site Reliability Engineering (SRE) accepts this reality and provides a mathematical framework for managing it.

This guide explores the three pillars of SRE reliability: SLIs, SLOs, and Error Budgets.

For a comparison of how SRE differs from DevOps and Platform Engineering, see DevOps, SRE, and Platform Engineering: A Comparative Guide.


1. Service Level Indicators (SLI)

An SLI is a quantitative measure of some aspect of the level of service that is provided. It answers the question: "How is the service performing right now?"

Common SLIs include:

  • Availability: The percentage of time the service is usable (e.g., successful HTTP 2xx/3xx responses).
  • Latency: The time it takes to service a request (usually measured at the P50, P95, or P99 percentile).
  • Throughput: The number of requests processed per second.
  • Error Rate: The percentage of requests that fail.
  • Freshness: (For data pipelines) How recently the data was updated.

The SLI Equation

Most SLIs are expressed as a ratio:

\[\text{SLI} = \frac{\text{Good Events}}{\text{Valid Events}} \times 100\]
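
As a worked sketch, the ratio can be computed directly from event counts (the function name and figures below are illustrative):

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage of good events among valid events."""
    if valid_events == 0:
        # No valid events in the window: treat the service as fully compliant.
        return 100.0
    return 100 * good_events / valid_events

# 999,500 successful responses out of 1,000,000 valid requests:
print(sli(999_500, 1_000_000))  # → 99.95
```

Note the denominator is *valid* events, not all events: requests you choose to exclude (e.g., malformed client input) do not count against the service.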


2. Service Level Objectives (SLO)

An SLO is a target value or range of values for a service level that is measured by an SLI. It answers the question: "How good do we want the service to be?"

While an SLI is a measurement, an SLO is a goal.

Examples:

  • Availability SLO: 99.9% of requests over a rolling 30-day window should return a 200 OK status.
  • Latency SLO: 90% of requests over a rolling 30-day window should complete in less than 200ms.
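
A minimal sketch of checking a latency SLO like the one above against a window of observed latencies (the helper and the sample values are illustrative, not a standard API):

```python
def meets_latency_slo(latencies_ms: list[float],
                      threshold_ms: float = 200.0,
                      target_pct: float = 90.0) -> bool:
    """True if at least target_pct of requests completed under threshold_ms."""
    if not latencies_ms:
        return True  # nothing measured, nothing violated
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return 100 * good / len(latencies_ms) >= target_pct

window = [120, 95, 340, 180, 60, 210, 150, 170, 90, 110]
print(meets_latency_slo(window))  # 8 of 10 under 200 ms → 80% < 90%, so False
```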

Why not 100%?

Setting an SLO of 100% is usually a mistake because:

  1. Users don't notice the difference: Your users' own internet connection likely has lower than 100% availability.
  2. Cost increases exponentially: The jump from 99.9% to 99.99% reliability often costs 10x more in engineering time and infrastructure.
  3. It kills innovation: If you are not allowed to fail, you are not allowed to deploy new features.


3. The Error Budget

The Error Budget is the most powerful concept in SRE. It is derived directly from the SLO and represents the amount of "unreliability" you are allowed to have.

\[\text{Error Budget} = 100\% - \text{SLO}\]

If your SLO is 99.9%, your Error Budget is 0.1%.
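
That percentage translates into concrete downtime. A small sketch, assuming a 30-day window and full outages (every request during downtime fails):

```python
def allowed_downtime_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of total downtime the error budget permits in the window."""
    error_budget = (100.0 - slo_pct) / 100.0
    return error_budget * window_days * 24 * 60

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% SLO → {allowed_downtime_minutes(slo):.1f} min per 30 days")
```

A 99.9% SLO over 30 days permits roughly 43 minutes of total downtime; each additional "nine" divides that allowance by ten.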

How to use the Error Budget:

The Error Budget acts as a neutral arbiter between Development (who want speed) and Operations (who want stability).

  • If the budget is full: The team can move fast, deploy risky features, and experiment.
  • If the budget is nearly empty: The team must slow down, focus on reliability features, and improve testing.
  • If the budget is spent: All non-emergency changes are frozen until the budget recovers (or the window rolls over).
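
These three states can be sketched as a simple policy function; the 25% threshold below is an illustrative choice for "nearly empty", not a standard:

```python
def budget_policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget to a release posture (thresholds illustrative)."""
    if budget_remaining_pct <= 0:
        return "freeze"     # only emergency changes until the window recovers
    if budget_remaining_pct < 25:
        return "stabilize"  # prioritize reliability work, slow the release train
    return "ship"           # budget is healthy: deploy and experiment freely

print(budget_policy(80))   # ship
print(budget_policy(10))   # stabilize
print(budget_policy(-5))   # freeze
```

The value of encoding the policy is that the decision becomes mechanical rather than a negotiation during every release.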

4. Connecting the Frameworks

Reliability management doesn't happen in a vacuum. It interacts with all other OpsAtScale modules:

  • DORA Metrics: Your SLOs directly influence your Change Failure Rate and Failed Deployment Recovery Time. See DORA and SPACE Metrics.
  • Observability: You cannot measure SLIs without a mature observability stack. See Observability Quality Metrics.
  • Incident Management: When an SLO is breached, it triggers a formal incident and a Root Cause Analysis / Postmortem to ensure the Error Budget was spent for a good reason.

5. Practical Implementation Checklist

Phase 1: Exploration

  • Identify your critical user journeys (e.g., "User can checkout", "User can search").
  • Choose SLIs that accurately reflect the health of those journeys.
  • Define what a "Good Event" and a "Valid Event" look like.

Phase 2: Definition

  • Set attainable SLOs based on historical data.
  • Define the compliance window (e.g., rolling 7 days or 30 days).
  • Document these in an SLO Agreement between Product and Engineering.

Phase 3: Automation

  • Create SLO Dashboards in your observability tool (e.g., Prometheus, Grafana).
  • Set up Error Budget Alerts (alerting when the budget is burning too fast).
  • Automate the reporting of these metrics into your OKRs.
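
Burn-rate alerting compares how fast the budget is being consumed against the rate that would exhaust it exactly at the end of the window. A sketch (the paging threshold is an example; Google's SRE Workbook suggests paging around a 14x burn measured over one hour):

```python
def burn_rate(errors: int, requests: int, slo_pct: float) -> float:
    """How many times faster than sustainable the budget is burning.
    A rate of 1.0 exhausts the budget exactly at the end of the window."""
    observed_error_rate = errors / requests
    budget = (100.0 - slo_pct) / 100.0
    return observed_error_rate / budget

# 50 errors in 10,000 requests against a 99.9% SLO:
rate = burn_rate(50, 10_000, 99.9)
print(f"burn rate: {rate:.1f}x")  # 0.5% errors vs 0.1% budget → 5.0x
```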

Phase 4: SRE Maturity

  • Implement Error Budget Policy: Agree on what happens when the budget is zero.
  • SLOs-as-Code: Store your SLO definitions in Git using tools like OpenSLO.
  • Conduct SRE Toil Reviews to periodically automate away manual work discovered via SLO breaches.

Summary Table

Concept        Question                 Example
------------   ----------------------   --------------------
SLI            What do we measure?      HTTP Success Rate
SLO            What is the goal?        99.9% Success Rate
Error Budget   How much can we fail?    0.1% failure allowed

By adopting this framework, organizations shift from "feeling" that the system is broken to "knowing" based on data, and from "guessing" how fast to move to "balancing" based on a formal agreement.