OBSERVABILITY PLATFORM
Design SLOs

Design service level objectives

Effective service level objectives (SLOs) can take time and research to create. While Observability Platform’s SLOs are built around industry best practices, you can tailor your SLOs to make them more effective alerting and observation tools for your individual services.

Design user-focused indicators

SLOs should measure availability of your services from your users’ perspective. Design your SLOs to identify when services are falling short of your users’ needs.

Availability isn’t always a binary state of up or down. Define your service level indicators (SLIs) with your users’ experience in mind. Slow responses, non-blocking errors, or unexpected results can represent a lack of service availability from your users’ perspective, even if your service is technically available and responsive.
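
To make this concrete, the sketch below (Python, with hypothetical field names and a hypothetical 300 ms latency threshold) classifies a request as “good” only when it both succeeds and responds quickly, so slow-but-successful responses still count against availability. It illustrates the classification idea only, not how Observability Platform evaluates SLI queries.

    # Hypothetical sketch: a user-focused SLI counts slow or failing responses
    # as unavailable, even though the service technically responded.
    from dataclasses import dataclass

    @dataclass
    class Request:
        status_code: int    # status returned to the user
        latency_ms: float   # latency the user actually experienced

    def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
        """A request counts toward availability only if it succeeded and was fast."""
        return req.status_code < 500 and req.latency_ms <= latency_threshold_ms

    def availability_sli(requests: list[Request]) -> float:
        """Availability SLI = good events / total events."""
        if not requests:
            return 1.0
        return sum(is_good(r) for r in requests) / len(requests)

    # Two of these four requests were slow or failed, so the SLI is 50%,
    # even though three of them returned a successful status code.
    sample = [Request(200, 120), Request(200, 950), Request(503, 80), Request(200, 40)]
    print(f"{availability_sli(sample):.0%}")  # 50%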

Observability Platform SLOs are dynamic and can provide multiple error budgets from a single query. Leverage these features when designing your indicators to create low-maintenance SLOs that are also focused on metrics relevant to your users’ experience.

Some services might require tracking multiple SLI definitions, such as tracking both latency and availability. In such situations, each SLI should have its own SLO page with its own burn rate and alerting configuration.

Set a reasonable objective

While it might seem ideal to set a perfect target of 100% availability as your objective, SLOs are most effective when they recognize that issues are inevitable. Instead of aiming for perfection, define your objectives around your users’ tolerance for failures to meet their expectations.

This tolerance is inherently subjective. Beyond minimums set in legal agreements and SLAs, you can iterate on your SLOs based on user feedback and research, your development pace, and your ability to absorb risk.

Likewise, the error budgets created by your objectives help you define the amount of risk you’re willing to accept for a given service. This in turn helps you plan risky actions, such as potentially disruptive deployments, around your users’ tolerance for downtime.

If possible, define your objective based on historical performance to ensure your targets are realistic, and also to minimize on-call burdens on your responders.
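
For example, you might derive a candidate objective from recent history. The sketch below is a minimal illustration in Python; the event counts, the 30-day windows, and the margin subtracted from the worst window are all assumptions, and you would substitute figures exported from your own telemetry.

    # Hypothetical sketch: derive a realistic starting objective from history.
    # Each entry is good/total event counts for a past 30-day window.
    historical_windows = [
        {"good": 9_951_000, "total": 10_000_000},   # 99.51%
        {"good": 9_963_000, "total": 10_000_000},   # 99.63%
        {"good": 9_940_000, "total": 10_000_000},   # 99.40%
    ]

    achieved = [w["good"] / w["total"] for w in historical_windows]
    worst = min(achieved)

    # Start slightly below the worst recent window so the target is attainable
    # and responders are not paged for behavior users already tolerate.
    candidate_objective = worst - 0.001
    print(f"worst recent availability: {worst:.2%}")                # 99.40%
    print(f"candidate objective:       {candidate_objective:.2%}")  # 99.30%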

Use template variables to reduce query maintenance

If you write your own SLO query, use template variables to refer to your time window ({{.Window}}), dimensions ({{.GroupBy}}), and label filters ({{.AdditionalFilters}}). These variables automatically keep your query aligned with changes to the SLO, and they also facilitate configuration as code by single-sourcing their definitions.
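
The sketch below illustrates that single-sourcing benefit. The query body, label names, and substitution code are assumptions for illustration; Observability Platform performs the substitution for you when it evaluates the SLO.

    # Hypothetical sketch: one query template serves every window and filter the
    # SLO needs, so changing the SLO does not require editing the query text.
    QUERY_TEMPLATE = (
        'sum by ({{.GroupBy}}) ('
        'rate(http_requests_total{code=~"5..", {{.AdditionalFilters}}}[{{.Window}}])'
        ')'
    )

    def render(template: str, window: str, group_by: str, filters: str) -> str:
        """Mimic the platform's substitution of template variables."""
        return (template
                .replace("{{.Window}}", window)
                .replace("{{.GroupBy}}", group_by)
                .replace("{{.AdditionalFilters}}", filters))

    # The same single-sourced template covers a short burn-rate window
    # and the SLO's full time window.
    print(render(QUERY_TEMPLATE, "5m", "service", 'env="prod"'))
    print(render(QUERY_TEMPLATE, "30d", "service", 'env="prod"'))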

Tune time window and burn rate definitions

Observability Platform uses opinionated default time windows and multi-window burn rates, all based on industry best practices.

If you intend to change time window and burn rate definitions, ensure that they remain realistic and stay mindful of the alerting noise that might result from changes.

When redefining time windows and burn rates, consider the following:

  • Prioritize alerts: Ensure that alerts are prioritized by severity so that higher burn rates trigger more urgent action than long-term trends or slower burn rates.
  • Mind services’ traffic volume: Services with lower traffic levels might have inconsistent or spiky error rates that cause false positives. Use windows with longer time frames and more conservative burn rates to reduce noise in your alerts, as the sketch after this list illustrates.
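
To see why low traffic makes short windows noisy, consider the arithmetic below. The 99.9% objective, the 14.4 burn-rate threshold (a common fast-burn value in industry practice, not necessarily the platform’s default), and the traffic figures are all illustrative.

    # Hypothetical sketch: why short windows are noisy for low-traffic services.
    objective = 0.999                   # 99.9% availability target
    error_budget_rate = 1 - objective   # 0.1% allowed error rate
    burn_rate_threshold = 14.4          # illustrative fast-burn alert threshold

    # The alert fires once the observed error rate exceeds this fraction:
    alert_error_rate = burn_rate_threshold * error_budget_rate   # 1.44%

    for requests_in_window in (20, 200, 2_000):
        single_failure_rate = 1 / requests_in_window
        fires = single_failure_rate > alert_error_rate
        print(f"{requests_in_window:>5} requests per window: one failure = "
              f"{single_failure_rate:.2%} -> {'fires the alert' if fires else 'stays below threshold'}")

    # With 20 requests in the window, a single failed request (5%) already
    # exceeds the 1.44% threshold, so longer windows or gentler burn rates
    # are needed to avoid paging on noise.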

Design SLOs for rapid response to issues

An SLO measures your SLI, such as your service’s error rate, across a defined time window to determine whether your service achieves its objective. The SLO also provides tools that help responders protect your service from breaching that objective.

Burn rates measure your error rates over time windows as small as several minutes, rather than days or weeks. Burn rate alerts fire on the premise that, if a high error rate over a short time span continues unabated, your SLO will breach its objective before the end of its time window.
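
As a worked example of that extrapolation (the objective, window, and error rate below are assumptions, not platform defaults): a burn rate of 1 spends the error budget exactly over the full time window, so a sustained burn rate of B exhausts the budget in roughly the window length divided by B.

    # Hypothetical sketch: extrapolate time-to-exhaustion from a burn rate.
    objective = 0.999                   # 99.9% over a 30-day window (illustrative)
    window_hours = 30 * 24              # 720 hours
    error_budget_rate = 1 - objective   # 0.1% allowed long-run error rate

    observed_error_rate = 0.0144        # measured over a short window (1.44%)
    burn_rate = observed_error_rate / error_budget_rate   # 14.4x

    hours_to_exhaustion = window_hours / burn_rate
    print(f"burn rate: {burn_rate:.1f}x")
    print(f"budget exhausted in ~{hours_to_exhaustion:.0f} hours if this rate continues")
    # A 14.4x burn rate empties a 720-hour budget in about 50 hours, long
    # before the 30-day window ends.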

Observability Platform’s defaults provide multiple burn rates. SLOs provide measurements across multiple windows per burn rate to reduce false positives. By setting burn rate alerts, your SLO can identify and alert responders when a service rapidly experiences more errors or downtime than expected. Your responders can then intervene long before the error budget is exhausted.
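
The multi-window idea can be sketched as a simple check: fire only when both a longer window (showing the burn is sustained) and a shorter window (showing it is still happening) exceed the threshold. The threshold and the function below are illustrative assumptions, not the platform’s implementation.

    # Hypothetical sketch: a multi-window burn-rate check reduces false positives.
    def should_alert(long_window_burn: float, short_window_burn: float,
                     threshold: float = 14.4) -> bool:
        """Fire only if the longer window shows sustained burn AND the shorter
        window confirms the problem is still happening right now."""
        return long_window_burn > threshold and short_window_burn > threshold

    print(should_alert(long_window_burn=16.0, short_window_burn=18.0))  # True: ongoing incident
    print(should_alert(long_window_burn=16.0, short_window_burn=0.5))   # False: already recovered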

Design SLOs for risk management

You can also use SLOs in risk management and planning. Error budgets are designed to be spent, and you can use them to plan upcoming deployments that you know might deplete them.

For example, downtime from planned deployments and maintenance activities is part of your error budget, and burn rate alerting can help you identify and react when such planned actions have unexpected user-facing results.

Consider your error budget separately from your SLO objective. If you set a 99% objective, consider your 1% error budget as its own amount of capacity that you can spend on risky deployment or maintenance actions. Burn rates measure consumption of your error budget rather than your total objective because they extrapolate how much capacity you can sacrifice before your service breaches its objective.
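
For example (illustrative figures, simplified by treating the error budget as hours of full downtime): a 99% objective over a 30-day window leaves roughly 7.2 hours of budget, and planned work spends part of that capacity.

    # Hypothetical sketch: treat the error budget as downtime capacity to spend.
    # Simplification: assumes the budget is consumed only by full downtime.
    objective = 0.99                    # 99% availability over the window
    window_hours = 30 * 24              # 30-day window

    budget_hours = (1 - objective) * window_hours   # 7.2 hours of capacity
    planned_maintenance_hours = 2.0                 # a deployment you plan to make
    incident_hours_so_far = 1.5                     # already spent this window

    remaining_hours = budget_hours - planned_maintenance_hours - incident_hours_so_far
    print(f"total budget:     {budget_hours:.1f} h")      # 7.2 h
    print(f"remaining budget: {remaining_hours:.1f} h for further risk this window")  # 3.7 h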

Burn rate alerts help responders react to issues as they happen, and also help identify how much downtime your users can tolerate for the rest of your time window.

An incident with a high burn rate leaves less error budget for the rest of your time window, which affects how you allocate the remainder. Conversely, reducing the downtime of risky actions gives you more budget to work with for more frequent or riskier actions within your time window.

Also use burn rate alerts to notify stakeholders who determine deployment schedules, and use the visualizations on an SLO’s page for historical context when planning deployments for future time windows.

Create effective SLO alerts

For managing and responding to degraded service performance and outages, SLOs provide significant benefits compared to other alerting practices:

  • User-centric measurement: SLOs focus on visualizing and reporting on symptoms rather than causes, which concentrates coverage on issues actively affecting your services and reduces false positives.
  • Standardized operational practices: The standardized features and presentation of SLOs facilitate normalized alerts, dashboards, and operational reviews across your organization to improve consistency in team transitions and on-call rotations.
  • Data-driven decision making: By measuring error budgets against availability targets, SLOs provide objective data toward balancing investments in a service’s reliability against new feature development. This allows for more consistent risk management while you iterate on the service’s implementation.

While defining your SLO, use the SLO tab in the SLO preview drawer to simulate alerts. This tab uses real data to project where your SLO would have fired alerts, and you can update those simulations after tuning your objective and burn rates.
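
The same kind of projection can be reasoned about outside the platform. The sketch below is a simplified, hypothetical backtest: it slides a short window over historical per-minute error rates and marks where a fast-burn alert would have fired. The threshold, window length, and data are assumptions, and this is not how Observability Platform implements its simulation.

    # Hypothetical sketch: backtest where a fast-burn alert would have fired.
    objective = 0.999
    error_budget_rate = 1 - objective
    burn_rate_threshold = 14.4          # illustrative fast-burn threshold
    window_minutes = 5                  # illustrative short alert window

    # Per-minute error rates exported from historical data (illustrative values).
    error_rates = [0.000, 0.001, 0.020, 0.030, 0.025, 0.002, 0.000, 0.000]

    for minute in range(window_minutes, len(error_rates) + 1):
        window = error_rates[minute - window_minutes:minute]
        burn_rate = (sum(window) / window_minutes) / error_budget_rate
        if burn_rate > burn_rate_threshold:
            print(f"minute {minute}: burn rate {burn_rate:.1f}x -> alert would have fired")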

Avoid high-impact alerts on new SLOs

New SLOs often require some iteration and tuning to become effective alerting tools. Even well-designed objectives and alerts can still fire too quickly or too often at first.

For new SLOs, create alerts with a trial period of a few weeks. Use lower-impact notification policies during this period to avoid recurring alerts, and use this period to tune your SLO’s objective, burn rates, and alerting settings.

Once you’ve ensured that the SLO alerts your responders only when necessary, switch your SLO to a higher-impact notification policy.

Use SLOs with other Observability Platform features

In addition to alerts, Observability Platform SLOs integrate with other features that help you identify, analyze, and investigate issues.

  • Use Differential Diagnosis (DDx) for metrics from SLO visualization panels to help identify the source of spikes or other unusual shapes.
  • Connect SLOs to services to include each SLO’s status alongside other monitors when depicting the service’s health. This can draw responders’ attention to SLOs when viewing a service page.

Further reading

SLOs are a complex subject, and resources from across the observability industry can help you better understand them and improve your SLO designs.