Service level objectives

This feature is available only to specific Chronosphere Observability Platform users, and has not been announced or officially released. Do not share or discuss this feature, or information about it, with anyone outside of your organization.

It’s critical to understand your services’ performance and reliability to ensure those services are meeting your business’s needs. To do so, you can define and communicate your expectations for a service’s performance.

In many cases, it’s sufficient to define monitors that measure metrics about a service against a fixed threshold, and to alert those responsible for the service when that threshold is exceeded. However, in complex and dynamic systems, monitors can be falsely triggered by inconsistent data, or can fail to alert on gradual issues that build up over time.

Service level objectives (SLOs) instead focus on longer-term performance reporting and changes in error rates. In an SLO, you define these elements:

  • A percentage objective that represents your goal for uptime or error-free operation. For example, your objective for a service might be to maintain 99.95% uptime.
  • An error budget, or your tolerance for downtime or errors. This is the inverse of your objective, because it represents the service capacity that can be lost before the service fails its objective. Likewise, changes to your objective also change your error budget. For example, a 99.95% uptime objective also defines a 0.05% error budget.
  • Metrics queries that define your performance against that objective. For example, you might query for the summed duration of error responses that your service returned to requests, and compare that to the total time the service was running.
  • A time window that you’ll measure for performance against your objective. Your SLO measures your error or success rates against the total over the time window, such as the last four weeks, to determine whether the service met the objective.
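
The relationship between the objective, the error budget, and the time window can be worked through with a short Python sketch. The function name is hypothetical, and the calculation assumes a time-based (uptime) objective measured over a four-week window. It is an illustration of the arithmetic, not a description of how Observability Platform computes these values.

```python
from datetime import timedelta


def error_budget(objective_percent: float, window: timedelta) -> tuple[float, timedelta]:
    """Return (error budget as a percentage, downtime allowed within the window).

    Hypothetical helper: assumes the objective is an uptime percentage.
    """
    if not 0 < objective_percent <= 100:
        raise ValueError("objective_percent must be in (0, 100]")
    # The error budget is whatever the objective leaves over.
    budget_percent = 100.0 - objective_percent
    # Scale the window by the budget to get the tolerated downtime.
    allowed_downtime = window * (budget_percent / 100.0)
    return budget_percent, allowed_downtime


budget, downtime = error_budget(99.95, timedelta(weeks=4))
print(f"error budget: {budget:.2f}%")                                   # error budget: 0.05%
print(f"allowed downtime: {downtime.total_seconds() / 60:.2f} minutes")  # allowed downtime: 20.16 minutes
```

Under these assumptions, a 99.95% objective over four weeks tolerates roughly 20 minutes of downtime. Tightening the objective shrinks that allowance proportionally.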

Chronosphere Observability Platform lets you create SLOs you can use to define objectives, set queries, and measure and visualize performance over a time window. Chronosphere SLOs use a rolling time window to report a service’s status.

After you’ve defined your SLO, it then identifies your error or success rate against the total over the time window. If the service met your objective in the time window, the SLO reports the service is healthy. If not, the SLO reports its objective was breached and visualizes the state of metrics before the breach for investigation.

Differences from monitors

Although SLOs seem similar to monitors, SLOs provide a more dynamic incident detection method that lets you trigger alerts based on changes in real user experiences, rather than at an arbitrary threshold.

SLOs also provide additional details for more granular notifications:

  • SLOs report the burn rate of your error budget, which you can configure to raise alerts when the service is depleting its error budget over a short period within your time window. Burn rate alerts can help you respond to sudden degradations of performance before they breach your SLO, and burn rate visualizations can identify patterns in error rates that might not be as evident when looking only at the service’s metrics.

    For example, a burn rate alert can trigger notifications if more than 2% of your error budget is consumed over a one-hour span. You can then respond closer to the beginning of the incident and attempt to prevent the SLO from breaching by investigating the problem and finding a solution. Such a spike in burn rate will also be displayed on the SLO’s burn rate chart, which can help you pinpoint when the service degradation started.

  • You can define label-based dimensions to break down your SLO’s measurement by time series. This helps you respond to complex services represented by multiple time series by letting you signal for specific series that breach the SLO.

  • You can perform differential diagnosis (DDx) on your SLO’s charts to begin correlating concerning patterns in error rates.
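
The burn rate example in the list above (more than 2% of the error budget consumed over one hour) can be sketched in Python. This sketch assumes the burn rate definition commonly used in SRE literature, where the burn rate is the observed error rate divided by the error rate the objective allows. It also assumes roughly constant traffic across the window. The function names and sample numbers are hypothetical, and the platform's own calculation might differ.

```python
def budget_consumed(errors: int, total: int, objective: float,
                    span_hours: float, window_hours: float) -> float:
    """Fraction of the whole window's error budget consumed during the span.

    Assumption: traffic is roughly constant across the SLO window.
    """
    if total == 0:
        return 0.0
    error_rate = errors / total
    allowed_error_rate = 1.0 - objective  # the error budget as a fraction
    # A burn rate of 1.0 spends the budget exactly at the end of the window.
    burn_rate = error_rate / allowed_error_rate
    return burn_rate * span_hours / window_hours


def should_alert(consumed: float, threshold: float = 0.02) -> bool:
    """Alert when more than `threshold` (2% by default) of the budget is consumed."""
    return consumed > threshold


FOUR_WEEKS_HOURS = 4 * 7 * 24  # 672 hours

# 120 errors in 10,000 requests during one hour, against a 99.95% objective.
spike = budget_consumed(120, 10_000, 0.9995, span_hours=1, window_hours=FOUR_WEEKS_HOURS)
# 50 errors in 10,000 requests during one hour.
mild = budget_consumed(50, 10_000, 0.9995, span_hours=1, window_hours=FOUR_WEEKS_HOURS)

print(f"spike consumed {spike:.2%} of the budget, alert={should_alert(spike)}")
print(f"mild consumed {mild:.2%} of the budget, alert={should_alert(mild)}")
```

In this sketch, the first hour consumes about 3.6% of the four-week budget and crosses the 2% threshold, and the second consumes about 1.5% and does not.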

SLO terminology

Service performance is generally defined in these ways:

  • Service level agreements (SLA): Contracts between a provider and a client that determine acceptable performance measurements, and the consequences for violating those measurements. An SLA defines the limits and consequences for failures.
  • Service level objectives (SLO): Usually internal targets for specific metrics that the provider aims to meet. These should be as specific as possible and are usually stricter than the SLA. For example, to ensure you meet an SLA to maintain 99.95% uptime or respond to an incident in less than two hours, you might define your SLO as meeting a standard of 99.999% uptime or responding to an incident within 60 minutes.
  • Service level indicators (SLI): The measurement being evaluated in an SLO or SLA, often as service uptime, availability, or response success rate. For example, if an SLO is to maintain 99.999% service uptime, the SLI is the service’s uptime metric.

For more information about industry-standard SLO terminology, see the Google SRE handbook.

SLO management

To create, update, or delete an SLO, see Create an SLO.

View overall SLO status

You can view a list of SLOs to identify if any are breaching their limits. You can also filter the list to narrow your view to specific keywords, team, or owner.

To view a list of existing SLOs:

  1. In the navigation menu, click Go to Admin.
  2. Select System Overview > SLOs.

To filter the SLO list, use one of these methods:

  • Enter text into the Search SLOs search field to filter by name.
  • Use the Select an owner dropdown to filter by the SLO’s owning collection or service.
  • Use the Select a team dropdown to filter by the SLO’s assigned team.

The SLO table contains the following information:

  • Status: The SLO’s health, summarized as a status icon. An SLO’s health is defined by the alerting status of its monitor or the state of the series measured by the SLO as compared to its error budget.

    Each status icon has one of these descriptions:

    • Has at least one series with an error rate that exceeds the defined critical conditions.
    • Has at least one series with an error rate that exceeds the defined warning conditions.
    • No series are alerting.
    • Alerting is muted.
    • Alerting is disabled.
    • No data is available.
  • Name: The SLO’s name.

  • Objective: The objective defined for this SLO.

  • Alerting Enabled: Whether or not alerting is enabled for this SLO.

  • Owner: The service or collection that owns this SLO.

  • Team: The team responsible for the Owner.

  • Source: This SLO’s creation method.

    Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.

The row for each SLO in the list also includes a three vertical dots icon that provides quick access to SLO creation, editing, and deletion. Click the icon to select one of the following options:

  • Duplicate: Click to open the SLO create drawer, populated with the information used to create the existing SLO. Configure the new SLO and then click Save to create the new SLO.
  • Edit: Click to update your SLO using the create drawer.
  • Delete: Delete the SLO.

View an SLO

Click the name of any SLO in the list to open its page, which is similar to a dashboard and visualizes important metrics related to one or more services.

SLO menu

An SLO page’s menu provides access to features that modify the SLO’s behavior:

  • Events: Click to open the Display events drawer. Select the checkboxes for the events you want to display, and then click Save.
  • Mute: Click to create a muting rule for this SLO. If a muting rule is already active for an SLO, a banner indicates the active muting rule and its expiration.

The menu also includes a link to the documentation and a three vertical dots icon.

Click the three vertical dots to select one of the following options:

  • Duplicate: Click to open the SLO create drawer, populated with the information used to create the existing SLO. Configure the new SLO and then click Save to create the new SLO.

  • Edit: Click to update your SLO using the create drawer.

  • Version history: Review previous versions of this SLO’s configuration.

    Click Version history to display a panel with two tabs:

    • Code config: Displays a code representation of the selected entity as of the time of the selected revision.
    • Code diff: Displays a Git-style diff of the most-recent change made to the entity, in Chronosphere API format. To compare the selected revision to another revision in the history, click the Compare With dropdown and select the timestamp of the revision that you want to compare.
      • Click Unified to see the changes stacked in a single column.
      • Click Split to see changes side by side.

    The top of the list of changes shows who made the most recent change and the method used to make it.

    To view a revision in the history, click any entry in the list of timestamped revisions. The timestamps default to your local timezone.

    You can view unchanged lines within the diff by clicking the Expand X lines links.

    The Version History view retains up to 500 revisions, or up to 15 months of revisions if there are fewer than 500 revisions.

The menu’s time range selector displays the current time range applied to the SLO’s visualizations and lets you define a new time range. You can also select the SLO’s time range by clicking and dragging across the time span you want to define in any of the SLO’s visualization charts.

SLO details

The SLO details section provides a high-level view of the SLO’s overall health, and indicates whether your SLO is meeting its objective or has breached its target.

  • Alerting status: The SLO’s status.
  • Availability target: The SLO’s currently defined objective.
  • Reporting status: If the SLO is firing alerts, or if its error budgets are depleted or low, Observability Platform displays additional indicators to summarize these major issues.

The following charts visualize performance against the SLO’s defined limits:

  • Availability: Availability results based on the SLI’s rate definition.
  • Error budget: The SLO’s remaining error budget over its defined time window.

As with all charts on the SLO view, hovering over one of these charts reveals a three vertical dots icon. Click the icon for options to open the chart’s query in Metrics Explorer, add the chart to a dashboard, or investigate it using Metrics DDx.

Reporting status

If an SLO is breached or close to being breached, the SLO page displays a Reporting status that’s otherwise hidden from view. This status contains chips for the SLO’s firing alerts, depleted error budgets, and error budgets that are close to depletion.

If the reporting status is visible on an SLO, you should immediately begin investigating the causes for the statuses it reports.

SLO alerting

If a low-error-rate SLO alert fires, the alert can continue to fire for up to the length of the configured long window, which can be hours after the issue that caused the alert is resolved. How long the alert keeps firing depends on the rate at which the error budget decreases.
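
The persistence described above can be illustrated with a simplified model. The Python sketch below assumes, without confirmation from this page, that an alert fires while the average error ratio over a trailing long window exceeds a threshold. The function name, the per-minute granularity, and all sample values are hypothetical.

```python
from collections import deque


def minutes_firing_after_recovery(long_window_min: int, incident_min: int,
                                  incident_error_ratio: float, threshold: float) -> int:
    """Count the minutes the modeled alert keeps firing after errors stop.

    Assumed model: the alert fires while the mean per-minute error ratio
    over the trailing long window is above `threshold`.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    # Start from a fully healthy window, then replay the incident.
    window = deque([0.0] * long_window_min, maxlen=long_window_min)
    for _ in range(incident_min):
        window.append(incident_error_ratio)

    minutes = 0
    # After recovery, every new minute is healthy (error ratio 0.0).
    while sum(window) / long_window_min > threshold:
        window.append(0.0)
        minutes += 1
    return minutes


# A 30-minute incident with 50% errors, a 6-hour long window,
# and a threshold of 0.5% errors averaged over that window.
lingering = minutes_firing_after_recovery(360, 30, 0.5, 0.005)
print(f"alert keeps firing for {lingering} minutes after recovery")  # 357 minutes
```

In this model the alert clears only when most of the incident has rolled out of the long window, so a short incident can keep the alert firing for nearly the full window length.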

Series

Use the Series subsection’s table to search for or select specific series to view in the Availability and Error budget charts. Each row represents a time series returned in the SLI’s query.

The table has the following columns:

  • Status: The SLO status for that series.
  • Columns for labels and values: Each column’s header is the name of a label in that series, and its cells contain that label’s value for that row’s series.
  • Actual: The metric’s value over the SLO’s defined time window.
  • Error budget: The SLO’s remaining error budget for that series. If the cell’s background is red, its value represents a breach of the SLO.
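
As a rough illustration of how per-series values like Actual and Error budget relate, the following Python sketch computes availability and remaining error budget for each series from error and total counts. The class, field names, label values, and counts are all hypothetical, and the calculation assumes a request-based SLI. A negative remaining budget corresponds to a breach for that series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesCounts:
    """Hypothetical per-series counts over the SLO's time window."""
    labels: tuple[tuple[str, str], ...]
    errors: int
    total: int


def actual_availability(series: SeriesCounts) -> float:
    """Fraction of successful requests; 1.0 when there was no traffic."""
    return 1.0 if series.total == 0 else 1.0 - series.errors / series.total


def remaining_budget(series: SeriesCounts, objective: float) -> float:
    """Fraction of the series' error budget still unspent (negative means breach)."""
    allowed_errors = (1.0 - objective) * series.total
    if allowed_errors == 0:
        return 0.0 if series.errors == 0 else float("-inf")
    return 1.0 - series.errors / allowed_errors


objective = 0.999
rows = [
    SeriesCounts((("endpoint", "/checkout"),), errors=40, total=100_000),
    SeriesCounts((("endpoint", "/refund"),), errors=150, total=100_000),
]
for row in rows:
    left = remaining_budget(row, objective)
    status = "breached" if left < 0 else "ok"
    print(dict(row.labels), f"actual={actual_availability(row):.4%}",
          f"budget_left={left:.0%}", status)
```

With these sample counts, the first series has about 60% of its budget left and the second has overspent its budget by about half.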

SLI breakdown

The SLI breakdown section consists of charts that visualize your service level indicators, which are based on the SLO’s definition. For more information, see Create service level objectives.

These charts include:

  • Total requests: A visualization of the SLI’s total query, representing the total requests to the service.
  • Errors by endpoint: A visualization of the SLI’s error or success query, broken out by individual endpoints.

Burn and error rates

The Burn/Error rates section consists of charts that visualize the error budget burn rate and the rate of reported errors. Burn rate calculations are based on the SLO’s definition. For more details, see Create service level objectives.

You can adjust the window used for visualizations, which can be 1h, 6h, 1d, or 3d.
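
To show why the chosen window changes what a burn rate chart reveals, the Python sketch below computes a burn rate over each of the 1h, 6h, 1d, and 3d windows from per-minute counts. It assumes the common definition of burn rate as the observed error rate divided by the error rate the objective allows. The function name and the sample data are hypothetical.

```python
# Window lengths in minutes, matching the selectable windows.
WINDOWS_MIN = {"1h": 60, "6h": 360, "1d": 1440, "3d": 4320}


def burn_rates(errors_per_min: list[int], totals_per_min: list[int],
               objective: float) -> dict[str, float]:
    """Burn rate over each trailing window, from per-minute counts (newest last)."""
    allowed_error_rate = 1.0 - objective
    rates = {}
    for name, minutes in WINDOWS_MIN.items():
        errors = sum(errors_per_min[-minutes:])
        total = sum(totals_per_min[-minutes:])
        # No traffic in the window means nothing was burned.
        rates[name] = (errors / total) / allowed_error_rate if total else 0.0
    return rates


# Three days of steady traffic with errors only in the most recent hour.
totals = [1000] * 4320
errors = [0] * (4320 - 60) + [5] * 60

for window, rate in burn_rates(errors, totals, objective=0.999).items():
    print(f"{window}: burn rate {rate:.2f}")
```

In this sample, the recent hour of errors produces a burn rate near 5 on the 1h window but well under 1 on the longer windows, which is why short windows surface sudden degradations and long windows smooth them out.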

Change events

This feature isn’t available to all Chronosphere Observability Platform users and might not be visible in your app. For information about enabling this feature in your environment, contact Chronosphere Support.

Change events are required for SLO history.

If this service uses change events, those events are graphed in this section. This includes events generated by this SLO and also events added by other features to connected services.

SLO information

The SLO information section provides a user-defined Description of the SLO and relevant Runbook links, as defined in the create drawer.

Related queries depend on features enabled in your tenant. In addition, the SLO must be owned by a service, not a collection. When clicked, the links open in a new tab and populate the page with a query based on the selected SLO. These links include the following:

Ownership

The Ownership section displays the SLO’s Owner, which is a service or collection. Its Policy links to the SLO’s selected notification policy.

Labels and annotations

Labels are key-value pairs that filter the SLO to specific telemetry. For example, you might have a service with a label of service and a value of payment-gateway. These values display sequentially.

Annotations are key-value pairs that provide additional information for events.

Service dashboards

Observability Platform generates a list of Service dashboards based on the dashboards attached to the service that owns the SLO.