Create service level objectives

This feature is available only to specific Chronosphere Observability Platform users, and has not been announced or officially released. Do not share or discuss this feature, or information about it, with anyone outside of your organization.

Create a service level objective (SLO) in Chronosphere Observability Platform.

When defining windows and burn rates, consider the following topics:

  • Prioritize alerts: Ensure that alerts are prioritized by severity so that higher burn rates trigger more urgent action than long-term trends or slower burn rates.
  • Traffic volume: Services with lower traffic levels might have inconsistent or spiky error rates, causing false positives. Use windows with longer time frames and more conservative burn rates to reduce noisy alerts.
  • Multiple service level indicators (SLIs): Some services might require tracking multiple SLI definitions (for example, tracking latency and availability). Each SLI should have its own SLO with an associated burn rate and alert configuration.
  • Historical data: If possible, define your objective based on historical performance to ensure your targets are realistic and to minimize on-call burden.
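
To illustrate why low traffic produces noisy alerts, the following minimal Python sketch computes a burn rate as the observed error ratio divided by the error budget (1 minus the objective), which is the common SRE definition. The function name, the request counts, and the 99.9% objective are illustrative assumptions, not values taken from Observability Platform.

```python
def burn_rate(errors: int, total: int, objective: float) -> float:
    """Return how fast the error budget is being spent.

    A burn rate of 1.0 spends the budget exactly over the reporting window.
    """
    if total == 0:
        return 0.0  # no traffic, so no budget is consumed
    error_budget = 1.0 - objective
    return (errors / total) / error_budget


OBJECTIVE = 0.999  # assumed example objective (99.9%)

# One failed request in a short window looks very different by traffic level.
low_traffic = burn_rate(errors=1, total=20, objective=OBJECTIVE)
high_traffic = burn_rate(errors=1, total=20_000, objective=OBJECTIVE)

print(f"low traffic burn rate:  {low_traffic:.2f}")
print(f"high traffic burn rate: {high_traffic:.2f}")
```

In this sketch, a single failure on the low-traffic service yields a burn rate near 50, compared with about 0.05 on the high-traffic service, which is why longer windows and more conservative burn rates help for low-volume services.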

Create a new SLO

To create a new SLO:

  1. In the navigation menu, click Go to Admin and then select System Overview > SLOs.

  2. On the SLO page, click Create SLO.

  3. In the SLO information section, complete these fields:

    • Name: The SLO name.
    • Owner: The service or collection that owns this SLO.
    • Description: User-defined text about this SLO. Use the description to tell other users what this SLO is for and which downstream users or systems may be impacted.
    • Runbooks: A name and URL for any runbooks used when this SLO triggers. This becomes a link on the SLO page.
  4. Add Alerting.

    Alerting is enabled by default. To stop alerting on this SLO, turn off the Alerting enabled toggle.

    • Select a Notification policy.

      • When using the Default Policy, this section displays the policy defined for the selected Owner.
      • When using Select Policy, you can choose a different policy than the default. The policy details display beneath the menu.
    • Add the Burn rate alert configuration. The burn rate configuration defines the criteria that trigger alerts. For example, the default burn rate definition includes:

      When 2% or more of your error budget is consumed over the last 1h (one hour) Long window, and the error rate is still high over the last 5m (five minute) Short window, a critical Severity alert fires. When the problem no longer exists over the last 5m (five minutes), the alert resolves. For a full explanation, see Multiwindow, Multi-Burn-Rate Alerts.

      You can add an optional Notification label to an alert.
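
The default condition described in this step can be sketched in Python as follows. This is a minimal sketch that assumes the approach described in Multiwindow, Multi-Burn-Rate Alerts, where the percentage of budget consumed over the long window is converted to a burn rate threshold and both windows must exceed it. The function names are hypothetical, and the way Observability Platform evaluates the condition internally might differ.

```python
def burn_rate_threshold(budget_consumed: float,
                        long_window_hours: float,
                        reporting_window_hours: float) -> float:
    """Burn rate that consumes `budget_consumed` of the budget in the long window."""
    return budget_consumed * reporting_window_hours / long_window_hours


def alert_fires(long_burn: float, short_burn: float, threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return long_burn >= threshold and short_burn >= threshold


REPORTING_WINDOW_HOURS = 4 * 7 * 24  # default 4w reporting window

# 2% of the budget consumed over a 1h long window.
threshold = burn_rate_threshold(0.02, 1, REPORTING_WINDOW_HOURS)
print(f"threshold burn rate: {threshold:.2f}")

# Problem is ongoing: both windows are burning fast, so the alert fires.
print(alert_fires(long_burn=20.0, short_burn=18.0, threshold=threshold))

# Problem has stopped: the short window recovered, so the alert resolves.
print(alert_fires(long_burn=20.0, short_burn=0.5, threshold=threshold))
```

Requiring the short window to agree with the long window is what lets the alert resolve soon after the problem stops, instead of waiting for the full hour to age out.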

  5. Create the SLO definition:

    • Add a percentage-based value for the Objective.
    • Add a Reporting window, which is the length of time for your SLO report. The default value is 4w.
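
To get a feel for how the Objective and Reporting window interact, this minimal Python sketch converts the two values into an error budget expressed as the equivalent duration of a full outage. It assumes the objective is a percentage such as 99.9 and that windows use a single-unit format like 4w. The parse_window and error_budget helpers are hypothetical and are not part of Observability Platform.

```python
import re
from datetime import timedelta

# Supported single-unit suffixes for this sketch only.
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_window(text: str) -> timedelta:
    """Parse a window such as '4w' or '1h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhdw])", text)
    if match is None:
        raise ValueError(f"unsupported window: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def error_budget(objective_percent: float, window: str) -> timedelta:
    """Return the share of the window that is allowed to be 'bad'."""
    bad_fraction = (100.0 - objective_percent) / 100.0
    return parse_window(window) * bad_fraction


budget = error_budget(99.9, "4w")
print(f"error budget: {budget.total_seconds() / 60:.1f} minutes")
```

With a 99.9 objective over the default 4w window, the budget works out to roughly 40 minutes, which gives a quick check on whether a target is realistic for the service.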
  6. Add a PromQL query:

    1. Choose a Query type:

      • Error
      • Success
    2. Enter an Error query and Total query.

      The following template variables are available. While optional, Chronosphere strongly suggests using these when applicable:

      • {{.Window}}: Use this variable in place of the time interval to dynamically assign the time interval value on the SLO details page. This placeholder is used to compute the optimal window size to fulfill this SLO based on the input reporting windows and burn rates. In practice, using the default values, it normally resolves to 5m.

        If your query has a rate, use {{.Window}}. Gauges can't use {{.Window}}.

      • {{.GroupBy}}: Use this variable in place of group by statements when you want a column for each label name defined in the dimensions section. This placeholder substitutes all the unique values in dimensions and signal groupings as a comma-separated list. It gives you a single place to define those values, which reduces the risk of mismatched queries. Observability Platform doesn't block you from managing the two lists without {{.GroupBy}}, but the lists should be identical in the errors and totals queries. Those lists should also match the lists in dimensions and signal groupings.

        If your query has a by (...) clause, you should use by ({{.GroupBy}}).

      • {{.AdditionalFilters}}: Use this variable in place of long lists of selectors in your SLO queries. This placeholder substitutes all the filters added in the Additional filters section. This lets both queries share a single list of filters, which helps when the list is long. {{.AdditionalFilters}} can also be useful when templating SLOs in configuration as code, because you can provide different values based on inputs without having to manipulate the query directly.

        Observability Platform doesn't block you from managing the two lists of selectors in your PromQL queries. However, if you add filters in the Additional filters section, the variable is expected to be used at least once.

        If your queries have a metric{...} where ... is identical, consider using metric{{{.AdditionalFilters}}}.

    For example:

      sum by ({{.GroupBy}}) (rate(metric[{{.Window}}]))

    When cluster and namespace are used as dimensions, and {{.Window}} resolves to 5m, the effective query is:

      sum by (cluster, namespace) (rate(metric[5m]))
  7. Refine your query using Dimensions, signals, and filters. Dimensions are used to generate a time series per combination of labels entered.

    • Toggle Alert by series to create alerts for each time series in the selected metric. Select the Use as signal checkbox to create a signal.
    • Add a Label name.
    • Add Label filters to reduce the number of metrics used by the SLO.

    The signal indicates which labels to alert on. For example, if the base query is sum by (cluster) (rate(metric_name{}[{{.Window}}])), you can add dimensions so that the effective query becomes sum by (cluster, namespace, instance) (rate(metric_name{}[{{.Window}}])), and then select only cluster and namespace as signals. You then get one alert for each cluster and namespace combination.
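
The relationship between dimensions and signals can be sketched in Python as a grouping step: every combination of dimension labels is its own time series, while alerts are grouped by the subset of labels selected as signals. The sample label values and the group_by_signals helper are hypothetical, and this sketch shows only the grouping idea, not how Observability Platform evaluates alerts.

```python
from collections import defaultdict


def group_by_signals(series: list[dict[str, str]],
                     signal_labels: list[str]) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Group time series by the values of their signal labels."""
    groups: dict[tuple[str, ...], list[dict[str, str]]] = defaultdict(list)
    for labels in series:
        key = tuple(labels[name] for name in signal_labels)
        groups[key].append(labels)
    return dict(groups)


# Dimensions: cluster, namespace, instance (one series per combination).
series = [
    {"cluster": "east", "namespace": "checkout", "instance": "a"},
    {"cluster": "east", "namespace": "checkout", "instance": "b"},
    {"cluster": "west", "namespace": "checkout", "instance": "a"},
]

# Signals: only cluster and namespace, so instances share one alert.
alert_groups = group_by_signals(series, ["cluster", "namespace"])
for key, members in alert_groups.items():
    print(key, "covers", len(members), "series")
```

Here three time series collapse into two alert groups, one per cluster and namespace combination, while the instance label remains available as a dimension for investigation.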

  8. Add any Labels and annotations, such as:

  9. Use the SLO preview section to review graphs for your queries and ensure the SLO definition meets the specifications you want.

    • The SLI tab displays graphs of Total requests and Errors over the selected time period.

    • The SLO tab shows service availability over the selected time period.

      Toggle Simulate alerts to backtest your condition against existing data. Any alerts that would have fired will show on the graph. The preview reflects existing signal grouping, dimensions, and burn rate configuration.

      Use the Show alert durations toggle to display the time period over which the alert would have been active.

  10. Click Save.

Edit or delete an SLO

In the navigation menu, click Go to Admin and then select System Overview > SLOs.

Click the three vertical dots for the SLO, and then select Edit to change it or Delete to remove it.

To delete an SLO from its SLO page, click Edit and then Delete SLO.