Service level objectives

This feature is available only to specific Chronosphere Observability Platform users, and has not been announced or officially released. Do not share or discuss this feature, or information about it, with anyone outside of your organization.

Understanding the performance and reliability of your services is critical to ensuring those services meet the needs of your business. As such, communicating expectations for service performance is an important part of many business operations. Additionally, traditional alerting for system health is based on fixed thresholds, which can trigger on inconsistent data or miss gradual issues that build up over time.

Service level objectives focus on longer-term performance reporting and changes in error rates. These objectives provide a more dynamic incident detection method, letting alerts trigger on changes in real user experience rather than at some arbitrary threshold.

Service performance is generally defined in these ways:

  • Service Level Agreements (SLA): These are contracts between a provider and a client that determine acceptable performance measurements, and the consequences for violating those measurements.
  • Service Level Objectives (SLO): Targets, usually internal, for specific metrics that the provider aims to meet. These should be as specific as possible and are usually stricter than the SLA. For example, 99.999% uptime or “responds to critical issues in 60 minutes or less.”
  • Service Level Indicators (SLI): The measurement being evaluated in an SLO or SLA. For example, if the SLO is “99.999% uptime,” the SLI is the uptime metric.
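
To make the relationship between these terms concrete, the following minimal sketch computes an availability SLI from request counts and compares it with an SLO target and a looser SLA target. The function name, request counts, and targets are hypothetical values chosen for illustration and are not part of Observability Platform.

```python
from fractions import Fraction


def availability_sli(good_requests: int, total_requests: int) -> Fraction:
    """Return the fraction of requests that succeeded (the SLI)."""
    if total_requests == 0:
        # Assumption: with no traffic, nothing failed, so treat the SLI as met.
        return Fraction(1)
    return Fraction(good_requests, total_requests)


# Hypothetical targets: the internal SLO is stricter than the contractual SLA.
SLO_TARGET = Fraction("0.999")
SLA_TARGET = Fraction("0.99")

sli = availability_sli(good_requests=99_950, total_requests=100_000)
meets_slo = sli >= SLO_TARGET
meets_sla = sli >= SLA_TARGET
print(f"SLI={float(sli):.4%} meets SLO: {meets_slo}, meets SLA: {meets_sla}")
```

Exact fractions are used here only to avoid floating-point surprises when comparing against targets such as 99.9%.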

The Google SRE handbook explains SLO terminology in detail.

Chronosphere Observability Platform lets you create Service Level Objectives (SLOs). Use these to budget for errors. Chronosphere SLOs use a rolling window to report general status.

To create or update an SLO, see Create an SLO.

View overall SLO status

To view a list of existing SLOs:

In the navigation menu, click Go to Admin and then select System Overview > SLOs.

To filter the SLO list, use the Search SLOs box to search for a specific name, or choose a value from the Select an owner or Select a team menus.

The SLO table contains the following information:

  • Status: The overall health of the SLO. Service health is determined by monitor or SLO status. If any monitor is red (breaching), the SLO is also red.

    The status icon has one of the following meanings:

    • Has at least one series with an error rate that exceeds the defined critical conditions.
    • Has at least one series with an error rate that exceeds the defined warning conditions.
    • No series are alerting.

  • Name: The defined SLO name.

  • Objective: The SLO objective.

  • Owner: The service or collection that owns this SLO.

  • Team: The team responsible for the Owner.

  • Source: The creation method for this SLO.
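
The Status column described in this list behaves like a worst-of aggregation across series. As a rough sketch, assuming hypothetical per-series error rates and hypothetical warning and critical thresholds (the real conditions are whatever is defined on the SLO), the logic could look like this:

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def series_status(error_rate: float, warning: float, critical: float) -> Status:
    """Classify one series against hypothetical warning/critical thresholds."""
    if error_rate > critical:
        return Status.CRITICAL
    if error_rate > warning:
        return Status.WARNING
    return Status.OK


def slo_status(error_rates, warning: float, critical: float) -> Status:
    """The SLO takes the worst status found across all of its series."""
    return max(
        (series_status(rate, warning, critical) for rate in error_rates),
        default=Status.OK,
    )


# One series exceeds the warning threshold, none exceeds the critical one.
print(slo_status([0.001, 0.02], warning=0.01, critical=0.1).name)
```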

View a specific SLO

Click the name of any SLO to open its details page. The details page is similar to a dashboard, and is composed of important metrics related to one or more services.

If a low error rate SLO alert fires, the alert can continue to fire after the issue that caused it is resolved, for up to the number of hours configured for the long window. How long the alert keeps firing depends on the rate of decrease in the error budget.
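
One way to picture why an alert can outlast the incident: if the alert condition is evaluated over a rolling long window, errors from the incident keep counting until they age out of that window. The sketch below is a simplified illustration under that assumption, with a hypothetical 6-hour long window, hourly buckets, and a made-up error-rate threshold. It is not a description of how Observability Platform evaluates alerts internally.

```python
from collections import deque

LONG_WINDOW_HOURS = 6  # hypothetical long window
THRESHOLD = 0.02       # hypothetical: fire when the window error rate exceeds 2%


def firing_hours(hourly_buckets, window_hours=LONG_WINDOW_HOURS, threshold=THRESHOLD):
    """Return the hours at which the rolling-window error rate exceeds the threshold.

    hourly_buckets is a list of (errors, total_requests) tuples, one per hour.
    """
    window = deque(maxlen=window_hours)  # drops the oldest hour automatically
    firing = []
    for hour, bucket in enumerate(hourly_buckets):
        window.append(bucket)
        window_errors = sum(errors for errors, _ in window)
        window_total = sum(total for _, total in window)
        if window_total and window_errors / window_total > threshold:
            firing.append(hour)
    return firing


# 1,000 requests per hour; the incident only affects hours 2 and 3.
buckets = [(0, 1000), (0, 1000), (200, 1000), (200, 1000)] + [(0, 1000)] * 8
print(firing_hours(buckets))  # [2, 3, 4, 5, 6, 7, 8]
```

In this made-up scenario the incident ends after hour 3, yet the condition stays true through hour 8, until the last incident hour leaves the 6-hour window.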

On an SLO page, the following options are available after the SLO name:

  • Events: Click to open the Display events drawer. Select the checkboxes for the events you want to display, and then click Save.
  • Mute: Click to create a muting rule for this SLO.

Click the three vertical dots to select one of the following options:

  • Duplicate: Click to open the SLO create drawer, populated with the information used to create the existing SLO. Update the new SLO and then click Save.

  • Edit: Click to update your SLO using the create drawer.

  • Version history: Review previous versions of this SLO.

    Click Version history to display a panel with two tabs:

    • Code config: Displays a code representation of the selected entity as of the time of the selected revision.
    • Code diff: Displays a Git-style diff of the most recent change made to the entity, in Chronosphere API format. To compare the selected revision to another revision in the history, click the Compare With dropdown and select the timestamp of the revision that you want to compare against.
      • Click Unified to see the changes stacked in a single column.
      • Click Split to see changes side by side.

    The top of the list of changes shows who made the last change and the method used to make it.

    To view a revision in the history, click any entry in the list of timestamped revisions. The timestamps default to your local timezone.

    You can view unchanged lines within the diff by clicking the Expand X lines links.

    The Version History view retains up to 500 revisions, or up to 15 months of revisions if there are fewer than 500 revisions.

SLO details

Your SLO details are a high-level view of your overall SLO health. This section indicates whether your SLO is within objective, or breaching its target.

  • Availability target: The SLO definition, defined during creation or editing.
  • Availability: A graph displaying availability results over the selected time period.

Use the Series legend section to search for or select specific series to view. The table has the following fields:

  • Status: The SLO status.
  • Labels and values: Table headers are label names, and table cells are the corresponding label values, as listed in Labels and annotations.
  • Actual: The actual value of the metric.
  • Error budget: The remaining error budget.

An error budget is essentially the inverse of an SLO target. If you set an SLO target of 99% availability, you have a 1% error budget. Therefore, if the objective of an SLO changes, so does its error budget.
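
The arithmetic behind this can be sketched as follows. This is generic error budget math, not a statement of how the Error budget column is calculated in Observability Platform. The function names and request counts are hypothetical.

```python
from fractions import Fraction


def error_budget(target: Fraction) -> Fraction:
    """Allowed failure fraction: the inverse of the SLO target."""
    return 1 - target


def remaining_budget(target: Fraction, errors: int, total: int) -> Fraction:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0 < target < 1 or total <= 0:
        raise ValueError("expected 0 < target < 1 and total > 0")
    allowed_errors = error_budget(target) * total
    return 1 - Fraction(errors) / allowed_errors


# A 99% target leaves a 1% budget: 10,000 allowed errors per 1,000,000 requests.
print(float(remaining_budget(Fraction("0.99"), errors=2_500, total=1_000_000)))   # 0.75

# Tightening the objective to 99.9% shrinks the budget to 1,000 errors,
# so the same 2,500 errors now overspend it.
print(float(remaining_budget(Fraction("0.999"), errors=2_500, total=1_000_000)))  # -1.5
```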

SLI breakdown

The SLI breakdown section consists of sparkline graphs displaying information based on your service level indicators. These include:

  • Total requests: The total requests made with this SLO identifier.
  • Errors by endpoint: An overview of errors, broken out by individual endpoint.

Services

The section following the SLI breakdown displays graphs for the service selected. These graphs vary depending on the service. For example, if the service is RPC (gRPC), this section displays graphs for Requests per second, Errors per second, and Duration P99.

Click See details to view the service details page.

Change events

Change events are required for SLO history.

If this service uses change events, those events are graphed.

SLO information

This section describes the SLO. Update SLO information in the create screen. Runbook links display here when available.

Related queries depend on features enabled in your tenant. In addition, the SLO must be owned by a service, not a collection. When clicked, the links open in a new tab and prepopulate the page with a query based on the selected SLO. These links include the following:

Ownership

The Ownership section displays the Owner, which is a service or collection that owns this SLO. Policy is the notification policy selected for this SLO.

Labels and annotations

Labels are key-value pairs that filter the SLO to specific telemetry. For example, you might have a service with a label of service and a value of payment-gateway. These values display sequentially.
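
As an illustration of filtering by label, the sketch below keeps only the series whose labels contain every key-value pair defined on the SLO. It assumes exact-match semantics, which this page does not specify, and the series data and the `matches` helper are hypothetical.

```python
def matches(series_labels: dict, slo_labels: dict) -> bool:
    """True when the series carries every SLO label with the same value."""
    return all(series_labels.get(key) == value for key, value in slo_labels.items())


# Hypothetical SLO label filter, using the example from the text above.
slo_labels = {"service": "payment-gateway"}

# Hypothetical series, each identified by its label set.
series = [
    {"service": "payment-gateway", "endpoint": "/charge"},
    {"service": "payment-gateway", "endpoint": "/refund"},
    {"service": "inventory", "endpoint": "/items"},
]

selected = [labels for labels in series if matches(labels, slo_labels)]
for labels in selected:
    print(labels)
```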

Annotations are key-value pairs that provide additional information for events.

Service dashboards

Observability Platform generates a list of Service dashboards based on the dashboards attached to the service that owns the SLO.