Service level objectives
This feature isn't available to all Chronosphere Observability Platform users and might not be visible in your app. For information about enabling this feature in your environment, contact Chronosphere Support.
Understanding the performance of your services is critical to ensuring those services remain operational at all times, especially under stress. Communicating expectations for service performance is an important part of many business operations. Traditional alerts are based on fixed thresholds, which can trigger due to inconsistent data or miss gradual issues that build up over time.
Service level objectives focus on longer-term performance and error burn rates. These objectives provide a more dynamic incident detection method, letting alerts trigger before an incident becomes critical, by assessing the impact of the problem.
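As an illustration of the burn rate idea, the following minimal sketch uses a common definition: the observed error rate divided by the error rate the objective allows. The function name and parameters are hypothetical and chosen for illustration. This isn't a description of how Observability Platform computes burn rates internally.

```python
def burn_rate(target_percent: float, total_requests: int, failed_requests: int) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive at exactly the rate the objective
    allows. Values above 1.0 mean the budget runs out before the SLO
    window ends. (Hypothetical helper, for illustration only.)
    """
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be between 0 and 100 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # Fraction of requests the objective allows to fail, for example 0.01 for 99%.
    allowed_error_rate = (100.0 - target_percent) / 100.0
    # Fraction of requests that actually failed in the measured window.
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / allowed_error_rate


if __name__ == "__main__":
    # 50 failures out of 10,000 requests against a 99% objective.
    print(burn_rate(99.0, 10_000, 50))
```

Under this definition, a fixed-threshold alert reacts only to the raw error count, while a burn rate relates the errors to the objective, which is why it can flag a slow, sustained problem earlier.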
Service performance is generally defined in these ways:
- Service Level Agreements (SLA): These are standards between a provider and a client that determine acceptable performance measurements, and the consequences for violating those measurements.
- Service Level Objectives (SLO): Agreements for specific metrics the provider must meet. These should be as specific as possible. For example, 99.999% uptime or "responds to critical issues in 60 minutes or less."
- Service Level Indicators (SLI): The actual statistics driving your performance metrics.
The Google SRE handbook explains SLO terminology in detail.
Chronosphere Observability Platform lets you create Service Level Objectives (SLOs). Use these to budget for errors.
To create or update an SLO, see Create an SLO.
View overall SLO status
View a list of existing SLOs:
In the navigation menu, click Go to Admin and then select Lens > SLOs.
To filter the SLO list, use the Search SLOs box to search for a specific name, or Select an owner or Select a team from the menus.
The SLO table contains the following information:
- Status: The overall health of the SLO. The status icon indicates one of the following:
  - Has a currently alerting monitor that exceeds the defined critical conditions.
  - Has a currently alerting monitor that exceeds the defined warning conditions.
  - No monitors are currently alerting.
- Name: The defined SLO name.
- Objective: The SLO objective.
- Owner: The service or collection that owns this SLO.
- Team: The team responsible for the Owner.
- Source: The creation method for this SLO.
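The Status column reflects the most severe condition among an SLO's monitors. A minimal sketch of that precedence follows. The `MonitorState` enum and `slo_status` function are hypothetical names used for illustration, not part of any Observability Platform API.

```python
from enum import Enum
from typing import Iterable


class MonitorState(Enum):
    """Hypothetical monitor states mirroring the three status descriptions."""
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def slo_status(monitor_states: Iterable[MonitorState]) -> MonitorState:
    """Return the overall status: critical outranks warning, which outranks ok."""
    states = set(monitor_states)
    if MonitorState.CRITICAL in states:
        return MonitorState.CRITICAL
    if MonitorState.WARNING in states:
        return MonitorState.WARNING
    # No monitors are currently alerting.
    return MonitorState.OK


if __name__ == "__main__":
    print(slo_status([MonitorState.OK, MonitorState.WARNING]).value)
```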
View a specific SLO
Click the name of any SLO to open its detail page. The details page is similar to a dashboard, and is composed of important metrics related to one or more services.
If a low error rate SLO alert fires, the alert can continue to fire for up to six hours after the issue that caused it is resolved. How long the alert continues to fire depends on the rate of decrease in the error budget.
SLO details
Your SLO details are a high-level view of your overall SLO health. This section indicates whether your SLO is within objective, or breaching its target.
- Availability target: The SLO definition, defined during creation or editing.
- Availability: A graph displaying availability results over the selected time period.
Use the Series legend section to search for or select specific series to view. The table has the following fields:
- Status: The SLO status.
- Labels and values: Table headers are labels and table values are keys, as listed in Labels and annotations.
- Actual: The actual value of the metric.
- Error budget: The remaining error budget.
An error budget is essentially the inverse of an SLO target. If you set an SLO target of 99% availability, you have a 1% error budget. Therefore, if the objective of an SLO changes, so does its error budget.
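The relationship between a target and its error budget can be sketched in a few lines. The function names here are hypothetical, and the request-count approach is an assumption made for illustration. It isn't necessarily how the Error budget column is calculated.

```python
def error_budget_percent(target_percent: float) -> float:
    """Return the error budget, in percent, for an SLO target given in percent."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range (0, 100]")
    return 100.0 - target_percent


def remaining_budget_fraction(
    target_percent: float, total_requests: int, failed_requests: int
) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing is spent, 0.0 means the budget is exhausted, and a
    negative value means the SLO is breaching its target.
    """
    allowed_failures = total_requests * error_budget_percent(target_percent) / 100.0
    if allowed_failures == 0:
        # A 100% target or zero traffic leaves no budget to spend.
        return 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99% target leaves a 1% budget: 10,000 allowed failures per 1,000,000 requests.
    print(error_budget_percent(99.0))
    print(remaining_budget_fraction(99.0, 1_000_000, 2_500))
```

In this sketch, raising the target from 99% to 99.9% shrinks the budget roughly tenfold, so the same number of failures consumes a much larger share of it.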
SLI breakdown
The SLI breakdown section consists of sparkline graphs displaying information based on your service level indicators. These include:
- Total requests: The total requests made with this SLO identifier.
- Errors by endpoint: An overview of endpoints, broken out by individual endpoint.
Services
The section following the SLI breakdown displays graphs for the service selected. These graphs vary depending on the service. For example, if the service is RPC (gRPC), this section displays graphs for Requests per second, Errors per second, and Duration P99.
Click See details to view the service details page.
Change events
Change events are required for SLO history.
If this service uses change events, those events are graphed.
SLO information
This section describes the SLO. To update SLO information, use the create screen. Runbook links display here when available.
Related queries depend on features enabled in your tenant. In addition, the SLO must be owned by a service, not a collection. When clicked, the links open in a new tab and prepopulate the page with a query based on the selected SLO. These links include the following:
- View traces: When traces are enabled, this links to Trace Explorer.
- View events: When change events are enabled, this link opens Changes Explorer.
Ownership
The Ownership section displays the Owner, which is a service or collection that owns this SLO. Policy is the notification policy selected for this SLO.
Labels and annotations
Labels are key-value pairs that filter the SLO to specific telemetry. For example, you might have a service with a label of service and a value of payment-gateway. These values display sequentially.
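To illustrate how key-value labels narrow an SLO to specific telemetry, the following sketch checks whether a series carries every label the SLO specifies. The `matches_labels` function and the sample series are hypothetical. The `service` and `payment-gateway` values come from the example above.

```python
from typing import Mapping


def matches_labels(series_labels: Mapping[str, str], slo_labels: Mapping[str, str]) -> bool:
    """Return True when the series has every SLO label with the same value."""
    return all(series_labels.get(key) == value for key, value in slo_labels.items())


if __name__ == "__main__":
    slo_labels = {"service": "payment-gateway"}

    # Hypothetical series, each described by its label set.
    series = [
        {"service": "payment-gateway", "endpoint": "/charge"},
        {"service": "inventory", "endpoint": "/items"},
    ]

    # Keep only the series the SLO applies to.
    selected = [s for s in series if matches_labels(s, slo_labels)]
    print(selected)
```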
Annotations are key-value pairs that provide additional information for events.
Service dashboards
Observability Platform generates a list of Service dashboards based on the dashboards attached to the service that owns the SLO.