- A percentage objective that represents your goal for uptime or error-free operation. For example, your objective for a service might be to maintain 99.95% uptime.
- An error budget, or your tolerance for downtime or errors. This is the complement of your objective (100% minus the objective), because it represents the service capacity that can be lost before the service fails its objective. Any change to your objective therefore also changes your error budget. For example, a 99.95% uptime objective also defines a 0.05% error budget.
- Metrics queries that define indicators of your performance against that objective. For example, you might query for the summed duration of error responses that your service returned to requests, and compare that to the total time the service was running. This creates an error ratio, or the percentage of errors against your total.
- A time window that you’ll measure for performance against your objective. Your SLO measures your error or success rates against the total over the time window, such as the last four weeks, to determine whether the service met the objective.
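The relationship between these four parts can be sketched in a few lines of Python. This is an illustrative calculation only, not how Observability Platform evaluates SLOs; the function names are hypothetical, and it assumes a time-based SLI measured in seconds. With a 99.95% objective over a four-week window, the error budget works out to roughly 20 minutes of error time.

```python
from datetime import timedelta


def error_budget(objective_percent: float) -> float:
    """Return the error budget as a fraction of the window (99.95 -> about 0.0005)."""
    if not 0.0 < objective_percent < 100.0:
        raise ValueError("objective_percent must be between 0 and 100, exclusive")
    return (100.0 - objective_percent) / 100.0


def allowed_error_time(objective_percent: float, window: timedelta) -> timedelta:
    """Return how much of the window can be spent in error before the SLO breaches."""
    return window * error_budget(objective_percent)


def error_ratio(error_seconds: float, total_seconds: float) -> float:
    """Return the fraction of the measured time that was spent in error."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return error_seconds / total_seconds


def meets_objective(error_seconds: float, total_seconds: float,
                    objective_percent: float) -> bool:
    """Return True when the observed error ratio fits within the error budget."""
    return error_ratio(error_seconds, total_seconds) <= error_budget(objective_percent)
```

For example, `allowed_error_time(99.95, timedelta(weeks=4))` returns the error time the window can absorb, and `meets_objective(600, timedelta(weeks=4).total_seconds(), 99.95)` checks ten minutes of errors against that budget.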
Differences from monitors
Although SLOs seem similar to monitors, SLOs provide a more dynamic incident detection method that lets you trigger alerts based on changes in real user experiences, rather than at an arbitrary threshold. SLOs also provide additional details for more granular notifications:
- SLOs report the burn rate of your error budget, which you can configure to raise alerts when the service is depleting its error budget over a short period within your time window. Burn rate alerts can help you respond to sudden degradations of performance before they breach your SLO, and burn rate visualizations can identify patterns in error rates that might not be as evident when looking only at the service’s metrics. For example, a burn rate alert can trigger notifications if more than 2% of your error budget is consumed over a one-hour span. You can then respond closer to the beginning of the incident and attempt to prevent the SLO from breaching by investigating the problem and finding a solution. Such a spike in burn rate also displays on the SLO’s burn rate chart, which can help you pinpoint when the service degradation started.
- You can define label-based dimensions to break down your SLO’s measurement by time series. This helps you respond to complex services represented by multiple time series by letting you signal for specific series that breach the SLO.
- You can perform differential diagnosis (DDx) on your SLO’s charts to begin correlating concerning patterns in error rates.
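To make the burn rate example above concrete, the following sketch uses a common definition of burn rate: the observed error ratio divided by the error budget, so a burn rate of 1 spends the budget exactly over the full time window. Whether Observability Platform computes burn rate this way isn't stated here, so treat the formula, the function names, and the 2% default threshold as assumptions for illustration.

```python
def burn_rate(error_ratio: float, error_budget: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are occurring."""
    if error_budget <= 0:
        raise ValueError("error_budget must be positive")
    return error_ratio / error_budget


def budget_consumed(error_ratio: float, error_budget: float,
                    alert_window_hours: float, slo_window_hours: float) -> float:
    """Return the fraction of the whole error budget used during the alert window."""
    if slo_window_hours <= 0:
        raise ValueError("slo_window_hours must be positive")
    rate = burn_rate(error_ratio, error_budget)
    return rate * alert_window_hours / slo_window_hours


def should_alert(error_ratio: float, error_budget: float,
                 alert_window_hours: float, slo_window_hours: float,
                 threshold: float = 0.02) -> bool:
    """Return True when the alert window consumed more than `threshold` of the budget."""
    consumed = budget_consumed(error_ratio, error_budget,
                               alert_window_hours, slo_window_hours)
    return consumed > threshold
```

With a 0.05% error budget and a four-week (672-hour) window, one hour at a 1% error ratio is a burn rate of about 20, which consumes close to 3% of the budget and would cross a 2% threshold. One hour at a 0.5% error ratio would not.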
SLO terminology
Service performance is generally defined in these ways:
- Service level agreements (SLA): Contracts between a provider and a client that determine acceptable performance measurements, and the consequences for violating those measurements. An SLA defines the limits and consequences for failures.
- Service level objectives (SLO): Usually internal targets for specific metrics that the provider aims to meet. These should be as specific as possible and are usually stricter than the SLA. For example, to ensure you meet an SLA to maintain 99.95% uptime or respond to an incident in less than two hours, you might define your SLO as meeting a standard of 99.999% uptime or responding to an incident within 60 minutes.
- Service level indicators (SLI): The measurement being evaluated in an SLO or SLA, often service uptime, availability, or response success rate. For example, if an SLO is to maintain 99.999% service uptime, the SLI is the service’s uptime metric.
View overall SLO status
You can view a list of SLOs to identify whether any are breaching their limits. You can also filter the list by keyword, team, or owner.
To view a list of existing SLOs:
- In the navigation menu, click Go to Admin.
- Select System Overview > SLOs.
- Enter text into the Search SLOs search field to filter by name.
- Use the Select an owner dropdown to filter by the SLO’s owning collection or service.
- Use the Select a team dropdown to filter by the SLO’s assigned team.
- Status: The SLO’s health, summarized as a status icon. An SLO’s health is defined by the alerting status of its monitor or the state of the series measured by the SLO as compared to its error budget. Each icon indicates one of the following states:
  - Has at least one series with an error rate that exceeds the defined critical conditions.
  - Has at least one series with an error rate that exceeds the defined warning conditions.
  - No series are alerting.
  - Alerting is muted.
  - Alerting is disabled.
  - No data is available.
- Name: The SLO’s name.
- Objective: The objective defined for this SLO.
- Alerting Enabled: Whether or not alerting is enabled for this SLO.
- Owner: The service or collection that owns this SLO.
- Team: The team responsible for the Owner.
- Source: This SLO’s creation method. Users can modify Terraform-managed resources only by using Terraform.
- Duplicate: Click to open the SLO create drawer, populated with the information used to create the existing SLO. Configure the new SLO and then click Save to create the new SLO.
- Edit: Click to update your SLO using the create drawer.
- Delete: Delete the SLO.
View an SLO
Click the name of any SLO in the list to open its page, which is similar to a dashboard and visualizes important metrics related to one or more services.
If this service uses change events, those events are graphed in this section. This includes events generated by this SLO and also events added by other features to connected services.
SLO menu
An SLO page’s menu provides access to features that modify the SLO’s behavior:
- Events: Click to open the Display events drawer. Select the checkboxes for the events you want to display, and then click Save.
- Mute: Click to create a muting rule for this SLO. If a muting rule is already active for an SLO, a banner indicates the active muting rule and its expiration.
- Duplicate: Click to open the SLO create drawer, populated with the information used to create the existing SLO. Configure the new SLO and then click Save to create the new SLO.
- Edit: Click to update your SLO using the create drawer.
- Version history: Review previous versions of this SLO’s configuration. Click Version history to display a panel with two tabs:
  - Code config: Displays a code representation of the selected entity as of the time of the selected revision.
  - Code diff: Displays a Git-style diff of the most recent change made to the entity, in Chronosphere API format. To compare the selected revision to another revision in the history, click the Compare With dropdown and select the timestamp of the revision that you want to compare.
    - Click Unified to see the changes inline in a single column.
    - Click Split to see changes side by side.
  The Version History view retains up to 500 revisions, or up to 15 months of revisions if there are fewer than 500 revisions.
SLO details
The SLO details section provides a high-level view of the SLO’s overall health, and indicates whether your SLO is meeting its objective or has breached its target.
- Alerting status: The SLO’s status.
- Availability target: The SLO’s currently defined objective.
- Reporting status: If the SLO is firing alerts, or if its error budgets are depleted or low, Observability Platform displays additional indicators to summarize these major issues.
- Availability: Availability results based on the SLI’s rate definition.
- Error budget: The SLO’s remaining error budget over its defined time window.
Reporting status
If an SLO is breached or close to being breached, the SLO page displays a Reporting status that’s otherwise hidden from view. This status contains chips for the SLO’s firing alerts, depleted error budgets, and error budgets that are close to depletion.
If the reporting status is visible for an SLO, you should immediately begin investigating the causes for the statuses it reports.
SLO alerting
If a low-error-rate SLO alert fires, it can continue to fire for hours after the issue that caused it is resolved, up to the length of the configured long window. How long the alert keeps firing depends on the rate at which the error budget decreased.
Series
Use the Series subsection’s table to search for or select specific series to view in the Availability and Error budget charts. Each row represents a time series returned in the SLI’s query.
The table has the following columns:
- Status: The SLO status for that series.
- Columns for labels and values: Each column’s header is the name of a label in that series, and its cells contain that label’s value for that row’s series.
- Actual: The metric’s value over the SLO’s defined time window.
- Error budget: The SLO’s remaining error budget for that series. If the cell’s background is red, its value represents a breach of the SLO.
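As a rough illustration of the per-series Error budget column, the sketch below computes the remaining budget for each series and lists the series that have breached. The calculation, the function names, and the `checkout` series with its counts are hypothetical; the actual values in the table come from the SLI queries in the SLO’s definition.

```python
from typing import Dict, List, Tuple


def remaining_budget(errors: float, total: float, objective_percent: float) -> float:
    """Return the fraction of error budget left: 1.0 untouched, 0.0 spent, negative breached."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < objective_percent < 100.0:
        raise ValueError("objective_percent must be between 0 and 100, exclusive")
    # Number of errors the series can absorb over the window without breaching.
    allowed_errors = total * (100.0 - objective_percent) / 100.0
    return 1.0 - errors / allowed_errors


def breached_series(series: Dict[str, Tuple[float, float]],
                    objective_percent: float) -> List[str]:
    """Return the names of series whose error budget is overspent, sorted by name."""
    return sorted(
        name
        for name, (errors, total) in series.items()
        if remaining_budget(errors, total, objective_percent) < 0
    )


# Hypothetical (errors, total) counts per series over the SLO's time window.
example_series = {
    "payment-gateway": (150.0, 10_000.0),
    "checkout": (50.0, 10_000.0),
}
```

Calling `breached_series(example_series, 99.0)` returns only the series that would appear with a red Error budget cell under these assumed counts.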
SLI breakdown
The SLI breakdown section consists of charts that visualize your service level indicators, which are based on the SLO’s definition. For more information, see Create service level objectives.
These charts include:
- Total requests: A visualization of the SLI’s total query, representing the total requests to the service.
- Errors: A visualization of the SLI’s error query.
Burn and error rates
The Burn/Error rates section consists of charts that visualize the error budget burn rate and the rate of reported errors. Burn rate calculations are based on the SLO’s definition. For more details, see Create service level objectives.
You can adjust the window used for visualizations, which can be 1h, 6h, 1d, or 3d.
Change events
This feature isn’t available to all Chronosphere Observability Platform users and
might not be visible in your app. For information about enabling this feature in your
environment, contact Chronosphere Support.
Change events are required for SLO history.
SLO information
The SLO information section provides a user-defined Description of the SLO and relevant Runbook links, as defined in the create drawer.
Related queries depend on features enabled in your tenant. In addition, the SLO must be owned by a service, not a collection. When clicked, the links open in a new tab and populate the page with a query based on the selected SLO. These links include the following:
- View traces: When traces are enabled, this links to Trace Explorer.
- View events: When change events are enabled, this link opens Changes Explorer.
Ownership
The Ownership section displays the SLO’s Owner, which is a service or collection. Its Policy links to the SLO’s selected notification policy.
Labels and annotations
Labels are key-value pairs that filter the SLO to specific telemetry. For example, you might have a service with a label of service and a value of payment-gateway. These values display sequentially.
Annotations are key-value pairs that provide additional information for events.