Observability Platform concepts

Chronosphere Observability Platform includes several distinct components, utilities, and features that give you insight into your telemetry data. This guide describes the most common concepts you encounter while using Observability Platform.

Control Plane

The Control Plane is a processing layer between ingestion and storage that shapes, filters, and routes telemetry data before it counts against your license. Use the Control Plane to control costs and improve the relevance of stored data.

Metrics: Analyze traffic and usage to identify opportunities to reduce the overall volume of metrics and understand the impact of proposed shaping rules.
Traces: Use sampling, datasets, and behaviors to manage the trace data you keep and discard.
Logs: Apply shaping rules to transform, reshape, or exclude log data before it counts against your license.

For all telemetry types, create partitions to attribute costs and usage to appropriate owners in your organization. Define and attach budgets to partitions to safeguard against runaway usage and overspending.

Observe telemetry

Observability Platform combines metrics, traces, and logs in a single platform so you can correlate changes to incidents and monitor service health from one location.

Services curate and visualize service-level views from your ingested metrics, traces, and change events. Each service gets a dedicated page with rate, errors, and duration (RED) metrics, related monitors, alert statuses, and links to traces and events.
Dashboards visualize your telemetry data. You can customize and filter dashboards to focus on specific query results and gain deeper context about an issue.
Change events overlay deployment and configuration changes on Chronosphere resources such as dashboards, service pages, and traces, which helps you correlate changes with system anomalies during incident investigation.

Investigate monitors and alerts

Monitors define watch criteria that evaluate telemetry data against thresholds. When the criteria are met, the monitor generates an alert, which represents the active state of that condition. Monitors track conditions like capacity, uptime, and error rates, and classify them as passing, warning, or critical. Notification policies route alerts to the appropriate responders. Notifiers define where alerts are delivered, such as email, Slack, PagerDuty, or a webhook. Use signals to group alerts and control how many notifications Observability Platform sends.
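
As a rough illustration of how a monitor classifies a condition, the following sketch compares a sample value against warning and critical thresholds. It is not the Observability Platform API: the `Thresholds` class, the `classify` function, and the example numbers are hypothetical, and the sketch assumes that higher values are worse (as with an error rate).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical warning and critical bounds for one monitored signal."""
    warning: float
    critical: float


def classify(value: float, thresholds: Thresholds) -> str:
    """Return "critical", "warning", or "passing" for a single sample.

    Assumption: higher values are worse, for example an error rate.
    """
    if value >= thresholds.critical:
        return "critical"
    if value >= thresholds.warning:
        return "warning"
    return "passing"


# Illustrative numbers only: error rate as a fraction of requests.
error_rate_thresholds = Thresholds(warning=0.01, critical=0.05)
states = [classify(v, error_rate_thresholds) for v in (0.002, 0.02, 0.08)]
```

In this sketch, only the warning and critical states would correspond to an active alert. Routing that alert to responders is the job of notification policies and notifiers, which the sketch does not model.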

Review collections and services

A collection is a group of resources such as dashboards and monitors. A service is a type of collection that represents a logical unit emitting telemetry data, such as a microservice or endpoint. Observability Platform discovers services automatically or through user-defined discovery jobs, and generates a service page with queries, data visualizations, and related monitors for each one.

Define service level objectives

Service level objectives (SLOs) measure longer-term service reliability rather than point-in-time threshold breaches. An SLO defines:

A percentage objective representing your reliability goal, such as 99.95% uptime.
An error budget, or tolerance for downtime. The error budget is the complement of the objective: 100% minus the objective, so a 99.95% objective leaves a 0.05% error budget.
Indicator queries that measure performance against the objective.
A rolling time window over which Observability Platform evaluates performance.

SLOs complement monitors by detecting gradual degradation that fixed-threshold monitors might miss. When the error budget is depleted, the SLO reports that the service failed its objective. Use SLO burn rate alerts to notify teams before the budget is exhausted, and use differential diagnosis to isolate potential causes.
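
The relationship between the objective, the error budget, and burn rate can be sketched with simple arithmetic. The following example is a minimal illustration and not how Observability Platform computes SLOs: the function names are hypothetical, the indicator is assumed to be a count of good events out of total events in the window, and burn rate is assumed to follow the common definition of observed error rate divided by the error rate that the objective allows.

```python
def error_budget(objective: float) -> float:
    """Fraction of events allowed to fail, for example 0.9995 leaves about 0.0005."""
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be a fraction between 0 and 1")
    return 1.0 - objective


def budget_remaining(objective: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent over the window.

    A result of 0 or less means the budget is depleted.
    """
    if total == 0:
        return 1.0  # no events observed, so no budget has been spent
    allowed_bad = error_budget(objective) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def burn_rate(objective: float, good: int, total: int) -> float:
    """Observed error rate divided by the error rate the objective allows.

    A value of 1.0 spends the budget exactly on pace for the window;
    values above 1.0 exhaust it early.
    """
    if total == 0:
        return 0.0
    observed_error_rate = (total - good) / total
    return observed_error_rate / error_budget(objective)


# Illustrative numbers only: 200 bad events out of 1,000,000
# against a 99.95% objective.
remaining = budget_remaining(0.9995, good=999_800, total=1_000_000)
rate = burn_rate(0.9995, good=999_800, total=1_000_000)
```

With these illustrative numbers, the objective allows about 500 bad events, so 200 bad events spend roughly 40% of the budget, leave roughly 60% of it, and correspond to a burn rate of roughly 0.4. A burn rate alert in this model would fire when the rate stays above a chosen multiple of 1.0 for long enough to threaten the budget.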

Investigate and analyze data

Observability Platform provides tools to reduce mean time to repair (MTTR) during incidents:

Differential diagnosis (DDx) identifies the most probable sources of issues by ranking and highlighting suspicious trends in your trace and metric data without requiring manual query construction.
Metrics Explorer, Trace Explorer, and Log Explorer search and visualize metric, trace, and log data to identify patterns and anomalies.
Telemetry Analyzer provides visibility into traffic volume and composition to identify cost reduction opportunities.