Respond to incidents - Chronosphere Documentation

When an alert triggers, time matters. This guide walks through the end-to-end workflow for on-call engineers: from receiving a notification, to triaging scope and impact, to investigating root causes across telemetry types, to silencing noise and documenting resolution.

Step 1: Receive and review the alert

Chronosphere Observability Platform routes alert notifications through the channels your team has configured, such as PagerDuty, OpsGenie, Slack, email, or a webhook. When a notification arrives, it links back to the triggering alert in Observability Platform. If you’re already in the product, view your personal home page for recently triggered alerts, recent comments from teammates, and activity on resources you’ve added as favorites. Global search (Control+K or Command+K on macOS) also jumps directly to a monitor, dashboard, or service by name.

To find active alerts directly in the product:

In the navigation menu, select Alerting > Alerts.
Review the Status column to distinguish Critical, Warning, Muted, and Resolved alerts.
Click an alert’s title to open its alert details page.

The alert details page displays the following information:

The alert’s current status and the triggering signals
A time series chart of the query that triggered the alert, with optional threshold overlays
The conditions that triggered, including operator, threshold, and sustain duration
Any change events that correlate with the alert window
The alert’s owner collection and responsible team
Annotations defined on the monitor, such as runbook links and related dashboard links

Annotations are the primary way to surface context for incident response, such as runbooks, related dashboards, data links to traces, and contact routing. If these links are present, start there. The people who built the monitor placed them to accelerate exactly this kind of triage.
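To make the relationship between operator, threshold, and sustain duration concrete, the following is a minimal sketch of how such a condition could be evaluated against a series. It is an illustration only, not how Observability Platform evaluates monitors; the names `Condition` and `first_trigger_time` are hypothetical, and it assumes samples arrive as sorted `(timestamp_seconds, value)` pairs.

```python
import operator
from dataclasses import dataclass

# Hypothetical mapping from a condition's operator symbol to a comparison.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Condition:
    op: str               # comparison operator, for example ">"
    threshold: float      # value the series is compared against
    sustain_seconds: int  # how long the breach must hold before triggering


def first_trigger_time(samples, condition):
    """Return the first timestamp at which the condition has held
    continuously for at least sustain_seconds, or None if it never does.

    samples: iterable of (timestamp_seconds, value), sorted by timestamp.
    """
    compare = OPERATORS[condition.op]
    breach_start = None
    for timestamp, value in samples:
        if compare(value, condition.threshold):
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= condition.sustain_seconds:
                return timestamp
        else:
            # Any sample back inside the threshold resets the sustain window.
            breach_start = None
    return None
```

Under these assumptions, a brief spike that recovers before the sustain duration elapses never triggers, which is why the sustain duration is worth reading alongside the threshold when judging how long a problem has actually been present.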

Step 2: Assess severity and scope

Before running queries, establish the scope to understand the surface area you’re dealing with.

Identify the affected service and team

The Alert information sidebar on the alert details page lists the owning collection and team responsible for the alert’s source. If the alert is about a service you don’t own, use this to identify who to contact. Click Return to source monitor (or Return to source SLO) to open the entity that defined the alert. From there, review the full monitor configuration, its notification policy, and its signal grouping.

Understand blast radius with the dependency map

If the alert originates from a known service, open the service page to see upstream and downstream dependencies:

In the navigation menu, select Go to Admin > System Overview > Services, then click the service name.
In the Dependency map, examine the arrows on edges between services. Single, double, and triple arrows indicate increasing trend magnitude for duration, requests, and errors.
Click a service in the map to see per-service totals for requests, errors, leaf errors, and latency percentiles.

The dependency map shows upstream and downstream relationships without requiring prior knowledge of the topology.

Scan for related alerts

Return to Alerting > Alerts and scan the list for other alerts that triggered in the same time window. Alerts occurring across multiple services suggest a broad infrastructure or network issue rather than a code-level regression in a single service. Adjust the time range on the alert details page to widen or narrow the window and confirm whether the alert is ongoing or already resolving.
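The reasoning about alerts in the same time window can be sketched as a small helper. This is an illustrative sketch only: the `Alert` record and `services_alerting_near` function are hypothetical, and it assumes you have a list of alerts with a service name and a trigger time in Unix seconds, for example copied from the alerts list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    triggered_at: int  # Unix seconds


def services_alerting_near(alerts, reference, window_seconds):
    """Return the sorted, distinct services with an alert that triggered
    within window_seconds of the reference alert (before or after)."""
    return sorted(
        {
            alert.service
            for alert in alerts
            if abs(alert.triggered_at - reference.triggered_at) <= window_seconds
        }
    )
```

If the result contains only the reference alert's service, a regression local to that service is the more likely hypothesis; several unrelated services in the result point toward shared infrastructure.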

Step 3: Investigate root cause

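Observability Platform provides several tools for narrowing an alert down to a root cause: differential diagnosis, Metrics Explorer, dashboards, period-over-period comparison, cross-telemetry correlation, change events, and natural language queries.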

Start with differential diagnosis on metrics

Differential diagnosis (DDx) scans all label-value combinations for a metric to find which ones are statistically correlated with the anomaly. Use DDx to convert a noisy alert signal into a specific hypothesis without manually comparing dozens of series.

From an alert details page:

Click DDx in the alert’s query visualization section.
Review which label-value combinations show the highest divergence from baseline.
Use those labels as starting filters for deeper investigation in Metrics Explorer.

From a service dependency map:

Click a service node or edge, then click the three dots icon next to the trend statistic you want to investigate.
Select Differential Diagnosis.
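To build intuition for what "divergence from baseline" means, the following sketch ranks label-value pairs by how much more often they appear in an anomalous window than in a baseline window. It is a deliberately simple stand-in, not the statistical method DDx uses; `rank_label_values` is a hypothetical name, and it assumes each window is a list of label dictionaries, one per series or sample.

```python
from collections import Counter


def rank_label_values(baseline, anomalous):
    """Rank (label, value) pairs by the change in their share of rows
    between the baseline window and the anomalous window.

    baseline, anomalous: lists of dicts mapping label name to label value.
    Returns a list of ((label, value), divergence), largest increase first.
    """

    def shares(rows):
        counts = Counter((label, value) for row in rows for label, value in row.items())
        total = len(rows) or 1  # avoid dividing by zero for an empty window
        return {pair: count / total for pair, count in counts.items()}

    base = shares(baseline)
    anom = shares(anomalous)
    scores = {
        pair: anom.get(pair, 0.0) - base.get(pair, 0.0)
        for pair in set(base) | set(anom)
    }
    # Sort by divergence descending, then by pair for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Pairs near the top of the ranking are the ones worth using as starting filters; pairs with a divergence near zero are present in both windows and are unlikely to explain the anomaly.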

Apply differential diagnosis to traces

If the alert involves latency or error-rate spikes and your service emits traces, run DDx in Trace Explorer. DDx scans all tag:value pairs across spans to identify which combinations are unusually correlated with errors or high latency, such as a specific deployment.version or host.region.

In Trace Explorer, define a search that captures the problematic traffic.
Click the Differential Diagnosis tab.
Compare current data against a baseline period immediately before the incident to distinguish newly correlated tags from those that are always present.

Use Metrics Explorer to drill into queries

Click Open in explorer from the alert details page to open Metrics Explorer pre-populated with the triggering query. From there:

Adjust label filters to isolate specific hosts, pods, or regions
Use the Query Builder to modify the query without writing PromQL from scratch
Toggle the time range to compare current behavior to a prior window

Check related dashboards

If the monitor’s annotations link to a dashboard, open it for a pre-built operational view of the affected service. Dashboards combine multiple panels, including time series, gauges, tables, and service topology, that the team curated for exactly this kind of investigation.

Dashboards are also available through:

The service page’s connected resources
Your personal home page, which lists recent and favorite dashboards
Global search (Control+K or Command+K on macOS), filtered to Dashboards

From any dashboard panel, use the three-dots menu to open the query in Metrics Explorer for further analysis, or add the panel to a notebook to preserve it as evidence. For more information, see Dashboards.

Compare to a previous time period

Use the Compare option in the time range selector to overlay data from a prior period, for example, the same time window one week ago. This reveals whether the current behavior is anomalous or matches a recurring pattern, such as a weekly traffic cycle. On service pages, drag a region in any chart to synchronize all panels to that time window, making cross-signal correlation faster.
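The comparison that the overlay makes visually can also be expressed numerically. The sketch below is illustrative only: `compare_to_previous` is a hypothetical helper, and it assumes the series has been exported as a dictionary from Unix timestamp to value with points aligned one week apart.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60


def compare_to_previous(series, offset_seconds=WEEK_SECONDS):
    """For each point that has a counterpart offset_seconds earlier,
    return (current, previous, current / previous).

    series: dict mapping Unix timestamp to value.
    Points without a prior counterpart, or with a prior value of zero,
    are skipped.
    """
    comparison = {}
    for timestamp, current in series.items():
        previous = series.get(timestamp - offset_seconds)
        if previous is None or previous == 0:
            continue
        comparison[timestamp] = (current, previous, current / previous)
    return comparison
```

A ratio close to 1 suggests the current behavior matches the recurring pattern; a ratio well above or below 1 suggests the behavior is new.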

Correlate across telemetry types

Observability Platform connects metrics, traces, and logs. To carry context forward when moving between them:

From a service page, click Explore trace data in the dependency map to open Trace Explorer pre-scoped to that service.
From Trace Explorer, identify leaf error spans. These are errors with no failing child span, which typically indicate the actual source of a failure rather than a propagated error.
Use pinned scopes to carry label filters such as environment=production across Metrics Explorer, service pages, and dashboards without re-entering them on each page.

If your service page includes a Logs link, open Logs Explorer scoped to the same service. This approach is effective when the metric alert is caused by an app-level error that only surfaces in log output.
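The definition of a leaf error span (an error span with no failing child span) can be sketched in a few lines. The `Span` record and `leaf_error_spans` function are hypothetical and exist only for illustration; they assume each span carries its own ID, its parent's ID, and an error flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    is_error: bool


def leaf_error_spans(spans):
    """Return error spans that have no failing child span."""
    # Collect the IDs of every span that has at least one failing child.
    parents_with_failing_child = {
        span.parent_id
        for span in spans
        if span.is_error and span.parent_id is not None
    }
    return [
        span
        for span in spans
        if span.is_error and span.span_id not in parents_with_failing_child
    ]
```

For a trace where a gateway span fails because a checkout span failed because a payments span failed, only the payments span is returned; the other two errors are propagated.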

Check change events for correlated deployments

Change events appear as vertical markers on time series charts across service pages, dashboards, and Metrics Explorer. Look for markers that align with the start of the anomaly. A deployment, config change, or feature flag flip often correlates directly with the alert. To search across all recent changes, open Changes Explorer and filter by time range, service, or event source. Change events enabled on dashboards and service pages also display inline without additional configuration.
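Looking for markers that align with the start of the anomaly amounts to filtering change events to a short window before the anomaly began. The following sketch shows that filter; `ChangeEvent` and `changes_before` are hypothetical names, and the lookback window is an assumption you would tune to how quickly changes take effect in your environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    title: str
    service: str
    timestamp: int  # Unix seconds


def changes_before(events, anomaly_start, lookback_seconds):
    """Return change events that occurred within lookback_seconds before
    (or exactly at) anomaly_start, most recent first."""
    candidates = [
        event
        for event in events
        if anomaly_start - lookback_seconds <= event.timestamp <= anomaly_start
    ]
    return sorted(candidates, key=lambda event: event.timestamp, reverse=True)
```

The most recent change before the anomaly is the first candidate to check, but proximity in time is only a hint; confirm the correlation with the affected service and the labels that DDx surfaced.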

Generate queries with natural language

If you’re unfamiliar with the affected service’s metrics or don’t know the exact PromQL or log query syntax, use natural language queries to describe what you’re looking for. Click Edit with AI in Metrics Explorer or Logs Explorer (or press Control+I / Command+I on macOS) and enter a prompt such as:

error rate for checkout service in us-east-1 last hour

Observability Platform generates a query using semantic metric search to match relevant metrics by intent rather than exact name.

Step 4: Reduce noise while you work

Alerts that are triggering while you investigate can generate repeated notifications. Muting the right alerts without over-muting lets you work without distraction.

Create a targeted muting rule

From the alert details page, click Mute alert to open a muting rule pre-populated with the alert’s source and name. This approach is the safest way to mute, as the scope derives from the specific alert rather than relying on manual entry.

Before saving, the Preview Alerts panel shows exactly which alerts the rule will silence. Review this list to confirm you’re not accidentally muting alerts from unrelated services.

To set a narrow scope, choose A Monitor rather than Time Series unless you have a specific reason to mute based on label matching. For more granular control, such as muting only a specific pod or environment, use Time Series and define the label matcher. Use regular expression matching when you need to mute a pattern of related series.

Set a duration that matches your expected resolution window. You can always extend or expire the rule early. After the incident resolves, expire the rule immediately from Alerting > Muting Rules rather than letting it run to its scheduled end time.
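To reason about how broad a label matcher is before saving it, it can help to think of the preview as a filter over alert labels. The sketch below is an illustration, not the product's matching logic: `Matcher` and `preview_muted` are hypothetical, and it assumes regular expressions must match the whole label value, as Prometheus-style label matchers do.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    label: str
    pattern: str
    is_regex: bool = False

    def matches(self, labels):
        """Return True if the alert's labels satisfy this matcher."""
        value = labels.get(self.label)
        if value is None:
            return False
        if self.is_regex:
            # Assumption: the pattern must match the entire label value.
            return re.fullmatch(self.pattern, value) is not None
        return value == self.pattern


def preview_muted(alerts, matchers):
    """Return the alerts (as label dicts) that satisfy every matcher."""
    return [labels for labels in alerts if all(m.matches(labels) for m in matchers)]
```

An exact matcher on a pod name silences one series, while a pattern such as `checkout-.*` silences every pod with that prefix, so a regular expression is the case where reviewing the preview list matters most.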

Verify notification routing

If you’re unsure whether notifications are reaching the right people, open the source monitor and click Test notifications to send a synthetic alert through the same routing logic as a real alert. This confirms that notifiers and templates are working as expected without waiting for another real event to fire.

Step 5: Collaborate and resolve

Use notebooks, comments, and resolution notes to share investigation context with other responders and create a permanent record of what happened and how you fixed it.

Gather evidence in a notebook

As you investigate, add charts, query results, and dashboard panels to a notebook to build a running evidence file. Click the Notebook icon in the page header or right sidebar to open one. From any panel or explorer result, click the three-dots menu and select Add to notebook, or drag panels directly from a dashboard into the open notebook. Notebooks support:

Panels from dashboards, Metrics Explorer, Logs Explorer, DDx, alert details, SLO charts, service pages, monitors, and the metrics catalog
A dedicated Add to notebook action in Logs Explorer that adds both a log volume histogram and the current results visualization
Copying a URL from any resource page and pasting it into the notebook to embed a link card
Per-panel time range overrides, so you can compare the same chart at different points in time
Snapshots to freeze a panel’s data for long-term reference
Panel editing to refine queries without leaving your evidence file
Version history to review or restore earlier states of the notebook

Share a notebook by clicking Copy URL to give other responders a single link to your full investigation context. If another responder edits a shared notebook, Observability Platform prompts you to reload so both sides stay in sync.

Add comments for active collaboration

Comments let you annotate monitors, metrics, and change events with notes visible to anyone viewing that resource. From the monitor’s page, click + Add comment to leave notes for other responders working the same incident. Comments also persist on metrics in Metrics Explorer and on change events in Changes Explorer. Use them to record institutional knowledge such as “this metric spikes every Tuesday during batch processing” so the next responder doesn’t retrace the same investigation.

Document the resolution

When the alert resolves, add a resolution note on the alert details page. Resolution notes:

Accept Markdown, so you can include code snippets, links, and structured text
Associate with specific signals so notes for different affected services stay organized
Persist after the alert closes, creating a searchable record for post-incident reviews

To add a resolution note:

On the alert details page, click + Add in the Resolution notes section.
Enter a description of what you found and what action you took.
Click Create.

Annotate related change events

If a deployment or configuration change correlates with the alert, create a change event from the monitor page. Change events appear in time series charts across Observability Platform, making the correlation visible to anyone who views that time window in the future.

What to do after the incident

The incident response workflow doesn’t end at resolution. After closing the incident, consider taking the following actions to help other users remediate future issues with less investigation:

Improve monitor signal quality: if the alert triggered but required significant manual triage, edit the monitor’s annotations to add runbook links, dashboard links, or data links to relevant traces for the next responder.
Adjust thresholds: if the alert was noisy, triggering frequently on non-issues, review the monitor’s conditions and sustain duration to better reflect real risk.
Review the alert history for the monitor to see whether this alert has triggered repeatedly, which might indicate a systemic issue worth addressing at the code or infrastructure level.
Favorite the dashboards and monitors you used during the incident. Favorites appear on your personal home page and surface first in global search, reducing navigation time during future incidents.
Add change events to mark the deployment or configuration change that caused the issue, if one hasn’t already been created. These markers persist on time series charts for future responders.
