Step 1: Receive and review the alert
Chronosphere Observability Platform routes alert notifications through the channels your team has configured, such as PagerDuty, OpsGenie, Slack, email, or a webhook. When a notification arrives, it links back to the triggering alert in Observability Platform. If you’re already in the product, view your personal home page for recently triggered alerts, recent comments from teammates, and activity on resources you’ve added as favorites. Global search (Control+K or Command+K on macOS) also jumps directly to a monitor, dashboard, or service by name.
To find active alerts directly in the product:
- In the navigation menu, select Alerting > Alerts.
- Review the Status column to distinguish Critical, Warning, Muted, and Resolved alerts.
- Click an alert’s title to open its alert details page.
The alert details page includes:
- The alert’s current status and the triggering signals
- A time series chart of the query that triggered the alert, with optional threshold overlays
- The conditions that triggered, including operator, threshold, and sustain duration
- Any change events that correlate with the alert window
- The alert’s owner collection and responsible team
- Annotations defined on the monitor, such as runbook links and related dashboard links
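To make the operator, threshold, and sustain duration fields easier to read, the following is a minimal sketch of how such a condition could be evaluated against a series of samples. It assumes a condition triggers only after the comparison has held continuously for the full sustain duration. The `Condition` class and `first_trigger_time` function are hypothetical names for illustration and aren't part of Observability Platform.

```python
import operator
from dataclasses import dataclass

# Hypothetical illustration only: not an Observability Platform API.
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class Condition:
    op: str               # comparison operator, for example ">"
    threshold: float      # value the series is compared against
    sustain_seconds: int  # how long the comparison must hold continuously


def first_trigger_time(samples, condition):
    """Return the first timestamp at which the condition has held for the
    full sustain duration, or None if it never does.

    samples: list of (unix_timestamp, value) pairs sorted by timestamp.
    """
    compare = OPERATORS[condition.op]
    breach_start = None
    for ts, value in samples:
        if compare(value, condition.threshold):
            if breach_start is None:
                breach_start = ts  # start of a new continuous breach
            if ts - breach_start >= condition.sustain_seconds:
                return ts
        else:
            breach_start = None  # breach interrupted, reset the timer
    return None


# Assumed sample data: an error ratio scraped every 60 seconds.
samples = [(0, 0.2), (60, 0.9), (120, 0.95), (180, 0.4),
           (240, 0.9), (300, 0.92), (360, 0.97)]
cond = Condition(op=">", threshold=0.8, sustain_seconds=120)
print(first_trigger_time(samples, cond))
```

In this sample data, the first breach (60 to 120 seconds) is interrupted before it lasts 120 seconds, so only the second breach triggers. A longer sustain duration trades faster detection for fewer notifications on short spikes.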
Step 2: Assess severity and scope
Before running queries, establish the scope to understand the surface area you’re dealing with.
Identify the affected service and team
The Alert information sidebar on the alert details page lists the owning collection and team responsible for the alert’s source. If the alert is about a service you don’t own, use this to identify who to contact. Click Return to source monitor (or Return to source SLO) to open the entity that defined the alert. From there, review the full monitor configuration, its notification policy, and its signal grouping.
Understand blast radius with the dependency map
If the alert originates from a known service, open the service page to see upstream and downstream dependencies:
- In the navigation menu, select Go to Admin > System Overview > Services, then click the service name.
- In the Dependency map, examine the arrows on edges between services. Single, double, and triple arrows indicate increasing trend magnitude for duration, requests, and errors.
- Click a service in the map to see per-service totals for requests, errors, leaf errors, and latency percentiles.
Scan for related alerts
Return to Alerting > Alerts and scan the list for other alerts that triggered in the same time window. Alerts occurring across multiple services suggest a broad infrastructure or network issue rather than a code-level regression in a single service. Adjust the time range on the alert details page to widen or narrow the window and confirm whether the alert is ongoing or already resolving.
Step 3: Investigate root cause
Finding the root cause of an issue can be complicated, so Observability Platform provides numerous tools to help target the problem and resolve it.
Start with differential diagnosis on metrics
Differential diagnosis (DDx) scans all label-value combinations for a metric to find which ones are statistically correlated with the anomaly. Use DDx to convert a noisy alert signal into a specific hypothesis without manually comparing dozens of series.
From an alert details page:
- Click DDx in the alert’s query visualization section.
- Review which label-value combinations show the highest divergence from baseline.
- Use those labels as starting filters for deeper investigation in Metrics Explorer.
From the dependency map on a service page:
- Click a service node or edge, then click the three dots icon next to the trend statistic you want to investigate.
- Select Differential Diagnosis.
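To illustrate the idea behind ranking label-value combinations by divergence from baseline, here is a simplified sketch. It isn't the statistical method DDx uses, which this guide doesn't specify. It assumes each sample is a dictionary of labels plus a value, and it scores each label-value pair by the relative change of its mean between a baseline window and the anomaly window. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def rank_label_divergence(baseline, current):
    """Rank label-value pairs by relative change in mean value.

    baseline, current: lists of (labels_dict, value) samples.
    Returns [((label, value), relative_change), ...] sorted by the
    absolute size of the change, largest first.
    """
    def mean_by_pair(samples):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for labels, value in samples:
            for pair in labels.items():
                sums[pair] += value
                counts[pair] += 1
        return {pair: sums[pair] / counts[pair] for pair in sums}

    base_means = mean_by_pair(baseline)
    cur_means = mean_by_pair(current)
    scores = []
    for pair, cur in cur_means.items():
        base = base_means.get(pair)
        if base is None or base == 0:
            continue  # no usable baseline for this pair
        scores.append((pair, (cur - base) / base))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Assumed sample data: request latency in milliseconds per series.
baseline = [({"region": "us-east-1", "pod": "a"}, 10.0),
            ({"region": "us-west-2", "pod": "b"}, 10.0)]
current = [({"region": "us-east-1", "pod": "a"}, 50.0),
           ({"region": "us-west-2", "pod": "b"}, 11.0)]

ranked = rank_label_divergence(baseline, current)
for pair, change in ranked:
    print(pair, round(change, 2))
```

The pairs at the top of the ranking are the ones worth using as starting filters in Metrics Explorer. Pairs that barely moved are unlikely to explain the anomaly.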
Apply differential diagnosis to traces
If the alert involves latency or error-rate spikes and your service emits traces, run DDx in Trace Explorer. DDx scans all tag:value pairs across spans to identify which combinations are unusually correlated with errors or high latency, such as a specific deployment.version or host.region.
- In Trace Explorer, define a search that captures the problematic traffic.
- Click the Differential Diagnosis tab.
- Compare current data against a baseline period immediately before the incident to distinguish newly correlated tags from those that are always present.
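The baseline comparison in the last step matters because some tags appear on every span and would look correlated with errors in any window. The sketch below shows one simple way to separate the two cases, assuming spans are dictionaries with `tags` and `error` fields. The field names, the `min_increase` cutoff, and the function names are hypothetical, and the scoring is a stand-in for whatever DDx computes.

```python
from collections import Counter


def error_tag_share(spans):
    """Return the fraction of error spans that carry each tag:value pair."""
    errors = [s for s in spans if s["error"]]
    if not errors:
        return {}
    counts = Counter(pair for s in errors for pair in s["tags"].items())
    return {pair: n / len(errors) for pair, n in counts.items()}


def newly_correlated(baseline_spans, current_spans, min_increase=0.5):
    """Return tag:value pairs whose share among error spans grew by at
    least min_increase compared to the baseline period."""
    base = error_tag_share(baseline_spans)
    cur = error_tag_share(current_spans)
    return sorted(
        pair for pair, share in cur.items()
        if share - base.get(pair, 0.0) >= min_increase
    )


# Assumed sample data: host.region is present before and during the
# incident, while deployment.version=v2 only appears on current errors.
baseline_spans = [
    {"tags": {"host.region": "us-east-1", "deployment.version": "v1"}, "error": True},
    {"tags": {"host.region": "us-east-1", "deployment.version": "v1"}, "error": False},
]
current_spans = [
    {"tags": {"host.region": "us-east-1", "deployment.version": "v2"}, "error": True},
    {"tags": {"host.region": "us-east-1", "deployment.version": "v2"}, "error": True},
    {"tags": {"host.region": "us-east-1", "deployment.version": "v1"}, "error": False},
]

print(newly_correlated(baseline_spans, current_spans))
```

Here `host.region=us-east-1` is on every error span in both periods, so it drops out, and only the new deployment version remains as a candidate cause.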
Use Metrics Explorer to drill into queries
Click Open in explorer from the alert details page to open Metrics Explorer pre-populated with the triggering query. From there:
- Adjust label filters to isolate specific hosts, pods, or regions
- Use the Query Builder to modify the query without writing PromQL from scratch
- Toggle the time range to compare current behavior to a prior window
Check related dashboards
If the monitor’s annotations link to a dashboard, open it for a pre-built operational view of the affected service. Dashboards combine multiple panels, including time series, gauges, tables, and service topology, that the team curated for exactly this kind of investigation. Dashboards are also available through:
- The service page’s connected resources
- Your personal home page, which lists recent and favorite dashboards
- Global search (Control+K or Command+K on macOS), filtered to Dashboards
Compare to a previous time period
Use the Compare option in the time range selector to overlay data from a prior period, for example, the same time window one week ago. This reveals whether the current behavior is anomalous or matches a recurring pattern, such as a weekly traffic cycle. On service pages, drag a region in any chart to synchronize all panels to that time window, making cross-signal correlation faster.
Correlate across telemetry types
Observability Platform connects metrics, traces, and logs. To carry context forward when moving between them:
- From a service page, click Explore trace data in the dependency map to open Trace Explorer pre-scoped to that service.
- From Trace Explorer, identify leaf error spans. These are errors with no failing child span, which typically indicate the actual source of a failure rather than a propagated error.
- Use pinned scopes to carry label filters such as environment=production across Metrics Explorer, service pages, and dashboards without re-entering them on each page.
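The definition of a leaf error span given above (an error with no failing child span) can be sketched in a few lines. The example assumes each span is a dictionary with `span_id`, `parent_id`, and `error` fields. These field names are hypothetical and don't reflect the Trace Explorer data model.

```python
def find_leaf_errors(spans):
    """Return the IDs of error spans that have no failing child span.

    spans: list of dicts with 'span_id', 'parent_id' (None for the root),
    and a boolean 'error'.
    """
    # Every span that is the parent of at least one failing span.
    parents_with_failing_child = {
        s["parent_id"]
        for s in spans
        if s["error"] and s["parent_id"] is not None
    }
    return [
        s["span_id"]
        for s in spans
        if s["error"] and s["span_id"] not in parents_with_failing_child
    ]


# Assumed sample trace: the database error propagates up to the gateway.
trace = [
    {"span_id": "gateway", "parent_id": None, "error": True},
    {"span_id": "checkout", "parent_id": "gateway", "error": True},
    {"span_id": "payments-db", "parent_id": "checkout", "error": True},
    {"span_id": "cart", "parent_id": "gateway", "error": False},
]

print(find_leaf_errors(trace))
```

Three spans report errors in this trace, but only `payments-db` has no failing child, so the investigation starts there rather than at the gateway, which only propagated the failure.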
Check change events for correlated deployments
Change events appear as vertical markers on time series charts across service pages, dashboards, and Metrics Explorer. Look for markers that align with the start of the anomaly. A deployment, config change, or feature flag flip often correlates directly with the alert. To search across all recent changes, open Changes Explorer and filter by time range, service, or event source. Change events enabled on dashboards and service pages also display inline without additional configuration.
Generate queries with natural language
If you’re unfamiliar with the affected service’s metrics or don’t know the exact PromQL or log query syntax, use natural language queries to describe what you’re looking for. Click Edit with AI in Metrics Explorer or Logs Explorer (or press Control+I, or Command+I on macOS) and enter a prompt such as:
error rate for checkout service in us-east-1 last hour
Observability Platform generates a query using semantic metric search to match relevant metrics by intent rather than exact name.
Step 4: Reduce noise while you work
Alerts that are triggering while you investigate can generate repeated notifications. Muting the right alerts without over-muting lets you work without distraction.
Create a targeted muting rule
From the alert details page, click Mute alert to open a muting rule pre-populated with the alert’s source and name. This approach is the safest way to mute, as the scope derives from the specific alert rather than relying on manual entry. Before saving, the Preview Alerts panel shows exactly which alerts the rule will silence. Review this list to confirm you’re not accidentally muting alerts from unrelated services.
To set a narrow scope, choose A Monitor rather than Time Series unless you have a specific reason to mute based on label matching. Set a duration that matches your expected resolution window. You can always extend or expire the rule early. For more granular control, such as muting only a specific pod or environment, use Time Series and define the label matcher. Use regular expression matching when you need to mute a pattern of related series.
After the incident resolves, expire the rule immediately from Alerting > Muting Rules rather than letting it run to its scheduled end time.
Verify notification routing
If you’re unsure whether notifications are reaching the right people, open the source monitor and click Test notifications to send a synthetic alert through the same routing logic as a real alert. This confirms that notifiers and templates are working as expected without waiting for another real event to fire.
Step 5: Collaborate and resolve
Use notebooks, comments, and resolution notes to share investigation context with other responders and create a permanent record of what happened and how you fixed it.
Gather evidence in a notebook
As you investigate, add charts, query results, and dashboard panels to a notebook to build a running evidence file. Click the Notebook icon in the page header or right sidebar to open one. From any panel or explorer result, click the three-dots menu and select Add to notebook, or drag panels directly from a dashboard into the open notebook. Notebooks support:
- Panels from dashboards, Metrics Explorer, Logs Explorer, DDx, alert details, SLO charts, service pages, monitors, and the metrics catalog
- A dedicated Add to notebook action in Logs Explorer that adds both a log volume histogram and the current results visualization
- Copying a URL from any resource page and pasting it into the notebook to embed a link card
- Per-panel time range overrides, so you can compare the same chart at different points in time
- Snapshots to freeze a panel’s data for long-term reference
- Panel editing to refine queries without leaving your evidence file
- Version history to review or restore earlier states of the notebook
Add comments for active collaboration
Comments let you annotate monitors, metrics, and change events with notes visible to anyone viewing that resource. From the monitor’s page, click + Add comment to leave notes for other responders working the same incident. Comments also persist on metrics in Metrics Explorer and on change events in Changes Explorer. Use them to record institutional knowledge such as “this metric spikes every Tuesday during batch processing” so the next responder doesn’t retrace the same investigation.
Document the resolution
When the alert resolves, add a resolution note on the alert details page. Resolution notes:
- Accept Markdown, so you can include code snippets, links, and structured text
- Associate with specific signals so notes for different affected services stay organized
- Persist after the alert closes, creating a searchable record for post-incident reviews
To add a resolution note:
- On the alert details page, click + Add in the Resolution notes section.
- Enter a description of what you found and what action you took.
- Click Create.
Annotate related change events
If a deployment or configuration change correlates with the alert, create a change event from the monitor page. Change events appear in time series charts across Observability Platform, making the correlation visible to anyone who views that time window in the future.
What to do after the incident
The incident response workflow doesn’t end at resolution. After closing the incident, consider taking the following actions to help other users remediate future issues with less investigation:
- Improve monitor signal quality: if the alert triggered but required significant manual triage, edit the monitor’s annotations to add runbook links, dashboard links, or data links to relevant traces for the next responder.
- Adjust thresholds: if the alert was noisy, triggering frequently on non-issues, review the monitor’s conditions and sustain duration to better reflect real risk.
- Review the alert history for the monitor to see whether this alert has triggered repeatedly, which might indicate a systemic issue worth addressing at the code or infrastructure level.
- Favorite the dashboards and monitors you used during the incident. Favorites appear on your personal home page and surface first in global search, reducing navigation time during future incidents.
- Add change events to mark the deployment or configuration change that caused the issue, if one hasn’t already been created. These markers persist on time series charts for future responders.

