Case study: responding to incidents - Chronosphere Documentation

When an alert triggers, time matters. This case study follows an on-call engineer through a real-world incident, from the initial page to resolution, demonstrating how Chronosphere Observability Platform’s tools connect to form a complete investigation workflow. For a step-by-step reference version of this case study, see the Incident response guide.

The summons

The phone blared at 2:17 AM. Not the polite chirp of a Slack mention or the quiet nudge of an email. This was the PagerDuty alarm, jarring Alex Reeves out of sleep. Knowing she was on call, she reached for the phone with the muscle memory of someone who has done this a hundred times before, squinting at the screen:

CRITICAL: checkout-service - Error rate exceeded 5% threshold for 3 minutes Monitor: checkout-error-budget-burn Collection: Commerce Platform Team: Payments

Alex was on her feet before she finished reading. Laptop open, password entered, Observability Platform loading in the dark.

The first thread

Alex’s personal home page confirmed the bad news: the Recently triggered alerts panel showed not one, but three alerts triggering in the last 10 minutes. The checkout error monitor was the oldest, and the others belonged to payment-gateway and order-service. Three fires, all at once. Alex clicked the checkout alert title, and the alert details page laid out the situation. The time series chart was unambiguous. At 2:14 AM, the error rate for checkout-service had jumped from its baseline of 0.3% to 7.2%, blowing past the 5% threshold. The condition showed > 5% with a sustain duration of 3m, confirming this wasn’t a flapping alert. The alert’s signals were grouped by environment and region: environment=production, region=us-east-1 was triggering, while us-west-2 remained calm. Whatever this was, it was regional. In the Additional information section, Alex found a link to the Commerce Checkout Dashboard and a runbook titled “Checkout Error Rate Playbook.” The dashboard link would provide context. But first, she needed to understand scope.
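The sustain duration is what separates a real breach from a flapping signal. A minimal sketch of that idea follows, assuming one error-rate sample per minute; the `Sample` type and the `first_trigger_minute` function are hypothetical names for illustration, and Observability Platform's own evaluation logic may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    minute: int        # minutes elapsed since the start of the window
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def first_trigger_minute(
    samples: Sequence[Sample], threshold: float, sustain_minutes: int
) -> Optional[int]:
    """Return the first minute at which the rate has stayed above the
    threshold for the whole sustain period, or None if it never does."""
    breach_start = None
    for sample in samples:
        if sample.error_rate > threshold:
            if breach_start is None:
                breach_start = sample.minute
            if sample.minute - breach_start >= sustain_minutes:
                return sample.minute
        else:
            # Any dip below the threshold resets the sustain clock.
            breach_start = None
    return None


# Illustrative data: minute 0 stands for 2:12 AM, so the jump to 7.2%
# lands at minute 2 (2:14 AM).
incident = [Sample(m, 0.003 if m < 2 else 0.072) for m in range(8)]
# A flapping signal that crosses the threshold every other minute.
flapping = [Sample(m, 0.072 if m % 2 else 0.003) for m in range(8)]

print(first_trigger_minute(incident, threshold=0.05, sustain_minutes=3))
print(first_trigger_minute(flapping, threshold=0.05, sustain_minutes=3))
```

With these assumed samples, the incident series triggers at minute 5, which corresponds to the 2:17 AM page, while the flapping series never triggers.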

Mapping the blast radius

Alex navigated to the service page for checkout-service. The Dependency map showed frontend upstream, calling into checkout-service, which fanned out to payment-gateway, inventory-service, and order-service downstream. The arrows on the edge between checkout-service and payment-gateway bore triple indicators: duration, errors, and request counts were all elevated. She clicked the payment-gateway node. The overlay showed:

Requests: 1,240/min (normal)
Errors: 312/min (elevated, normally ~8/min)
P99 latency: 4,200ms (normally ~180ms)

Something was deeply wrong between checkout and payments. But was payment-gateway the cause, or was the issue elsewhere? She set a pinned scope for environment=production and region=us-east-1, so every tool she opened next would carry that context without her having to re-enter it.

Differential diagnosis

Back on the alert details page, Alex clicked DDx on the error rate chart. Differential diagnosis scanned every label-value combination for checkout_http_requests_total{status=~"5.."} against the past hour’s baseline. The results appeared in seconds, ranked by divergence:

Label	Value	Divergence
`endpoint`	`/v2/process-payment`	94%
`deployment.version`	`v3.41.0`	89%
`pod`	`checkout-service-7f8b9c-*` (all pods)	12%

The /v2/process-payment endpoint was the epicenter. Alex also noted deployment.version=v3.41.0. That was new. She hadn’t seen that version before tonight. A deployment, rolled out while she slept. Alex added the DDx panel to her notebook by clicking the three-dots icon and selecting Add to notebook, naming it 2 AM Checkout Incident. The evidence trail had begun.
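The case study does not describe how DDx computes divergence. As a rough intuition only, the sketch below ranks label values by how much their share of error events shifts between a baseline window and the current window. The scoring rule, the function names, the `/v2/cart` endpoint, and the event counts are all assumptions made for illustration, not the product's algorithm or data.

```python
from collections import Counter


def share(events, label):
    """Fraction of events carrying each value of the given label."""
    counts = Counter(event.get(label) for event in events)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def rank_divergence(baseline, current, labels):
    """Rank (label, value) pairs by the absolute change in their share."""
    rows = []
    for label in labels:
        base, cur = share(baseline, label), share(current, label)
        for value in set(base) | set(cur):
            score = abs(cur.get(value, 0.0) - base.get(value, 0.0))
            rows.append((label, value, score))
    # Highest score first; label and value break ties deterministically.
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))


# Hypothetical error events: evenly spread before, concentrated after.
baseline = (
    [{"endpoint": "/v2/process-payment", "deployment.version": "v3.40.2"}] * 5
    + [{"endpoint": "/v2/cart", "deployment.version": "v3.40.2"}] * 5
)
current = (
    [{"endpoint": "/v2/process-payment", "deployment.version": "v3.41.0"}] * 9
    + [{"endpoint": "/v2/cart", "deployment.version": "v3.41.0"}] * 1
)

ranked = rank_divergence(baseline, current, ["endpoint", "deployment.version"])
for label, value, score in ranked:
    print(f"{label}={value}: {score:.2f}")
```

In this toy data set the version label shifts completely between windows, so it ranks above the endpoint label. A real comparison would also need to account for traffic volume and noise.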

Into the metrics

Alex clicked Open in explorer to carry the alert query into Metrics Explorer. The chart appeared pre-populated:

sum(rate(checkout_http_requests_total{status=~"5..", environment="production", region="us-east-1"}[5m]))
/
sum(rate(checkout_http_requests_total{environment="production", region="us-east-1"}[5m]))

Alex added a filter: endpoint="/v2/process-payment". The error rate jumped to 23% for that single endpoint, which was far worse than the aggregate. She used the Query Builder to split by status code and found the culprit: HTTP 503 Service Unavailable status codes dominated, with a smattering of HTTP 504 Gateway Timeout status codes. She added this chart to her notebook, the second piece of evidence.
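Filtering raises the ratio because the healthy endpoints no longer dilute the denominator. The sketch below reproduces that effect on a small list of request records. The record shape, the `/v2/cart` endpoint, and the counts are assumptions for illustration; they are not the incident's data. Status codes are integers here, whereas the query above matches a string label with a regular expression.

```python
from collections import Counter


def is_server_error(status):
    """True for HTTP 5xx status codes."""
    return 500 <= status <= 599


def select(requests, endpoint=None):
    """Keep all requests, or only those for one endpoint."""
    return [r for r in requests if endpoint is None or r["endpoint"] == endpoint]


def error_ratio(requests, endpoint=None):
    """5xx requests divided by all requests, optionally for one endpoint."""
    selected = select(requests, endpoint)
    if not selected:
        return 0.0
    errors = sum(1 for r in selected if is_server_error(r["status"]))
    return errors / len(selected)


def errors_by_status(requests, endpoint=None):
    """Count 5xx responses per status code."""
    return Counter(
        r["status"] for r in select(requests, endpoint) if is_server_error(r["status"])
    )


# Hypothetical traffic: errors are confined to one endpoint.
requests = (
    [{"endpoint": "/v2/process-payment", "status": 503}] * 4
    + [{"endpoint": "/v2/process-payment", "status": 504}] * 1
    + [{"endpoint": "/v2/process-payment", "status": 200}] * 15
    + [{"endpoint": "/v2/cart", "status": 200}] * 80
)

print(error_ratio(requests))
print(error_ratio(requests, endpoint="/v2/process-payment"))
print(errors_by_status(requests, endpoint="/v2/process-payment"))
```

The aggregate ratio in this example is 5%, while the single-endpoint ratio is 25%, the same kind of gap Alex saw between the overall error rate and the rate for /v2/process-payment.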

The dashboard reveals the pattern

Following the annotation link from the monitor, Alex opened the Commerce Checkout Dashboard. The dashboard showed panels for request rates, error breakdowns, latency percentiles, and an Upstream Dependencies panel showing the health of every service that checkout-service calls. The payment-gateway panel was solid red. Latency had spiked from 180 ms to over four seconds at 2:14 AM, and the error panel showed payment-gateway returning HTTP 503 Service Unavailable status codes to every caller. But on the dashboard she noticed something the alert details hadn’t shown: a vertical change event marker at 2:12 AM, two minutes before the errors began. She held her pointer over the marker:

Deploy: checkout-service v3.41.0 Source: ArgoCD Author: deploy-bot Region: us-east-1

There it was. A deployment at 2:12 AM. The errors followed two minutes later, about the time it takes for pods to roll through their readiness probes. Alex added the dashboard’s error panel to her notebook, capturing a snapshot so the data would persist even after the time window scrolled past.

Following the trace

Alex needed to understand why v3.41.0 was causing 503s. She returned to the service page and clicked Explore trace data in the dependency map, opening Trace Explorer pre-scoped to checkout-service. She defined her search:

Service: checkout-service
Status: Error
Tag: deployment.version = v3.41.0

Results flooded in, showing hundreds of failed traces in the last 15 minutes. She switched to the Span statistics tab and grouped by service and operation. The leaf errors (the deepest failing spans with no failing children) told the story: checkout-service showed errors on POST /v2/process-payment, but the leaf errors clustered on payment-gateway at the operation grpc.PaymentProcessor/ChargeCard. Alex clicked into a single trace. The flame graph rendered the request’s path: frontend called checkout-service, which called payment-gateway.ChargeCard. The ChargeCard span was red, 4,100 ms long, and terminated with:

error.message: "connection refused: payment-provider.internal:443"

The evidence suggested that the v3.41.0 deployment of checkout-service had changed the payment provider hostname, and that the new hostname wasn’t resolving in us-east-1. She ran trace DDx against the failing spans, comparing the current window to one hour ago. The results confirmed that deployment.version=v3.41.0 showed 89% divergence from baseline, and net.peer.name=payment-provider.internal was 100% correlated with errors. That tag didn’t exist in the previous version.
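Leaf errors, the deepest failing spans with no failing children, can be found with a single pass over a trace. The sketch below assumes a simplified span record with a parent pointer and an error flag. The `Span` type, the span IDs, and the frontend and inventory-service operation names are hypothetical, and Trace Explorer's implementation is not shown in this case study.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    operation: str
    error: bool


def leaf_errors(spans: List[Span]) -> List[Span]:
    """Return failing spans that have no failing child spans."""
    # Any span that is the parent of a failing span cannot be a leaf error.
    parents_of_failures = {
        span.parent_id for span in spans if span.error and span.parent_id is not None
    }
    return [
        span for span in spans
        if span.error and span.span_id not in parents_of_failures
    ]


# A trace shaped like the one in the flame graph; IDs are made up.
trace = [
    Span("a", None, "frontend", "GET /checkout", True),
    Span("b", "a", "checkout-service", "POST /v2/process-payment", True),
    Span("c", "b", "payment-gateway", "grpc.PaymentProcessor/ChargeCard", True),
    Span("d", "b", "inventory-service", "Reserve", False),
]

for span in leaf_errors(trace):
    print(span.service, span.operation)
```

Errors on frontend and checkout-service are excluded because each has a failing child, which leaves only the payment-gateway span as the leaf error.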

The logs confirm it

To confirm the DNS hypothesis, Alex opened Logs Explorer from the service page’s Logs link. Filtered to service=checkout-service and severity=ERROR, the logs appeared:

2026-06-02T06:15:42Z [ERROR] checkout-service/payment_client.go:87
  dial tcp: lookup payment-provider.internal on 10.96.0.10:53:
  no such host

This was repeated hundreds of times. The v3.41.0 deployment had introduced a new payment client that pointed to payment-provider.internal, a hostname that existed in the staging DNS but not in production’s us-east-1 zone. The old endpoint, payment-api.prod.internal, worked fine. This was a configuration error baked into the release. Alex added the log line to her notebook with a markdown note:

“Root cause: v3.41.0 references staging DNS hostname payment-provider.internal which doesn’t resolve in prod us-east-1.”

Silencing the noise

With a root cause identified, Alex needed breathing room. PagerDuty had paged her twice more during the investigation. She returned to the alert details page and clicked Mute alert, creating a muting rule scoped to the specific monitor with a 60-minute duration. The Preview Alerts panel confirmed it would silence only the checkout error alert, not the downstream payment-gateway alerts that her colleague on the Payments team might need to see.

The fix and the record

Alex knew the fix: roll back checkout-service to v3.40.2. She triggered the rollback through their deployment pipeline, then turned to documentation. On the monitor page, she clicked + Add comment to leave a comment for her Payments team colleague:

“Root cause identified: v3.41.0 introduced payment-provider.internal hostname which doesn’t exist in prod us-east-1 DNS. Rolled back to v3.40.2. The payment-gateway errors are downstream of this and should resolve within 5 minutes of rollback completing.”

She created a change event to mark the rollback:

Rollback: checkout-service v3.40.2 Reason: v3.41.0 DNS misconfiguration causing 503s

Within four minutes of the rollback completing, the error rate dropped back to 0.3%. The alert resolved itself. Alex added a resolution note on the alert details page:

“Root cause: checkout-service v3.41.0 deployment at 2:12 AM introduced a reference to payment-provider.internal, a hostname that resolves only in staging DNS. Production us-east-1 pods could not reach the payment provider, causing 503s on /v2/process-payment. Rolled back to v3.40.2 at 2:48 AM. Error rate returned to baseline by 2:52 AM.”

She shared her notebook URL in the incident Slack channel, expired the muting rule from Alerting > Muting Rules, and added the Commerce Checkout Dashboard as a favorite so it would surface faster next time.

After

At 3:01 AM, Alex closed her laptop. Forty-four minutes from page to resolution. Every step documented, every chart preserved in the notebook, the deployment marked with a change event that would warn the next engineer who looked at that time window. In the morning, the team would find her trail. The comment on the monitor, the resolution note on the alert, the notebook with its chain of evidence from DDx through Trace Explorer to the log line. The team would know what happened, and how she found it. The deployment pipeline would gain a DNS validation check before the next release reached production.
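The case study does not say what the DNS validation check would look like. One possible shape, sketched here as an assumption, is a pipeline step that tries to resolve every hostname a release references and fails the deploy if any lookup fails. The `unresolved_hosts` function is a hypothetical name, and the stand-in resolver with its placeholder IP address exists only so the example runs without network access.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.getaddrinfo):
    """Return the hostnames that fail DNS resolution.

    `resolve` is injectable so the check can be exercised offline.
    """
    missing = []
    for host in hostnames:
        try:
            resolve(host, 443)
        except socket.gaierror:
            missing.append(host)
    return missing


def fake_resolve(host, port):
    """Stand-in resolver that knows only the old production hostname."""
    known = {"payment-api.prod.internal"}
    if host not in known:
        raise socket.gaierror("no such host")
    # Placeholder address in the same tuple shape getaddrinfo returns.
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.1", port))]


referenced = ["payment-api.prod.internal", "payment-provider.internal"]
missing = unresolved_hosts(referenced, resolve=fake_resolve)
if missing:
    print("Blocking release, unresolved hostnames:", ", ".join(missing))
```

A check like this is only meaningful when it runs inside the target environment, because the incident's hostname resolved in staging but not in production us-east-1.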