The summons
The phone blared at 2:17 AM. Not the polite chirp of a Slack mention or the quiet nudge of an email. This was the PagerDuty alarm, jarring Alex Reeves out of sleep. Knowing she was on call, she reached for the phone with the muscle memory of someone who has done this a hundred times before, squinting at the screen:CRITICAL: checkout-service - Error rate exceeded 5% threshold for 3 minutes Monitor: checkout-error-budget-burn Collection: Commerce Platform Team: PaymentsAlex was on her feet before she finished reading. Laptop open, password entered, Observability Platform loading in the dark.
The first thread
Alex’s personal home page confirmed the bad news: the Recently triggered alerts panel showed not one, but three alerts triggering in the last ten minutes. The checkout error monitor was the oldest, and the others belonged topayment-gateway and order-service. Three fires, all at once.
Alex clicked the checkout alert title, and the
alert details page showed the situation.
The time series chart was unambiguous. At 2:14 AM, the error rate for
checkout-service had jumped from its baseline of 0.3% to 7.2%, blowing past the 5%
threshold. The condition showed > 5% for sustain duration 3m, confirming this
wasn’t a flapping alert. The signals
grouped the alert by environment=production, region=us-east-1, while us-west-2
remained calm. Whatever this was, it was regional.
In the annotations section, Alex
found a link to the Commerce Checkout Dashboard and a runbook titled
“Checkout Error Rate Playbook.” The dashboard link would provide context.
But first, she needed to understand scope.
Mapping the blast radius
Alex navigated to the service page forcheckout-service. The Dependency map showed frontend upstream, calling
into checkout-service, which fanned out to payment-gateway,
inventory-service, and order-service downstream. The arrows on the edge
between checkout-service and payment-gateway bore triple indicators:
duration, errors, and request counts were all elevated.
She clicked the payment-gateway node. The overlay showed:
- Requests: 1,240/min (normal)
- Errors: 312/min (elevated, normally ~8/min)
- P99 latency: 4,200ms (normally ~180ms)
payment-gateway the cause, or was the issue elsewhere?
She set a pinned scope for
environment=production and region=us-east-1, so every tool she opened next
would carry that context without her having to re-enter it.
Differential diagnosis
Back on the alert details page, Alex clicked DDx on the error rate chart. Differential diagnosis scanned every label-value combination forcheckout_http_requests_total{status=~"5.."} against the past hour’s baseline.
The results appeared in seconds, ranked by divergence:
| Label | Value | Divergence |
|---|---|---|
endpoint | /v2/process-payment | 94% |
deployment.version | v3.41.0 | 89% |
pod | checkout-service-7f8b9c-* (all pods) | 12% |
/v2/process-payment endpoint was the epicenter. Alex also noted
deployment.version=v3.41.0. That was new. She hadn’t seen that version before
tonight. A deployment, rolled out while she slept.
Alex added the DDx panel to her notebook by clicking the
three-dots icon and selecting Add to notebook, naming it
2 AM Checkout Incident. The evidence trail had begun.
Into the metrics
Alex clicked Open in explorer to carry the alert query into Metrics Explorer. The chart appeared pre-populated:endpoint="/v2/process-payment". The error rate jumped to 23%
for that single endpoint, which was far worse than the aggregate. She used the
Query Builder to split by status
code and found the culprit: HTTP 503 Service Unavailable status codes
dominated, with a smattering of HTTP 504 Gateway Timeout status codes.
She added this chart to her notebook, the second piece of evidence.
The dashboard reveals the pattern
Following the annotation link from the monitor, Alex opened the Commerce Checkout Dashboard. The dashboard showed panels for request rates, error breakdowns, latency percentiles, and an Upstream Dependencies panel showing the health of every service thatcheckout-service calls.
The payment-gateway panel was solid red. Latency had spiked from 180 ms to
over four seconds at 2:14 AM, and the error panel showed payment-gateway
returning HTTP 503 Service Unavailable status codes to every caller. But
on the dashboard she noticed something the alert details hadn’t shown: a
vertical change event marker at
2:12 AM, two minutes before the errors began.
She held her pointer over the marker:
Deploy: checkout-service v3.41.0 Source: ArgoCD Author: deploy-bot Region: us-east-1There it was. A deployment at 2:12 AM. The errors followed two minutes later at exactly the time it takes for pods to roll through their readiness probes. Alex added the dashboard’s error panel to her notebook, capturing a snapshot so the data would persist even after the time window scrolled past.
Following the trace
Alex needed to understand whyv3.41.0 was causing 503s. She returned to the
service page and clicked Explore trace data in the dependency map, opening
Trace Explorer pre-scoped to checkout-service.
She defined her search:
- Service:
checkout-service - Status: Error
- Tag:
deployment.version = v3.41.0
checkout-service showed errors on POST /v2/process-payment, the leaf errors (the
deepest failing spans with no failing children) clustered on payment-gateway at the
operation grpc.PaymentProcessor/ChargeCard.
Alex clicked into a single trace. The flame graph rendered the request’s path:
frontend called checkout-service, which called payment-gateway.ChargeCard. The
ChargeCard span was red, 4,100 ms long, and terminated with:
error.message: "connection refused: payment-provider.internal:443"
The checkout service’s v3.41.0 deployment had changed the payment provider
URL, and the new URL wasn’t resolving in us-east-1.
She ran trace DDx against the
failing spans, comparing the current window to one hour ago. The results confirmed
that deployment.version=v3.41.0 showed 89% divergence from baseline, and
net.peer.name=payment-provider.internal was 100% correlated with errors. That tag
didn’t exist in the previous version.
The logs confirm it
To confirm the DNS hypothesis, Alex opened Logs Explorer from the service page’s Logs link. Filtered toservice=checkout-service and severity=ERROR, the
logs appeared:
v3.41.0 deployment had
introduced a new payment client that pointed to payment-provider.internal,
a hostname that existed in the staging DNS but not in production’s
us-east-1 zone. The old endpoint, payment-api.prod.internal, worked
fine. This was a configuration error baked into the release.
Alex added the log line to her notebook with a markdown note:
“Root cause: v3.41.0 references staging DNS hostname payment-provider.internal
which doesn’t resolve in prod us-east-1.”
Silencing the noise
With a root cause identified, Alex needed breathing room. PagerDuty had paged her twice more during the investigation. She returned to the alert details page and clicked Mute alert, creating a muting rule scoped to the specific monitor with a 60-minute duration. The Preview Alerts panel confirmed it would silence only the checkout error alert, not the downstream payment-gateway alerts that her colleague on the Payments team might need to see.The fix and the record
Alex knew the fix: rollbackcheckout-service to v3.40.2. She triggered the
rollback through their deployment pipeline, then turned to documentation.
On the monitor page, she clicked + Add comment to leave a
comment for her Payments team colleague:
“Root cause identified: v3.41.0 introduced payment-provider.internal
hostname which doesn’t exist in prod us-east-1 DNS. Rolled back to v3.40.2.
The payment-gateway errors are downstream of this and should resolve within 5
minutes of rollback completing.”
She created a change event
to mark the rollback
Rollback: checkout-service v3.40.2 Reason: v3.41.0 DNS misconfiguration causing 503sWithin four minutes of the rollback completing, the error rate dropped back to 0.3%. The alert resolved itself. Alex added a resolution note on the alert details page:
“Root cause: checkout-service v3.41.0 deployment at 2:12 AM introduced a reference toShe shared her notebook URL in the incident Slack channel, expired the muting rule from Alerting > Muting Rules, and added the Commerce Checkout Dashboard as a favorite so it would surface faster next time.payment-provider.internal, a hostname that resolves only in staging DNS. Production us-east-1 pods could not reach the payment provider, causing 503s on/v2/process-payment. Rolled back to v3.40.2 at 2:48 AM. Error rate returned to baseline by 2:52 AM.”

