Find and fix rule and monitor evaluation failures
Recording rules and monitors might not evaluate properly, leaving you with missing data or errors that weren't caught.
You can view your recording rule or monitor failures in Chronosphere Observability Platform, or with Chronoctl.
Click Go to Admin and then select Platform > Rule Status.
The Rule Status page displays. Select Monitors or Recording Rules to focus on a set of errors. The display for each type includes:
- Time Frame: The period over which data is aggregated. Permanently set to the last 5 minutes.
- Total Monitors or Total Recording Rules: The number of this type of definition.
- Failing Monitors or Failing Recording Rules: The number of monitors or rules currently failing to execute.
- Go To Recording Rules: On the Recording Rules tab, this link goes to the recording rules page.
The page provides a table with the following information:
- Execution Status of the rule in the Time Frame.
- Monitor or Recording Rule name.
- Interval the rule evaluates at.
- #Errors shows the number of failed evaluations in the Time Frame.
- Error text explaining the failure.
At the end of each row, click the three vertical dots icon, and then select a menu option to:
- View Full Error Text: Review the error text in a dialog box.
- Copy Error Text: Copy the text of the error message.
- Go to Monitor: For monitor failures, go to the failed monitor.
Delete low value rules
Use the Telemetry Usage Analyzer to review metrics used in failing rules or monitors. If the metric is low value, deleting the failing rule or monitor might make more sense than fixing it.
Common failures and solutions
Here are some common errors and solutions to help you fix failing rules and monitors:
- Prometheus runtime error: Vector contains metrics with the same labelset after applying labels
- Prometheus runtime error: Found duplicate series for the match group
- Prometheus runtime error: Template errors
- Resources exhausted: The query exceeded the allowable resource limit
- Query timed out: context deadline exceeded
Vector contains metrics with the same labelset after applying labels
Prometheus requires that all time series returned from a monitor query or a recording rule be fully unique, meaning that no two time series can share the same complete set of label:value pairs. If metrics have the same labels after applying alert or rule labels, a collision occurs.
Similar to Prometheus, Observability Platform takes monitor or recording rule labels and overrides the label pairs from all returned time series.
The error message Vector contains metrics with the same labelset after applying alert (or rule) labels indicates a monitor or recording rule label:value collision.
For example:
- Monitor or Recording Rule labels:
{"service": "gateway"}
- Fetched time series labels:
{"action": "http_get", "service": "ui-console"}
- Resulting time series:
{"action": "http_get", "service": "gateway"}
In this instance, the service label value ui-console is overridden to gateway after the monitor or recording rule labels apply. The error occurs in a situation where collected values look like:
- Monitor or Recording Rule labels:
{"service": "gateway"}
- Fetched time series 1:
{"action": "http_get", "service": "ui-console"}
- Fetched time series 2:
{"action": "http_get", "service": "backend-server"}
Processing rewrites these time series to:
- Resulting time series 1:
{"action": "http_get", "service": "gateway"}
- Resulting time series 2:
{"action": "http_get", "service": "gateway"}
After applying the monitor or recording rule override {"service": "gateway"}, the resulting time series are an exact match, which causes an error.
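The collision described above can be reproduced offline with a minimal Python sketch. It is not Observability Platform or Prometheus code; the helper names `apply_rule_labels` and `find_collisions` are hypothetical, and the sketch assumes rule labels simply overwrite any fetched label with the same name.

```python
from collections import defaultdict


def apply_rule_labels(series_labels, rule_labels):
    """Return a copy of the series labels with rule labels overriding matches."""
    merged = dict(series_labels)
    merged.update(rule_labels)
    return merged


def find_collisions(fetched_series, rule_labels):
    """Group fetched series by their label set after the override and
    return only the groups that contain more than one series."""
    groups = defaultdict(list)
    for labels in fetched_series:
        key = frozenset(apply_rule_labels(labels, rule_labels).items())
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Values taken from the example above.
rule_labels = {"service": "gateway"}
fetched = [
    {"action": "http_get", "service": "ui-console"},
    {"action": "http_get", "service": "backend-server"},
]

collisions = find_collisions(fetched, rule_labels)
for key, members in collisions.items():
    # Both fetched series collapse into the same resulting label set.
    print(sorted(key), "<-", members)
```

A non-empty result means at least two fetched series become identical once the rule labels apply, which is the condition that triggers the error.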
Solutions
Use one of the following methods to resolve the error:
- Use the Prometheus label_replace function to change the underlying label name being overwritten.
- Remove the monitor or recording rule label.
Found duplicate time series
The error message Found duplicate series for the match group indicates that two time series are being joined together, but the series don't have the same labels. For example, one time series might have a host or instance label, while the other doesn't.
Review the error message and identify the problematic labels.
Solutions
Use one of the following methods to address this issue:
- Remove the labels from the offending metric. If the labels aren't used in dashboards, monitors, recording rules, or queries, you can create a rollup rule to remove the labels from the metric.
- Update the query to exclude the problematic labels. If other resources use the labels, or you want to keep the labels for any other reason, update the query using PromQL aggregation operators, such as group, sum, or max, and use the without clause to exclude the labels. For example, group(test1) without (host, instance). Refer to the PromQL documentation for the behavior of each operator.
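To see why excluding labels resolves the error, the following Python sketch models a simplified form of one-to-one matching. It is not how Prometheus implements vector matching; the function names, the `job` label, and the sample series are assumptions for illustration. Series on one side of a join are grouped by the labels used for matching, and any group with more than one member corresponds to the duplicate series condition.

```python
from collections import defaultdict


def signature(labels, on):
    """Build the match-group key from the labels used for matching."""
    return tuple((name, labels.get(name, "")) for name in sorted(on))


def duplicated_groups(series, on):
    """Return match groups that contain more than one series."""
    groups = defaultdict(list)
    for labels in series:
        groups[signature(labels, on)].append(labels)
    return {sig: members for sig, members in groups.items() if len(members) > 1}


def aggregate_without(series, dropped):
    """Mimic the grouping done by a `without (...)` clause:
    drop the listed labels, then merge series whose label sets are identical."""
    unique = {
        frozenset((k, v) for k, v in labels.items() if k not in dropped)
        for labels in series
    }
    return [dict(items) for items in unique]


# Hypothetical series on one side of a join matched on the job label.
one_side = [
    {"job": "api", "host": "a"},
    {"job": "api", "host": "b"},
]

before = duplicated_groups(one_side, on=["job"])
after = duplicated_groups(
    aggregate_without(one_side, {"host", "instance"}), on=["job"]
)
print(len(before), len(after))
```

Before aggregation, both series fall into the same match group. After dropping host and instance, only one series per group remains, so the join has a single candidate to pair with.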
Template errors
Invalid Prometheus query templates display errors like undefined variable "$labels".
Solutions
Observability Platform attempts to parse your queries using Go template syntax. This error typically means you have a block that looks like {{ <your text here> }} somewhere in the raw query. Remove those blocks to fix this issue.
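If a query is long, a quick scan for template blocks can help locate them. The following Python sketch is one possible way to do this; the `find_template_blocks` helper and the sample query, including the metric name `http_requests_total`, are hypothetical, and the regular expression only approximates Go template delimiters.

```python
import re

# Matches anything wrapped in double curly braces, such as {{ $labels.service }}.
TEMPLATE_BLOCK = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def find_template_blocks(query):
    """Return (offset, text) pairs for each template-like block in the query."""
    return [(m.start(), m.group(0)) for m in TEMPLATE_BLOCK.finditer(query)]


# Hypothetical query containing a template block inside a label matcher.
query = 'sum(rate(http_requests_total{service="{{ $labels.service }}"}[5m]))'

blocks = find_template_blocks(query)
for offset, text in blocks:
    print(f"template block at offset {offset}: {text}")
```

Single curly braces used for PromQL label matchers don't match the pattern, so only the double-brace blocks are reported.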
The query exceeded the allowable resource limit
Resource exhaustion occurs when a query requests more time series than system resources can support. For example, a query that returns millions of results can't be processed.
Solutions
Use one of the following methods to address this issue:
- Reduce the number of time series returned by the query by adding more label filters. This might not return all of the results you need, so you might need to write multiple recording rules and then update your dashboards and monitors to use the appropriate metric.
- Observability Platform provides rollup rules you can use to remove labels from metrics and aggregate values together, reducing cardinality. Rollup rules can dramatically reduce the number of time series for a particular metric, which might let your query complete.
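The effect of removing a label on the number of time series can be estimated with a short Python sketch. This is only an illustration of the cardinality arithmetic, not a rollup rule definition; the `count_series` helper, the `pod` label, and the generated series are assumptions.

```python
def count_series(series, dropped=()):
    """Count distinct label sets after removing the dropped labels."""
    return len(
        {
            frozenset((k, v) for k, v in labels.items() if k not in dropped)
            for labels in series
        }
    )


# Hypothetical metric: 2 services, each reporting from 50 pods.
series = [
    {"service": service, "pod": f"pod-{i}"}
    for service in ("gateway", "ui-console")
    for i in range(50)
]

before = count_series(series)                  # one series per service and pod
after = count_series(series, dropped={"pod"})  # one series per service
print(before, after)
```

In this sketch, dropping the pod label reduces 100 series to 2, which shows how aggregating away a high-cardinality label can bring a query back under the limit.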
Context deadline exceeded
This error is functionally similar to queries exceeding the allowable resource limit. Correct these errors with the same solutions.