Find and fix rule and monitor evaluation failures

Recording rules and monitors might not evaluate properly, leaving you with missing data or errors that weren’t caught.

Select from the following methods to view your recording rule or monitor failures.

In the navigation menu, click Go to Admin and then select Platform > Rule Status.

The Rule Status page displays. Select Monitors or Recording Rules to focus on a set of errors. Each tab displays the following:

Time Frame: The period over which data is aggregated. Permanently set to the last 5 minutes.
Total Monitors or Total Recording Rules: The number of this type of definition.
Failing Monitors or Failing Recording Rules: The number of monitors or rules currently failing to execute.
Go To Recording Rules: On the Recording Rules tab, this link goes to the recording rules page.

The page provides a table with the following information:

Execution Status of the rule in the Time Frame.
Monitor or Recording Rule name.
Interval the rule evaluates at.
#Errors shows the number of failed evaluations in the Time Frame.
Error text explaining the failure.

At the end of the line, click the three vertical dots icon and then a menu option to:

View Full Error Text: Review the error text in a dialog box.
Copy Error Text: Copy the text of the error message.
Go to Monitor: For monitor failures, go to the failed monitor.

Delete low value rules

Use the Telemetry Usage Analyzer to review metrics used in failing rules or monitors. If the metric is low value, deleting the failing rule or monitor might make more sense than fixing it.

Common failures and solutions

Here are some common errors and solutions to help you fix failing rules and monitors:

Prometheus runtime error: Vector contains metrics with the same labelset after applying labels
Prometheus runtime error: Found duplicate series for the match group
Prometheus runtime error: Template errors
Resources exhausted: The query exceeded the allowable resource limit
Query timed out: context deadline exceeded

Vector contains metrics with the same label set after applying labels

Prometheus requires every time series returned from a monitor query or a recording rule to be unique, meaning the full set of label:value pairs must differ between any two time series. If two time series have identical labels after the alert or rule labels are applied, a collision occurs.

Similar to Prometheus, Observability Platform takes monitor or recording rule labels and overrides the label pairs from all returned time series.

The following error message indicates a monitor or recording rule label:value collision:

Vector contains metrics with the same label set after applying alert (or rule) labels

For example:

Monitor or Recording Rule labels: {"service": "gateway"}
Fetched time series labels: {"action": "http_get", "service": "ui-console"}
Resulting time series: {"action": "http_get", "service": "gateway"}

In this instance, the service label value ui-console is overridden with gateway after the monitor or recording rule labels apply. The error occurs in a situation where collected values look like:

Monitor or Recording Rule labels: {"service": "gateway"}
Fetched time series 1: {"action": "http_get", "service": "ui-console"}
Fetched time series 2: {"action": "http_get", "service": "backend-server"}

Processing rewrites these time series to:

Resulting time series 1: {"action": "http_get", "service": "gateway"}
Resulting time series 2: {"action": "http_get", "service": "gateway"}

After applying the monitor or recording rule override {"service": "gateway"}, the resulting time series are an exact match, which causes an error.
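The collision described above can be reproduced with a small Python sketch. It is only an illustration of the override logic, not the platform's implementation; the helper names apply_rule_labels and find_collisions are hypothetical.

```python
from collections import Counter


def apply_rule_labels(series_labels, rule_labels):
    """Return a copy of the series labels with rule labels overriding matches."""
    return {**series_labels, **rule_labels}


def find_collisions(fetched, rule_labels):
    """Return the label sets that occur more than once after the override."""
    resulting = [
        tuple(sorted(apply_rule_labels(labels, rule_labels).items()))
        for labels in fetched
    ]
    counts = Counter(resulting)
    return [dict(labelset) for labelset, n in counts.items() if n > 1]


rule_labels = {"service": "gateway"}
fetched = [
    {"action": "http_get", "service": "ui-console"},
    {"action": "http_get", "service": "backend-server"},
]

collisions = find_collisions(fetched, rule_labels)
print(collisions)  # the single label set shared by both resulting series
```

Running a check like this against the label sets your query returns can show which rule label causes the duplicate before you change the rule.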

Use one of the following methods to resolve the error:

Use the Prometheus label_replace function to change the underlying label name being overwritten.
Remove the monitor or recording rule label.
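As a rough sketch of the first method, the following Python code emulates a simplified version of label_replace that copies the original service value into a new label before the rule label overrides it. The metric name http_requests_total and the label name source_service are assumptions chosen for illustration, and the emulation covers only the basic matching behavior of the real PromQL function.

```python
import re


def label_replace(series, dst_label, replacement, src_label, regex):
    """Simplified emulation of PromQL label_replace for a single label set."""
    match = re.fullmatch(regex, series.get(src_label, ""))
    if match is None:
        # As in PromQL, a non-matching series is returned unchanged.
        return dict(series)
    # PromQL uses $1-style group references; Python's expand() uses \g<1>.
    py_template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return {**series, dst_label: match.expand(py_template)}


rule_labels = {"service": "gateway"}
fetched = [
    {"action": "http_get", "service": "ui-console"},
    {"action": "http_get", "service": "backend-server"},
]

# Comparable PromQL (metric and destination label names are hypothetical):
#   label_replace(http_requests_total, "source_service", "$1", "service", "(.*)")
preserved = [
    label_replace(s, "source_service", "$1", "service", "(.*)") for s in fetched
]

# The rule label still overrides "service", but "source_service" keeps
# the resulting series distinct.
resulting = [{**s, **rule_labels} for s in preserved]
unique = {tuple(sorted(s.items())) for s in resulting}
print(len(unique) == len(resulting))
```

Because the original value survives under a different label name, the override no longer collapses the two series into one.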

Found duplicate time series

The error message Found duplicate series for the match group indicates that the query joins two sets of time series whose labels don't match. For example, one time series might have a host or instance label, while the other doesn't.

Review the error message and identify the problematic labels.

Use one of the following methods to address this issue:

Remove the labels from the offending metric. If the labels aren’t used in dashboards, monitors, recording rules, or queries, you can create a rollup rule to remove the labels from the metric.
Update the query to exclude the problematic labels. If other resources use the label, or you want to keep the labels for any other reason, update the query using PromQL functions, such as group, sum, or max, and use the without option to exclude the labels. For example, group(test1) without (host, instance). Refer to the PromQL documentation for the behavior of each function.
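To see why excluding labels helps, here is a minimal Python sketch of how series fall into match groups. It assumes a query that matches test1 against another metric on a hypothetical job label; the function names are illustrative and the code models only label sets, not sample values.

```python
from collections import defaultdict


def match_groups(series, on):
    """Group label sets by the values of the labels used for matching."""
    groups = defaultdict(list)
    for labels in series:
        key = tuple((name, labels.get(name, "")) for name in on)
        groups[key].append(labels)
    return groups


def duplicate_groups(series, on):
    """Return the match groups that contain more than one series."""
    return {
        key: members
        for key, members in match_groups(series, on).items()
        if len(members) > 1
    }


def aggregate_without(series, excluded):
    """Emulate `group(...) without (...)`: drop labels, merge identical sets."""
    merged = {}
    for labels in series:
        kept = {k: v for k, v in labels.items() if k not in excluded}
        merged[tuple(sorted(kept.items()))] = kept
    return list(merged.values())


# Two test1 series differ only by host, so both land in the same match group.
test1 = [{"job": "api", "host": "a"}, {"job": "api", "host": "b"}]

before = duplicate_groups(test1, on=["job"])
after = duplicate_groups(aggregate_without(test1, {"host", "instance"}), on=["job"])
print(len(before), len(after))  # 1 0
```

Before aggregation, the single match group holds two series, which is the condition the error reports. After dropping host and instance, each match group holds one series.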

Template errors

Invalid Prometheus query templates display errors like undefined variable "$labels".

Observability Platform attempts to parse your queries using Go template syntax. This error typically means you have a block that looks like {{ <your text here> }} somewhere in the raw query. Remove those blocks to fix this issue.
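A quick way to locate such blocks is to scan the raw query text for double-brace pairs. The following Python sketch does that with a regular expression; the sample query and metric name are hypothetical, and the pattern is a simple heuristic rather than a full Go template parser.

```python
import re

# Non-greedy match of anything between "{{" and the next "}}".
TEMPLATE_BLOCK = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def find_template_blocks(query):
    """Return every {{ ... }} block found in the raw query text."""
    return [m.group(0) for m in TEMPLATE_BLOCK.finditer(query)]


query = 'sum(rate(http_requests_total{service="{{ $labels.service }}"}[5m]))'
blocks = find_template_blocks(query)
print(blocks)
```

Any block the scan reports is a candidate for removal from the query.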

The query exceeded the allowable resource limit

Resource exhaustion occurs when a query requests more time series than system resources can support. For example, a query that returns millions of results exceeds the query scale protections that cap what the system can process.

Use one of the following methods to address this issue:

Reduce the number of time series returned by the query by adding more label filters. This might not return all of the results you need, so you might need to write multiple recording rules and then update your dashboards and monitors to use the appropriate metric.
Use rollup rules to remove labels from metrics and aggregate values together, which reduces cardinality. Rollup rules can dramatically reduce the number of time series for a particular metric, which might let your query complete.
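The effect of removing a label on cardinality can be illustrated with a short Python sketch. It is not how rollup rules are configured in Observability Platform; the pod label, the sample data, and the rollup_sum helper are assumptions used only to show how dropping a high-cardinality label and summing values shrinks the number of series.

```python
from collections import defaultdict


def rollup_sum(samples, drop):
    """Drop the given labels and sum values of series that become identical."""
    totals = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[kept] += value
    return dict(totals)


# 1,000 series that differ only by a hypothetical per-pod label.
samples = [({"service": "gateway", "pod": f"pod-{i}"}, 1.0) for i in range(1000)]

rolled = rollup_sum(samples, drop={"pod"})
print(len(samples), "->", len(rolled))
```

In this example, one aggregated series replaces a thousand per-pod series, which is the kind of reduction that can bring a query back under its resource limit.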

Context deadline exceeded

This error is functionally similar to queries exceeding the allowable resource limit. Correct these errors with the same solutions.
