View recording rules and monitor failures
Select from the following methods to view your recording rule or monitor failures.
- Web
- Chronoctl
- API
In the navigation menu, click Go to Admin, and then select Platform > Rule Status.
The Rule Status page displays. Select Monitors or Recording Rules to focus on a set of errors. Each type’s display includes the following:
- Time Frame: The period over which data is aggregated. Permanently set to the last 5 minutes.
- Total Monitors or Total Recording Rules: The number of this type of definition.
- Failing Monitors or Failing Recording Rules: The number of monitors or rules currently failing to execute.
- Go To Recording Rules: On the Recording Rules tab, this link goes to the recording rules page.
- Execution Status of the rule in the Time Frame.
- Monitor or Recording Rule name.
- Interval the rule evaluates at.
- #Errors shows the number of failed evaluations in the Time Frame.
- Error text explaining the failure.
- View Full Error Text: Review the error text in a dialog box.
- Copy Error Text: Copy the text of the error message.
- Go to Monitor: For monitor failures, go to the failed monitor.
Delete low value rules
Use the Telemetry Usage Analyzer to review metrics used in failing rules or monitors. If the metric is low value, deleting the failing rule or monitor might make more sense than fixing it.
Common failures and solutions
The following list provides some common errors and solutions to help you fix failing rules and monitors.
Vector contains metrics with the same label set after applying labels
Prometheus requires every time series returned from a monitor query or a recording rule to be unique, meaning the full set of label:value pairs must differ between any two time series. If time series have the same labels after applying alert or rule labels, a collision occurs.
Similar to Prometheus, Observability Platform applies monitor or recording rule labels to every returned time series, overriding any fetched label that has the same name.
This error message indicates a monitor or recording rule label:value collision. The following example shows how a rule label overrides a fetched label:
- Monitor or Recording Rule labels: {"service": "gateway"}
- Fetched time series labels: {"action": "http_get", "service": "ui-console"}
- Resulting time series: {"action": "http_get", "service": "gateway"}
In this example, the ui-console value is overridden to gateway after the monitor or recording rule labels apply. The error occurs when the collected values look like the following:
- Monitor or Recording Rule labels: {"service": "gateway"}
- Fetched time series 1: {"action": "http_get", "service": "ui-console"}
- Fetched time series 2: {"action": "http_get", "service": "backend-server"}
- Resulting time series 1: {"action": "http_get", "service": "gateway"}
- Resulting time series 2: {"action": "http_get", "service": "gateway"}
After applying the label {"service": "gateway"}, the resulting time series are an exact match, which causes an error.
Use one of the following methods to resolve the error:
- Use the PromQL label_replace function to change the name of the underlying label being overwritten.
- Remove the monitor or recording rule label.
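The collision and the first resolution can be sketched in a few lines of Python. This is a simplified model, not how Observability Platform or Prometheus implements label handling. The helper names are hypothetical, `copy_label` is only a rough stand-in for `label_replace(v, dst, "$1", src, "(.*)")`, and `source_service` is a label name invented for the illustration.

```python
from collections import Counter


def apply_rule_labels(series_labels, rule_labels):
    """Override fetched labels with the monitor or recording rule labels."""
    return {**series_labels, **rule_labels}


def colliding_label_sets(fetched, rule_labels):
    """Return label sets that appear more than once after rule labels apply."""
    counts = Counter(
        tuple(sorted(apply_rule_labels(labels, rule_labels).items()))
        for labels in fetched
    )
    return [dict(key) for key, count in counts.items() if count > 1]


def copy_label(labels, src, dst):
    """Rough stand-in for label_replace: copy the value of `src` into `dst`."""
    copied = dict(labels)
    if src in labels:
        copied[dst] = labels[src]
    return copied


rule_labels = {"service": "gateway"}
fetched = [
    {"action": "http_get", "service": "ui-console"},
    {"action": "http_get", "service": "backend-server"},
]

# Both series collapse to the same label set, which is the collision.
collisions = colliding_label_sets(fetched, rule_labels)

# Preserve the original value under a hypothetical label name first.
preserved = [copy_label(s, "service", "source_service") for s in fetched]
collisions_after_copy = colliding_label_sets(preserved, rule_labels)
```

Because the original value survives under a different label name, the two series remain distinct after the rule label overrides `service`.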
Found duplicate time series
The error message Found duplicate series for the match group indicates that two time series are being joined together, but the series don’t have the same labels. For example, one time series might have a host or instance label, while the other doesn’t.
Review the error message and identify the problematic labels.
Use one of the following methods to address this issue:
- Remove the labels from the offending metric. If the labels aren’t used in dashboards, monitors, recording rules, or queries, you can create a rollup rule to remove the labels from the metric.
- Update the query to exclude the problematic labels. If other resources use the labels, or you want to keep the labels for any other reason, update the query using PromQL aggregation operators, such as group, sum, or max, and use the without clause to exclude the labels. For example, group(test1) without (host, instance). Refer to the PromQL documentation for the behavior of each operator.
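To see why excluding labels helps, the following Python sketch models a join in a simplified way: series are grouped by the labels used for matching, and a group with more than one series on the same side is a duplicate. The function names, the `on` parameter, and the sample series are assumptions made for illustration, and `aggregate_without` only approximates an aggregation such as `max without (host, instance)`.

```python
from collections import defaultdict


def find_duplicate_groups(series, on):
    """Return match groups (keyed by the `on` labels) holding more than one series."""
    groups = defaultdict(list)
    for labels, _value in series:
        key = tuple((name, labels.get(name, "")) for name in on)
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}


def aggregate_without(series, drop, combine=max):
    """Approximate `max without (<drop>)`: remove labels, then merge values."""
    merged = defaultdict(list)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        merged[key].append(value)
    return [(dict(key), combine(values)) for key, values in merged.items()]


# One side of the join carries a host label that the other side lacks.
right_side = [
    ({"service": "gateway", "host": "node-a"}, 1.0),
    ({"service": "gateway", "host": "node-b"}, 3.0),
]

duplicates = find_duplicate_groups(right_side, on=["service"])

# Excluding the problematic labels leaves one series per match group.
fixed = aggregate_without(right_side, drop={"host", "instance"})
remaining = find_duplicate_groups(fixed, on=["service"])
```

Which aggregation to use (group, sum, or max) depends on what the merged value should mean, so the choice of `max` here is only a placeholder.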
Template errors
Invalid Prometheus query templates display errors like undefined variable "$labels".
Observability Platform attempts to parse your queries using Go template syntax. This error typically means you have a block that looks like {{ <your text here> }} somewhere in the raw query. Remove those blocks to fix this issue.
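A quick way to locate such blocks before saving a query is to scan for the template delimiters. The sketch below is a hypothetical helper, not part of Observability Platform, and it does not replicate Go template parsing. It only flags text between `{{` and `}}`, and the sample query is invented for illustration.

```python
import re

# Matches the shortest span that starts with "{{" and ends with "}}".
TEMPLATE_BLOCK = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def find_template_blocks(query):
    """Return every {{ ... }} block found in the raw query text."""
    return TEMPLATE_BLOCK.findall(query)


# Hypothetical query that would fail template parsing.
query = 'sum(rate(http_requests_total{service="{{ $labels.service }}"}[5m]))'
blocks = find_template_blocks(query)

# A query with ordinary label matchers has nothing to flag.
clean_blocks = find_template_blocks('sum(rate(http_requests_total{service="gateway"}[5m]))')
```

Single braces used for label matchers are not flagged, because the pattern requires the doubled delimiter.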
The query exceeded the allowable resource limit
Resource exhaustion occurs when a query requests more time series than system resources can support. For example, a query that returns millions of results exceeds the query scale protections that limit what the system can process. Use one of the following methods to address this issue:
- Reduce the number of time series returned by the query by adding more label filters. This might not return all of the results you need, so you might need to write multiple recording rules and then update your dashboards and monitors to use the appropriate metric.
- Use rollup rules to remove labels from metrics and aggregate values together, which reduces cardinality. Rollup rules can dramatically reduce the number of time series for a particular metric, which might let your query complete.
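The effect of removing a label on cardinality can be estimated with a short calculation. The following Python sketch uses invented label names and counts (three services, 100 pods each) and is only an approximation of what a rollup rule achieves, not a description of how rollup rules are configured.

```python
from itertools import product

# Hypothetical metric: one series per (service, pod) combination.
services = ["gateway", "ui-console", "backend-server"]
pods = [f"pod-{i}" for i in range(100)]
series = [{"service": s, "pod": p} for s, p in product(services, pods)]


def cardinality_without(series, drop):
    """Count the distinct label sets that remain after removing `drop` labels."""
    return len({
        tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        for labels in series
    })


before = cardinality_without(series, drop=set())
after = cardinality_without(series, drop={"pod"})
```

In this model, dropping the high-cardinality `pod` label reduces 300 series to 3, which is the kind of reduction that can bring a query back under the resource limit.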