Troubleshooting missing metrics
Use the following information to help you understand and fix issues with your data not displaying as expected.
Dips in your metrics graphs
Chronosphere relies on two data points to emit aggregated metrics on counters. When a change interrupts the data stream, Chronosphere has only one data point to operate on in a given interval. In this instance, Chronosphere uses a null value for the previous data point, which causes dips in the graphs using those counters.
Data stream changes that cause gaps in collected data can occur for several reasons, including:
- A deploy changes the set of instances.
- An aggregator or another part of the collection process fails unexpectedly.
- A rollup rule for a counter is modified.
- It's the first scrape of a system that has been running for a while.
- Data points arrive less frequently than the in-memory buffer window. For example, the buffer is 10 minutes and data is scraped every 11 minutes.
This graphic shows the graphing behavior using a null value. This behavior is consistent with the expected behavior of Prometheus.
Sparse time series
Aggregating data on sparse time series over long lookback windows can produce inconsistent results compared to querying raw data for the same period.
The aggregator has a 10-minute time to live (TTL) for time series. If a time series has data points arriving less frequently than that 10-minute window, the aggregator can't provide results with the same fidelity as querying the raw data.
For example, Chronosphere receives a data point at t0 and another at t-30. At t0, the aggregator can't compare the data point to the one that arrived at t-30, so it records an increase as if the t-30 data point were null. However, the raw, non-aggregated data has records for the data points at both t0 and t-30.
If a query looks for a rate across the raw data, the results include both the accurate t0 and t-30 data points, and the query supplies the correct answer. Queries based on rollup data retrieve some correct data points and others that are misleading because of the seemingly missing data, and might return incorrect results.
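As a minimal sketch of that comparison, assume a counter named my_sparse_requests_total that's scraped roughly every 30 minutes, and a rollup rule that writes its output to a metric named my_sparse_requests_total:rollup (both names are placeholders):

```promql
# Raw data: both the t0 and t-30 samples fall inside the range window,
# so the computed increase is correct.
increase(my_sparse_requests_total[1h])

# Rollup output (placeholder name): samples arriving less often than the
# aggregator's 10-minute TTL can appear missing, so the result can be misleading.
increase(my_sparse_requests_total:rollup[1h])
```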
Using a sparse metric in an arithmetic operation results in nil
In PromQL, combining time series with an arithmetic operation where one operand is nil results in the entire operation returning nil. A common pattern where this causes issues is calculating error rates. For a query like (total - failure) / total, instead of returning a value when there are no failures, the query returns nil.
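For example, the following sketch uses hypothetical counters requests_total and requests_failed_total. The second query shows a common PromQL workaround, not a platform-specific feature, that defaults the possibly missing operand to zero:

```promql
# If no failure series exists, the subtraction has a missing operand and the
# whole expression returns no result.
(sum(rate(requests_total[5m])) - sum(rate(requests_failed_total[5m])))
  / sum(rate(requests_total[5m]))

# Workaround: default the possibly absent operand to 0 with "or vector(0)".
(sum(rate(requests_total[5m])) - (sum(rate(requests_failed_total[5m])) or vector(0)))
  / sum(rate(requests_total[5m]))
```

The or operator keeps the left-hand result when it exists and falls back to vector(0) when it doesn't, which works here because sum() without a by clause removes all labels from both operands.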
Push-based metrics issues
Data that appears to have sparse time series can be due to latency in metric arrival or to long-term downsampling.
Counter metrics often don't start with a zero (0) value, because the counter generally exists before it begins pushing data to Chronosphere. When this occurs, Chronosphere needs to wait for multiple data points to arrive before it can return an accurate measurement.
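As an illustration (the metric name pushed_requests_total is a placeholder), PromQL functions that measure change over a counter need at least two samples in the range window before they return a result:

```promql
# A counter pushed for the first time typically arrives with a non-zero value,
# so a single sample can't establish an increase. increase() returns a result
# only after at least two samples exist in the window.
increase(pushed_requests_total[10m])
```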
Query for missing metrics
To visualize when metrics aren't emitting from a Prometheus endpoint, you can use a specific combination of PromQL functions to build a repeatable query pattern. This can be helpful when attempting to find flapping endpoints or endpoints that have gone down.
Query components
Here are the main functions and operators that make up this query pattern:
- timestamp(): returns the timestamp of each sample as its value, in seconds since the Unix epoch.
- rate(): calculates the per-second average rate of increase of a counter over the range window.
- sum by (): aggregates series, grouping the results by the listed labels.
- unless: returns the series from the left-hand operand that have no matching series in the right-hand operand.
- offset: evaluates a selector at a fixed time in the past, relative to the query evaluation time.
For a complete list of functions, visit the PromQL function documentation.
Query
This example, which uses all of the previous PromQL functions and operators, examines metrics emitted by the endpoint device and pushed to the Collector push endpoint. The test time series metric is node_load1.
timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m]))) unless(timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m] offset 5m))))
The query identifies a time series, and then uses the unless operator to take the complement against a second copy of the time series offset by five minutes.
If the metric is currently emitting and was also emitting five minutes in the past, the complement is taken and no point is plotted. However, if metrics aren't currently being emitted but were emitting five minutes in the past, the complement is plotted.
Example query pattern
(
timestamp(someMetric{})
unless
timestamp(someMetric{} offset 5m)
)
Graph
In this graph, the empty portions are where metrics are being ingested. The filled-in portions indicate where metrics aren't being sent. The length of the line indicates the duration of time where metrics weren't sent.
Alert
To send an alert when metrics are no longer being emitted, look for values greater than zero.
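For example, a minimal alert expression based on the example query pattern above (someMetric is a placeholder) fires whenever the query returns a result, because timestamp() values are Unix timestamps and always greater than zero:

```promql
(
  timestamp(someMetric{})
  unless
  timestamp(someMetric{} offset 5m)
) > 0
```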
Managing invalid metrics
Observability Platform drops time series when they violate the following validation checks.
When Observability Platform drops an invalid series, use the following dashboards to review information about dropped data:
- The Chronosphere Health Check dashboard includes a panel that shows the rate of rejected data points from all ingestion sources, displayed by the rejection reason.
- The OpenTelemetry Ingestion & Health dashboard includes a panel showing the rate of rejected OTLP metrics data points, displayed by reason.
You can also run the following query in Metrics Explorer to review rejected time series data points.
sum by (source, reason) (rate(metrics_api_data_points_rejected{}[5m]))
This query returns the rate of rejected data points by ingestion source and rejection reason. For example:
| Name | Total | Max | Avg | Last |
|---|---|---|---|---|
| {reason="label_count_too_high",source="open_telemetry"} | 17K | 130.5 | 70.7 | 73.8 |
| {reason="label_count_too_high",source="prometheus"} | 57.7 | 0.281 | 0.24 | 0.225 |
| {reason="label_name_invalid",source="open_telemetry"} | 78.7 | 0.411 | 0.327 | 0.358 |
Exceeding size limits
Chronosphere limits the size of some metric components:
- Label names: 512 characters
- Label values: 1,024 characters
- Maximum number of labels: 64
- Total time series size: 4,096 bytes
Invalid characters
Metrics whose names or labels contain special characters that don't match Prometheus naming conventions might be dropped at ingestion.
Timestamp in the past
Observability Platform might reject metrics that are too far in the past. See late arriving metrics for more information about the time period in which Observability Platform can accept late metrics.
Missing OpenTelemetry attribute
Observability Platform requires the service.instance.id attribute for all OpenTelemetry metric time series to ensure metric writer uniqueness. For more information, see the OpenTelemetry documentation regarding a single logical writer.
Observability Platform rejects metrics without a service.instance.id resource attribute. To configure a value for this attribute, follow the recommendations for mapping resource attributes to a Prometheus job and instance.
Unsupported metric types
Observability Platform supports all Prometheus and OpenTelemetry metric types, except for the OpenTelemetry non-monotonic delta sum metric type. See metric types for information about supported metric types.