Troubleshooting missing metrics
Use the following information to understand and fix issues when your data doesn't display as expected.
Dips in your metrics graphs
This feature isn't available to all Chronosphere Observability Platform users and might not be visible in your app. For information about enabling this feature in your environment, contact Chronosphere Support.
Chronosphere relies on two data points to emit aggregated metrics on counters. When a change interrupts the data stream, Chronosphere has only one data point to operate on in a given interval. In that case, Chronosphere uses a null value for the previous data point, which causes dips in graphs that use those counters.
Data stream changes that cause gaps in data collection occur for several reasons, including:
- A deploy (instance changes).
- An aggregator or another part of the collection process fails unexpectedly.
- A rollup rule for a counter is modified.
- The first scrape of a system that has been running for a while.
- The data is sparser than the in-memory buffer. For example, the buffer holds 10 minutes of data, but scrapes arrive every 11 minutes.
This graphic shows the graphing behavior when a null value is used, which is consistent with the expected behavior of Prometheus.
Sparse time series
Aggregated data on sparse time series over long look back windows can produce inconsistent results when compared to querying raw data for the same period.
The aggregator has a 10-minute time to live (TTL) for time series. If a time series has data points arriving less frequently than the 10-minute window, the aggregator can't provide results with the same fidelity as querying the raw data.
For example, Chronosphere receives a data point at t0 and another at t-30. At t0, the aggregator can't compare the data point to the one that arrived at t-30, so it records an increase as if the t-30 data point were null. However, the raw, non-aggregated data retains records for the data points at both t0 and t-30.
If a query computes a rate across the raw data, the results include both the accurate t0 and t-30 data points and supply the correct answer. Queries based on rollup data retrieve some correct data points and others that are misleading because of the seemingly missing data, and might be incorrect.
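When aggregated results for a sparse series look wrong, one way to cross-check is to query the raw, unaggregated metric with a range wider than the arrival gap. This is a sketch using a hypothetical metric name and arrival interval:

```promql
# sparse_metric_total is a hypothetical counter that reports roughly
# every 30 minutes. A 1h range lets rate() see at least two data
# points, which the aggregator's 10-minute TTL cannot, so the raw
# query returns a correct rate.
rate(sparse_metric_total[1h])
```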
Using a sparse metric in an arithmetic operation results in nil
In PromQL, combining time series with arithmetic operations where one operand is nil results in the entire operation returning nil. A common pattern where this causes issues is calculating error rates. With a query like (total - failure) / total, instead of returning a value when there are no failures, nil is returned.
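One common workaround, sketched here with hypothetical total and failure series, is to substitute a zero-valued series when the failure operand is absent, using the or set operator:

```promql
# When no failures exist, `failure` returns nothing, so the whole
# expression would return nil. `failure or (total * 0)` falls back to
# a zero-valued series carrying total's labels, so the division still
# returns a result (1, that is, a 100% success rate) when there are
# no failures.
(total - (failure or (total * 0))) / total
```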
Push based metrics issues
What looks like a sparse time series can be due to latency in metric arrival or long-term downsampling.
Counter metrics often don't start with a zero (0) value, because they generally exist before pushing data to Chronosphere. When this occurs, Chronosphere must wait for multiple data points to arrive before it can return an accurate measurement.
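The same requirement is visible in PromQL itself: rate() and increase() need at least two samples in the range before they return anything. With a hypothetical, freshly created pushed counter:

```promql
# my_pushed_counter_total is a hypothetical pushed counter. Until a
# second data point arrives in the 5-minute window, this expression
# returns no result; the counter's non-zero starting value alone
# isn't enough to measure an increase.
increase(my_pushed_counter_total[5m])
```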
Query for missing metrics
To visualize when metrics aren't emitting from a Prometheus endpoint, you can use a specific combination of PromQL functions to build a repeatable query pattern. This can be helpful when attempting to find flapping endpoints or endpoints that have gone down.
Query components
The query pattern combines these PromQL functions and operators: timestamp(), rate(), sum by (), the unless set operator, and the offset modifier. For a complete list of functions, visit the PromQL function documentation.
Query
This example, which implements all of the previous PromQL functions, examines metrics emitted by the endpoint device and pushed to the Collector push endpoint. The test time series metric used is node_load1.
timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m] offset 5m))) unless (timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m]))))
The query identifies a time series and uses the unless operator to take the complement against the same series offset by five minutes.
If the metric is currently emitting and was also emitting five minutes in the past, the complement removes the point and nothing is plotted. However, if the metric isn't currently emitting but was emitting five minutes in the past, a point is plotted.
Example query pattern
(
timestamp(someMetric{} offset 5m)
unless
timestamp(someMetric{})
)
Graph
In this graph, the empty portions are where metrics are being ingested. The filled-in portions indicate where metrics aren't being sent, and the length of the line indicates how long metrics weren't sent.
Alert
To send an alert when metrics are no longer being emitted, look for values greater than zero.
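As a sketch, applying the pattern to the node_load1 example above, an alert expression can compare the result against zero. Because timestamp() returns seconds since the epoch, any returned sample is greater than zero and can fire the alert:

```promql
# Returns a sample only when node_load1 was emitting five minutes ago
# but isn't emitting now; every returned timestamp() value is
# positive, so the > 0 comparison holds whenever a series stops.
(
  timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m] offset 5m)))
  unless
  timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m])))
) > 0
```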
Managing invalid metrics
This feature isn't available to all Chronosphere Observability Platform users and might not be visible in your app. For information about enabling this feature in your environment, contact Chronosphere Support.
At ingest time, Chronosphere drops time series it considers invalid. Dropping these series prevents performance degradation and query failures. When an invalid series is dropped, Chronosphere provides information about that series to both you and the customer success team.
Users can view their total rejected-invalid metrics in the Usage dashboard. The existing Current usage panel of the dashboard shows rates of different outcomes for metrics sent by Collectors, such as ingested, dropped (drop policy), dropped (invalid), and persisted.
The dashboard also includes a section for Invalid stats, showing how many invalid metrics were received, broken down by usage tag.
During ingestion, Chronosphere drops time series for several reasons, including if size limits are exceeded or if metrics contain invalid characters.
Exceeding size limits
Chronosphere limits the size of some metric components:
- Label names: 512 characters
- Label values: 1,024 characters
- Maximum number of labels: 64
- Total time series size: 4,096 bytes
Invalid characters
Metrics that don't match Prometheus naming conventions might be dropped at ingestion due to special characters.
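For reference, Prometheus metric names must match the following regular expression (label names follow the same rule, minus the colons); characters outside this set count as invalid:

```
[a-zA-Z_:][a-zA-Z0-9_:]*
```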