Troubleshooting metrics

Use the following information to help you understand and fix issues when your metrics data doesn't display as expected.

Dips in your metrics graphs

Chronosphere relies on two data points to emit aggregated metrics for counters. When a change interrupts the data stream, Chronosphere has only one data point to operate on in a given interval. In this case, Chronosphere uses a null value for the previous data point, which causes dips in graphs that use those counters.
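
As a hypothetical illustration (the counter values are made up), suppose a counter reports cumulative values of 100, 110, and 120 at one-minute intervals:

   interval 1: 110 - 100 = 10    (two data points, increase emitted normally)
   interval 2: 120 - 110 = 10    (two data points, increase emitted normally)
   interval 2, if the 110 sample is lost: 120 - null    (one data point, so the emitted value drops)

The underlying counter keeps growing; only the aggregated output dips, because the previous data point is treated as null.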

Data stream changes that cause gaps in collected data can occur for several reasons, including:

  • A deploy that changes instances.
  • An aggregator, or another part of the collection process, failing unexpectedly.
  • Modifying a rollup rule for a counter.
  • The first scrape of a system that has been running for a while.
  • Data that's sparser than the in-memory buffer. For example, the buffer holds 10 minutes of data, but scrapes occur every 11 minutes.

This graphic shows the graphing behavior when a null value is used. This behavior is consistent with the expected behavior of Prometheus.

Graph with dip

Sparse time series

Aggregating sparse time series over long lookback windows can produce results that are inconsistent with querying the raw data for the same period.

The aggregator has a 10-minute time to live (TTL) for time series. If a time series receives data points less frequently than that 10-minute window, the aggregator can't provide results with the same fidelity as querying the raw data.

For example, Chronosphere receives a data point at t0 and another at t-30. At t0, the aggregator can't compare the data point to the one that arrived at t-30, so it records an increase as if the t-30 data point were null. However, the raw, non-aggregated data has records for both the t0 and t-30 data points.

A query that computes a rate across the raw data retrieves both the t0 and t-30 data points and returns the correct answer. A query based on rollup data retrieves some correct data points and others that are misleading because of the seemingly missing data, so its results might be incorrect.
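
As a minimal sketch, assume a hypothetical counter named batch_jobs_completed_total that's scraped every 30 minutes (both the metric name and the interval are assumptions, not values from your environment):

   # Against raw data, both the t0 and t-30 samples are available, so the
   # computed rate reflects the true increase.
   # Against rolled-up data, samples arrive less often than the aggregator's
   # 10-minute TTL, so each increase is recorded as if the previous point were
   # null, and the same query can return misleading results.
   rate(batch_jobs_completed_total[1h])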

Query for missing metrics

To visualize when metrics stop being emitted from a Prometheus endpoint, you can use a specific combination of PromQL functions to build a repeatable query pattern. This is helpful when trying to find flapping endpoints or endpoints that have gone down.

Query components

Here are the main functions and operators that make up this query pattern:

  • timestamp(): returns the timestamp of each sample in the input vector.
  • rate(): calculates the per-second average rate of increase of a counter over the selected range.
  • sum by (instance): aggregates the result while preserving the instance label.
  • unless: a set operator that returns the series on its left side that have no matching label set on its right side.
  • offset: a modifier that shifts a selector's evaluation window into the past.

For a complete list of functions, visit the PromQL function documentation.

Query

This example, which uses all of the previous PromQL functions, examines metrics emitted by the endpoint device and pushed to the Collector push endpoint. The time series metric used is node_load1.

timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m] offset 5m))) unless (timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m]))))

The query identifies the time series as it existed five minutes ago (using offset), and then uses the unless operator to remove any series that are also present now.

If the metric is currently emitting and was also emitting five minutes ago, the two sides match and no point is plotted. However, if the metric isn't currently emitting but was emitting five minutes ago, a point is plotted.

Example query pattern

(
   timestamp(someMetric{} offset 5m)
   unless
   timestamp(someMetric{})
)

Graph

In this graph, the empty portions are where metrics are being ingested. The filled-in portions indicate where metrics aren't being sent, and the length of each line indicates how long metrics weren't sent.

Alert

To send an alert when metrics are no longer being emitted, look for values greater than zero.
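
A minimal sketch of an alert expression built from the previous query (node_load1 and the five-minute offset come from the earlier example; substitute your own metric and interval):

   (
      timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m] offset 5m)))
      unless
      timestamp(sum by (instance) (rate(node_load1{instance=~".+"}[5m])))
   ) > 0

Because timestamp() returns a Unix timestamp in seconds, any series the expression returns has a value greater than zero, so an alert on this expression fires whenever an instance stops emitting the metric.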

Managing invalid metrics

This feature isn't available to all Chronosphere users and might not be visible in your app. For information about enabling this feature in your environment, contact Chronosphere Support.

Chronosphere drops time series that it considers invalid at ingest time. Dropping these series prevents performance degradation and query failures. When an invalid series is dropped, Chronosphere provides information about that series to both you and the customer success team.

You can view your total rejected invalid metrics in the Usage dashboard. The Current usage panel of the dashboard shows rates of different outcomes for metrics sent by Collectors, such as ingested, dropped (drop policy), dropped (invalid), and persisted.

The dashboard also includes a section for Invalid stats, showing how many invalid metrics were received, broken down by usage tag.

During ingestion, Chronosphere drops time series for several reasons, including if size limits are exceeded or if metrics contain invalid characters.

Exceeding size limits

Chronosphere limits the size of some metric components:

  • Label names: 512 characters
  • Label values: 1,024 characters
  • Maximum number of labels: 64
  • Total time series size: 4,096 bytes

Invalid characters

Metrics with names that don't match Prometheus naming conventions might be dropped at ingestion because they contain special characters.
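
For reference, the Prometheus data model expects metric and label names to match the following patterns; names that fall outside them contain invalid characters:

   Metric names: [a-zA-Z_:][a-zA-Z0-9_:]*
   Label names:  [a-zA-Z_][a-zA-Z0-9_]*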