
In addition to PromQL’s standard functions, Chronosphere Observability Platform also supports the following custom functions.
All panel types except Markdown, Service topology, and External use queries. Queries that return an extremely large number of data points or invalid results can result in panel errors. For example, a query might return an error for exceeding server resource limits. Observability Platform reports these errors with an icon that appears in the corner of the Preview pane of the Add panel or Edit panel interfaces, or on the panel when viewing it on the dashboard. Hold the pointer over the icon to view the error message.

__default_over_time()

The __default_over_time(v range-vector, defaultValue scalar, lookback scalar) function returns the most recent value from a range vector if it exists within the specified lookback window. Otherwise, the function returns a default value. Use this function when you have metrics that report intermittently and you want to distinguish between “no recent data” and “data shows zero.” The function only processes float values and ignores any histogram samples in the input. The lookback parameter can be specified as either a duration (such as 1m, 5m) or in seconds. The lookback window must be shorter than the range selector to ensure default values get inserted when data is missing. For instance, the following example returns the last value if that value occurred within the past minute, and otherwise returns 0:
__default_over_time(metric[5m], 0, 1m)
In the following example, the function looks for data points up to five minutes from the current step, and inserts the default only if there were no data points within five minutes. That condition can never be met, because the range selector and lookback window are the same: any step with data in the range selector also has data within the lookback window.
__default_over_time(metric[5m], 0, 5m)
To make the default value appear immediately at the next step, use the $__interval variable. Use this variable when building dashboards for sporadic metrics where you want to show a sensible default, rather than a gap when data hasn’t arrived recently. For example, use $__interval when retrieving batch job results or completing periodic health checks.
__default_over_time(metric[5m], 0, $__interval)
For long range queries, $__interval might exceed the range selector, which could prevent default values from being inserted.
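One way to guard against this is to widen the range selector so that it stays larger than any $__interval value your dashboard is likely to produce. The one-hour range in this sketch is only an illustration; size it for your longest expected interval:
__default_over_time(metric[1h], 0, $__interval)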

__histogram_observations()

The __histogram_observations(lower scalar, upper scalar, v instant-vector) function returns the number of observations that fall within a specified value range from histogram metrics. This function works with both native Prometheus histograms and classic histograms, which helps to answer questions like “How many requests took between 100 ms and 500 ms?” For example, the following query calculates the number of HTTP requests over the last hour (1h) with a duration between 0 and 200 ms (0.2):
__histogram_observations(0, 0.2, rate(http_request_duration_seconds[1h]))
The lower and upper arguments define the boundaries of your range (inclusive). The function interpolates within histogram buckets to estimate the count, so the bounds don't need to align exactly with your histogram bucket boundaries. The metric name is dropped from the result. Use these capabilities when you need to analyze specific segments of your histogram distribution, such as counting requests in your SLO target range or identifying requests in problematic latency bands. The function is similar to the PromQL histogram_fraction() function, which returns an estimated fraction of observations between the given bounds. For instance, a less efficient query using histogram_fraction() that's equivalent to the previous __histogram_observations() query requires multiplying the result by the total count of observations:
histogram_fraction(0, 0.2, rate(http_request_duration_seconds[1h])) * histogram_count(rate(http_request_duration_seconds[1h]))
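For classic histograms, you can pass the bucket series instead. The following is a sketch by analogy with the classic-histogram form shown for __histogram_quantiles() later on this page; confirm that the by(le) grouping matches how your buckets are labeled:
__histogram_observations(0, 0.2, sum by(le) (rate(http_request_duration_seconds_bucket[1h])))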

__histogram_quantiles()

The __histogram_quantiles() function is deprecated, and is only maintained for backwards compatibility. For new queries, use histogram_quantiles() instead, which provides the flexibility to choose your own label name.
The __histogram_quantiles() function calculates and returns multiple quantiles from Chronosphere histograms (native or exponential histogram) or classic Prometheus histograms in a single query, eliminating the need to write multiple queries to plot multiple quantiles. For example, to create a dashboard panel for API latency that needs to visualize p50, p90, p95, and p99, write the query using __histogram_quantiles() as:
# Chronosphere histogram (native or exponential histogram)
__histogram_quantiles(sum(rate(my_histogram{foo="bar"}[5m])), .5, .9, .95, .99)

# Classic Prometheus histogram
__histogram_quantiles(sum by(le) (rate(my_histogram_seconds_bucket{foo="bar"}[5m])), .5, .9, .95, .99)
The __histogram_quantiles() function returns a result for each quantile differentiated by __hist_quantile__, a synthetic label whose value is the quantile argument used to compute the given result. For example, the previous query might return:
Time                  __hist_quantile__   Value
2025-08-06 11:32:50   0.500               0.0025
2025-08-06 11:32:50   0.900               0.0045000000000000005
2025-08-06 11:32:50   0.950               0.004749999999999999
2025-08-06 11:32:50   0.990               0.00495

cardinality_estimate()

The cardinality_estimate() function returns the count estimate of elements in the given instant vector. For example, cardinality_estimate(vec{}) returns the estimated cardinality of the vec metric. Use the cardinality_estimate function in the following ways:
  • To help approximate cardinality for specific metrics, labels, or label-value pairs over time that can’t be correlated using the Persisted Cardinality Quotas dashboard.
  • To see a general trend of your cardinality growth over time, because this function can return results for millions of time series.
Don’t use this function to help understand the relative cardinality impact of a particular series on your license. Instead, use the Persisted Cardinality Quotas dashboard to understand cardinality costs across specific teams, services, and pools, and to help pinpoint specific sources of cardinality growth, such as a particular pool or priority group.
The cardinality_estimate function doesn't measure cardinality in the same 150-minute rolling time window used by license metrics. Instead, this function approximates relative cardinality using 120-minute disjoint blocks, which can create drift. When looking over historical periods of time, the cardinality_estimate function uses even longer blocks.
This function supports grouping time series by labels, and returns an estimated cardinality for each unique value of the label using the by clause in a query. For example, the following query returns the cardinality estimate of all time series that match the metric name with a value for the device label equal to eth0, grouped by unique values for the k8s_cluster label:
cardinality_estimate(node_network_receive_bytes_total{device="eth0"}) by (k8s_cluster)
You can’t group by derived telemetry with this function.

Counting and downsampling

The cardinality_estimate function isn’t a direct alternative to the count function. Because it’s mostly optimized for performance and low-latency use cases, results might not be exact. This function also returns results with much lower resolution than the count function. The resolution aligns with the index block size.
This function provides an alternative to the Prometheus count_over_time function, which isn’t performant when viewing time series with high cardinality.
The cardinality_estimate function is affected by long-term downsampling of the data it's based on, and results might change based on the query window's time range. When querying the raw namespace, this function returns the count of time series over a two-hour period. However, when querying the downsampled namespace, this function returns the count of time series over a period between 24 hours and four days, which makes the volume look much larger than it actually is.

cumsum()

The cumsum(v instant-vector) function returns the cumulative sum of values over time for each series. Use this function to return a running total of a metric across your query time range, rather than point-in-time values. For example, if you have a metric tracking errors and want to see how the total error count accumulates over a full day:
cumsum(sum_over_time(request_error_count{}[$__interval]))
This query transforms a series like 1, 2, 1, 3 into 1, 3, 4, 7. You might use this to visualize cumulative counts for delta counter metrics, like request counts, accumulated bytes transferred, or total events processed since the start of your query window. The function only processes float values, and ignores histograms.
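The same pattern applies to other delta counters. For instance, this sketch accumulates bytes transferred across the query window, assuming a hypothetical delta metric named network_transmit_bytes_delta:
cumsum(sum_over_time(network_transmit_bytes_delta{}[$__interval]))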

ewma()

The ewma(v range-vector, span scalar) function computes an exponentially weighted moving average, which smooths noisy time series data by giving more weight to recent observations while still incorporating historical values. Use this function to filter noise in volatile metrics and expose the underlying trend. The span parameter controls how quickly the average adapts to new values. A smaller span reacts faster to changes, while a larger span provides more smoothing. The span must be greater than 1. For example, to apply a ten-period EWMA to smooth a noisy memory usage metric:
ewma(container_memory_usage_bytes{}[5m], 10)
A shorter span like ewma(container_memory_usage_bytes[10m], 5) tracks changes more closely, which helps with metrics where you want to detect shifts quickly but still reduce noise. The smoothing factor is computed as 2 / (span + 1), so a span of 10 gives approximately 18% weight to each new value.
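Conversely, a larger span smooths more aggressively, which can help when only the long-term trend matters. The values in this sketch are illustrative; a span of 30 gives each new value roughly 6% weight:
ewma(container_memory_usage_bytes{}[30m], 30)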

head_{agg}

head_{agg}(q, n) sorts the time series by the largest value based on the specified aggregation function and returns the top n series. The available head_{agg} functions are:
  • head_avg
  • head_min
  • head_max
  • head_sum
  • head_count
For example, head_avg(MY_METRIC{}, 10) returns the top 10 time series sorted by the largest average of their values. In most cases, head_{agg}() is appropriate. However, if you have time series with a high churn rate, such as metrics that track Kubernetes pod-level data, use topk(). This is because the head_{agg} family of functions aggregates across all time series in the graph, and if you have a metric with high churn, you can miss outliers (depending on their values). In contrast, topk() takes the top n time series based on their value at each timestamp.
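The following sketch shows the two approaches side by side; the metric name is a placeholder:
# Aggregates each series across the whole query window, then keeps the 10 largest averages
head_avg(MY_METRIC{}, 10)

# Keeps the 10 largest series at each individual timestamp instead
topk(10, MY_METRIC{})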

piecewise_constant()

The piecewise_constant(v instant-vector) function approximates your time series as a step function with constant-valued segments, effectively identifying when your metric shifts from one level to another. Use this function to detect capacity changes, configuration updates, or other events that cause a metric to move between stable states. For example, if you want to identify when your connection count changes levels:
piecewise_constant(active_database_connections)
The function uses a bottom-up greedy merging algorithm that starts with small segments and combines adjacent ones when they have similar values. It automatically detects how many distinct levels exist in your data and where transitions occur. A metric that oscillates around a stable value is represented as a flat line, while a metric that shifts between states (like 100 connections, then 200 connections, then back to 100) shows clear steps. Use this function to identify when someone scaled your app (causing a step change in resource usage) or to detect when batch processing jobs complete (causing drops in queue depth).
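For instance, a sketch for spotting when batch jobs drain a queue, assuming a hypothetical batch_queue_depth gauge:
piecewise_constant(batch_queue_depth{})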

robust_trend()

The robust_trend(v instant-vector) function is similar to trend_line, but uses a robust regression technique (Huber loss with Iteratively Reweighted Least Squares) that resists the influence of outliers. This function is ideal when your data contains occasional spikes or anomalies that shouldn’t affect the overall trend calculation. When data is perfectly linear or has no outliers, it produces results similar to trend_line. For example, if your error rate has occasional large spikes that don’t represent the true trend:
robust_trend(error_rate)
This usage is particularly valuable for metrics like network latency that might have occasional dramatic spikes due to transient issues, or for error rates that have periodic anomalous bursts.
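To see how much outliers influence the fit, you can plot both trend functions in the same panel. The error-rate expression is a placeholder:
# Robust fit that discounts spikes
robust_trend(rate(http_errors_total[5m]))

# Ordinary least squares fit for comparison
trend_line(rate(http_errors_total[5m]))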

tail_{agg}

tail_{agg}(q, n) sorts the time series by value based on the specified aggregation function and returns the bottom n series. The available tail_{agg} functions are:
  • tail_avg
  • tail_min
  • tail_max
  • tail_sum
  • tail_count
For example, tail_avg(MY_METRIC{}, 10) returns the bottom 10 time series sorted by the average of their values. In most cases, tail_{agg}() is appropriate. However, if you have time series with a high churn rate, such as metrics that track Kubernetes pod-level data, use bottomk(). This is because the tail_{agg} family of functions aggregates across all time series in the graph, and if you have a metric with high churn, you can miss outliers (depending on their values). In contrast, bottomk() takes the bottom n time series based on their value at each timestamp.
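As with head_{agg}, the following sketch contrasts the two approaches; the metric name is a placeholder:
# Aggregates each series across the whole query window, then keeps the 10 smallest averages
tail_avg(MY_METRIC{}, 10)

# Keeps the 10 smallest series at each individual timestamp instead
bottomk(10, MY_METRIC{})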

trend_line()

The trend_line(v instant-vector) function fits an Ordinary Least Squares (OLS) regression through your time series data and returns the fitted trend line values. Use this function to help identify whether a metric is generally increasing, decreasing, or stable, even when the raw data is sporadic. For instance, to see the linear trend of memory usage over time:
trend_line(container_memory_usage_bytes{})
Use this function to compare actual values against the trend, which can help detect when a metric deviates from its expected trajectory. For example, the following query compares current request rates to the linear trend. Values greater than one exceed the trend, and values less than one fall below it.
rate(requests[5m]) / trend_line(rate(requests[5m]))
This use of the function helps with capacity planning, understanding long-term metric behavior, or creating baselines. The function requires at least two data points. Single-point series are returned unchanged.

sum_per_second()

sum_per_second() calculates the per-second rate for a delta counter or delta histogram time series. It’s equivalent to dividing the result of sum_over_time() by the sliding time window duration. Assuming a step value of 5m, these PromQL queries return the same result:
sum_per_second(http_request_count{}[5m])

sum_over_time(http_request_count{}[5m]) / 300
To ensure the chart value at each step represents the sum of observations between that step's start and end time, set the query's step size equal to the sliding time window value. For more guidance, see Best practices for adding dashboard charts.
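When building dashboard panels, one way to keep the window aligned with the step is to reuse the $__interval variable. This is a sketch; confirm that your panel's step resolves to the interval you expect:
sum_per_second(http_request_count{}[$__interval])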

zscore()

The zscore(v instant-vector) [by|without (labels)] function computes the standard score (z-score) for each series in a group, telling you how many standard deviations each value is from the group mean. Use this function to identify which members of a group are outliers, like when services are behaving differently from the rest. The z-score is calculated as (value - mean) / stddev across all series in a group at each timestamp. The value indicates how far each series deviates from the group average:
Value   Description
0       The value is exactly average
+1      One standard deviation above average
-1      One standard deviation below average
Values beyond ±2 or ±3 are typically considered outliers. For example, to find which nodes have unusual CPU usage:
zscore(node_cpu_seconds_total)
This query compares all nodes at each point in time and shows which ones deviate from the group average. You can use grouping to compute z-scores within subgroups:
zscore(http_request_duration_seconds) by (datacenter)
This query compares services within each datacenter separately, so you can identify outliers per region rather than globally. This is helpful when different groups have different normal ranges. To create alerts for outliers, you might use this query:
abs(zscore(response_time) by (service)) > 2
This alert triggers when any service's response time is more than two standard deviations from the mean for its group, helping detect services that are performing unusually poorly or well. The zscore() function returns not a number (NaN) when the standard deviation is zero (all values in the group are identical), or when a group contains only a single series. The function is equivalent to the following manual calculation, but more concise:
(metric - on() group_left avg(metric)) / on() group_left stddev(metric)
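For the grouped form, the equivalent manual calculation matches on the grouping labels. This sketch mirrors the by (datacenter) example above:
(metric - on(datacenter) group_left avg by(datacenter) (metric)) / on(datacenter) group_left stddev by(datacenter) (metric)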

Other querying features

Observability Platform also provides querying features beyond those available through query languages in its user interface.
  • Delta queries: Query metrics that employ delta temporality, as opposed to cumulative temporality.
  • Prometheus API access: Interact directly with Prometheus API endpoints for programmatic workflows.