Rollup rules
To downsample and aggregate metrics after they're sent by the client but before they're stored, create rollup rules.
Rollup rules are a type of aggregation rule that help you reduce the cardinality footprint of your metrics by dropping raw data to eliminate unneeded labels. High cardinality footprints can cause slow dashboards and queries.
If you're working with late-arriving data, rollup rules are well suited for ensuring all of your data aggregates the way you need it.
As an example, instance or pod labels don't often add value on their own, but removing these labels from the client side isn't always possible. You can use rollup rules to avoid storing these labels.
Rollup rules support both Prometheus and Graphite metrics.
Rollup rule attributes
To accurately aggregate your data, rollup rules require you to configure multiple fields and to understand aggregation operations.
Fields for rollup rules
The following fields are part of the rollup_rule object that you define when creating a rollup rule with Terraform, Chronoctl, or the CreateRollupRule API.
This list isn't comprehensive. See the Rollup rules API documentation for the complete list of fields.
- aggregation: Specifies how to combine the grouped metrics. See the supported aggregation operations for specific help.
- drop_raw: Defaults to false. Set to true to remove raw metrics that match this rollup rule. For more information, see Mapping rules.
- filter: Filters incoming metrics by label. If multiple label filters are specified, an incoming metric must match every label filter to match the rule. Label values support glob patterns, including matching multiple patterns with an OR, such as service:{svc1,svc2}.
  Several special filters are available for matching metrics by non-label request metadata:
  | Filter | Description | Valid values |
  | --- | --- | --- |
  | __metric_type__ | Matches on the incoming metric's Observability Platform metric type. This is the recommended method for filtering on metric type. | cumulative_counter, delta_counter, gauge, or measurement |
  | __metric_source__ | Matches on the incoming metric's source format. | carbon, chrono_gcp, dogstatsd, open_metrics, open_telemetry, prometheus, signalfx, statsd, or wavefront |
  | __m3_prom_type__ | When ingesting with Prometheus, matches on the incoming metric's Prometheus metric type. | counter, gauge, histogram, gauge_histogram, summary, info, state_set, or quantile |
  | __otel_type__ | When ingesting with OpenTelemetry, matches on the incoming metric's OpenTelemetry metric type. | sum, monotonic_sum, gauge, histogram, exp_histogram, or summary |
  | __otel_temporality__ | When ingesting with OpenTelemetry, matches on the incoming metric's OpenTelemetry temporality. | delta or cumulative |
  | __m3_type__ | DEPRECATED. Matches on the incoming metric's legacy M3 type. | counter, gauge, or timer |
  Example:
  __metric_type__:cumulative_counter service:gateway __name__:http_requests_*
  This filter matches any cumulative counter metric with a service=gateway label whose metric name starts with http_requests_.
- expansive_match: Defaults to false. If false, a series matches and aggregates only if each label defined by the provided filters and the label_policy.keep or graphite_label_policy.replace settings exists in the series. Setting expansive_match to true removes this restriction.
- interval: The distance in time between aggregated data points. Intervals are based on your retention policy. Use this optional field to set a custom interval. (Known as storage_policies in version 0.286.0-2023-01-06-release.1 and earlier.)
- A label policy: Label policies act as a filter, defining which labels to preserve in the resulting metric. Use group_by to keep one or more labels, or exclude_by to ignore one or more labels. (Known as keep and discard in the Aggregation Rules UI.)
- metric_type: The metric type aggregated. Valid options vary depending on your data. See the supported aggregation operations for specific help.
  Each rollup rule must declare the type of metric it aggregates by setting the metric_type field, because each metric type aggregates differently.
  ⚠️ Choosing the wrong metric_type for your rule can produce unexpected results.
- new_metric (Terraform) | metric_name (Chronoctl, API): The name of the new metric to create and persist to the database. You can use the template string {{ .MetricName }} to create a new metric name that references the original metric name. For instance, new_metric: '{{ .MetricName }}:by_instance' outputs a metric with the name my_metric:by_instance if the matched metric is my_metric.
  This field is optional for Graphite rollup rules.
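The label filter behavior described for the filter field (every label filter must match, values accept globs, and braces list alternatives) can be modeled in a few lines of Python. This is an illustrative sketch only, not how Observability Platform implements matching: expand_braces, matches_filter, and the dictionary representation of a series are hypothetical, and Python's fnmatch stands in for the platform's glob engine. The special metadata filters such as __metric_type__ aren't modeled.

```python
import re
from fnmatch import fnmatchcase


def expand_braces(pattern):
    """Expand {a,b,...} groups into a list of plain glob patterns."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    expanded = []
    for alternative in m.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def matches_filter(labels, filters):
    """Return True only if every label filter matches the series labels."""
    for name, pattern in filters.items():
        value = labels.get(name)
        if value is None:
            return False  # the series doesn't carry this label at all
        if not any(fnmatchcase(value, p) for p in expand_braces(pattern)):
            return False
    return True


# Hypothetical filter: service:{svc1,svc2} __name__:http_requests_*
filters = {"service": "{svc1,svc2}", "__name__": "http_requests_*"}
series_a = {"__name__": "http_requests_total", "service": "svc2", "pod": "p1"}
series_b = {"__name__": "http_requests_total", "service": "svc3"}

print(matches_filter(series_a, filters))  # True
print(matches_filter(series_b, filters))  # False
```

In this model, series_b fails because its service value matches neither alternative, even though its metric name matches the glob.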
Label policies
Label policies define which labels to preserve in the resulting metric. To set one, add the appropriate field to the rollup rule definition.
You can set only one of group_by or exclude_by per rollup rule. Graphite metrics support only the exclude_by rule type.
- group_by (keep in the Aggregation Rules UI)
  When using group_by rollup rules, you must specify the labels by which to aggregate the metrics. The rule aggregates only metrics that contain all of the specified labels. A metric that is missing any of them isn't included in the rule. group_by retains only the selected labels and discards any other labels.
  Use a group_by rule when there are individual metrics you can filter with the __name__ label that you want to aggregate.
- exclude_by (discard in the Aggregation Rules UI)
  With an exclude_by rollup rule, you specify which labels to remove from the aggregated metric, while keeping all other labels.
  Use an exclude_by rule when you want to target a group of metrics for a particular service, team, or other higher level set of metrics.
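The difference between the two label policies can be sketched as a small Python function. This is a simplified model under stated assumptions, not the platform's implementation: apply_label_policy and the dictionary of labels are hypothetical, and the effect of expansive_match isn't modeled.

```python
def apply_label_policy(labels, group_by=None, exclude_by=None):
    """Return the output label set, or None if the rule skips the series."""
    if (group_by is None) == (exclude_by is None):
        raise ValueError("set exactly one of group_by or exclude_by")
    if group_by is not None:
        # group_by: the series must carry every listed label,
        # and only those labels survive.
        if any(name not in labels for name in group_by):
            return None
        return {name: labels[name] for name in group_by}
    # exclude_by: drop the listed labels and keep everything else.
    return {k: v for k, v in labels.items() if k not in exclude_by}


series = {"service": "gateway", "instance": "i-1", "status": "200"}

print(apply_label_policy(series, group_by=["service", "status"]))
print(apply_label_policy(series, exclude_by=["instance"]))
print(apply_label_policy(series, group_by=["service", "region"]))  # None
```

Here both policies produce the same output labels for this series, but the group_by rule skips any series that lacks one of its labels, while the exclude_by rule keeps every label it wasn't told to remove.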
Set a Graphite label policy
For Graphite metrics, you can use the graphite_label_policy parameter to also set a Graphite-specific label policy. This lets you define replacements for label values without changing their positions, which can reduce cardinality without breaking Graphite metrics' preferred positional indexing.
For example, assume you have raw metric names that follow this pattern:
cluster.production.instance.instance1.requests_count
cluster.production.instance.instance2.requests_count
...
You can create a Graphite label policy with a replacement rule that replaces the value of the positional label __g3__ (instance1 or instance2 in this example, because positions are counted from zero) with a new string value (INSTANCE).
This replacement aggregates these metrics as cluster.production.instance.INSTANCE.requests_count, without changing their positional indexing.
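The effect of a positional replacement on metric names can be sketched in Python. This is only a model of the naming outcome, assuming positions are zero-based (which matches the example, where __g3__ maps to instance1). replace_positions is a hypothetical helper, not part of Chronoctl or any Chronosphere library.

```python
def replace_positions(metric_name, replacements):
    """Replace dot-separated segments addressed by '__gN__' labels."""
    parts = metric_name.split(".")
    for label, new_value in replacements.items():
        # '__g3__' -> 3 (assumes zero-based positions)
        index = int(label.strip("_")[1:])
        if index < len(parts):
            parts[index] = new_value
    return ".".join(parts)


raw_names = [
    "cluster.production.instance.instance1.requests_count",
    "cluster.production.instance.instance2.requests_count",
]

# Both raw names collapse into one aggregated name.
rolled_up = {replace_positions(n, {"__g3__": "INSTANCE"}) for n in raw_names}
print(rolled_up)
```

Because both raw names map to the same output name, the two series aggregate into one while every segment keeps its original position.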
The output of the chronoctl rollup-rules scaffold command includes the graphite_label_policy parameter:
api_version: v1/config
kind: RollupRule
spec:
  ...
  graphite_label_policy:
    # Required list of labels to replace. Useful for discarding
    # high-cardinality values while still preserving the original positions of
    # the Graphite metric.
    replace:
      - # Required name of the label whose value should be replaced. Only
        # '__gX__' labels are allowed (aka positional Graphite labels).
        name: <string>
        # Required new value of the replaced label.
        new_value: <string>
  ...
To implement the rule from the example scenario as a Chronoctl YAML resource, define the name and new_value in the list of replace values:
api_version: v1/config
kind: RollupRule
spec:
  ...
  graphite_label_policy:
    replace:
      - name: "__g3__"
        new_value: "INSTANCE"
  ...
Define multiple replacements in a single rollup rule by adding more pairs of name and new_value to the replace list.
Supported aggregation operations
Some operations can change the type of the metric during aggregation. The resulting metric type of an aggregation is called the output metric type.
Even if you are ingesting data with the wrong metric type, configure your rollup rule with the metric type that the ingested data should have. For example, if Chronosphere Observability Platform ingests metrics with type GAUGE, but the values actually represent DELTA_COUNTER, use a metric_type=DELTA_COUNTER rollup rule to aggregate them.
Rollup rules support the following aggregation operations:
CUMULATIVE_COUNTER
Cumulative counters support these aggregations:
- SUM: Takes the increase of each individual input series within the configured interval, then sums the increases together according to the configured label policy. The output is the cumulative summed increase across all input series.
- COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality).
The output type of all cumulative counter aggregations is a CUMULATIVE_COUNTER.
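As a rough illustration of the SUM description for cumulative counters, the following Python sketch takes the per-interval increase of each input series, sums the increases, and emits a running total. It's a simplified model under assumptions, not the platform's implementation: rollup_cumulative_sum is a hypothetical name, every series is assumed to be sampled at the same interval boundaries, and counter resets aren't handled.

```python
def rollup_cumulative_sum(series_values):
    """Model SUM over cumulative counters.

    series_values holds one list of cumulative readings per input series,
    all sampled at the same interval boundaries. The first reading of each
    series is only a baseline, so the output has one fewer point.
    """
    points = len(series_values[0])
    output, running_total = [], 0
    for i in range(1, points):
        # Increase of each series within interval i, summed across series.
        increase = sum(values[i] - values[i - 1] for values in series_values)
        running_total += increase
        output.append(running_total)
    return output


instance_a = [100, 110, 125]  # increases of 10, then 15
instance_b = [5, 5, 9]        # increases of 0, then 4

print(rollup_cumulative_sum([instance_a, instance_b]))  # [10, 29]
```

The output starts from the summed increases rather than from the raw counter values, so the differing starting points of the two instances (100 and 5) don't leak into the aggregated series.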
GAUGE
Gauges support the following aggregation methods:
- SUM: Takes the maximum value of each individual input series within the configured interval, then sums those per-series values together by the configured label policy.
- COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality).
- MIN: Takes the minimum value of all data points within the configured interval across all series matched by the configured label policy.
- MAX: Takes the maximum value of all data points within the configured interval across all series matched by the configured label policy.
- PXX, MEAN, MEDIAN, STDEV, SUMSQ: Takes the maximum value of each individual input series within the configured interval, then computes the desired value distribution.
The output type of all gauge aggregations is a GAUGE.
When you query a gauge metric with a range vector, downsampling might affect the accuracy of the query result. Most use cases that fit these criteria can be converted to use counters instead, which avoids the issue.
DELTA_COUNTER
Supported aggregations:
- SUM: Sums all values of all series matched by the configured label policy. All values must be nonnegative.
- COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality).
- COUNT_SAMPLES: Counts the number of input samples matched by the configured label policy.
The output type of all delta counter aggregations is a CUMULATIVE_COUNTER.
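Because delta counter aggregations change the metric type, a small model can help show what the output looks like. The Python sketch below sums nonnegative deltas from all matched series per interval and emits a running total, which is one reasonable reading of a cumulative counter output. rollup_delta_sum and count_samples are hypothetical names, and the exact way Observability Platform accumulates the output is an assumption here.

```python
def rollup_delta_sum(intervals):
    """Model SUM over delta counters.

    intervals is a list with one entry per aggregation interval, each
    holding the delta values received from all matched series.
    """
    running_total, output = 0, []
    for deltas in intervals:
        if any(d < 0 for d in deltas):
            raise ValueError("delta counter values must be nonnegative")
        running_total += sum(deltas)
        output.append(running_total)  # cumulative, not per-interval
    return output


def count_samples(intervals):
    """Model COUNT_SAMPLES: number of input samples per interval."""
    return [len(deltas) for deltas in intervals]


intervals = [[3, 2], [0, 5], [1]]
print(rollup_delta_sum(intervals))  # [5, 10, 11]
print(count_samples(intervals))     # [2, 2, 1]
```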
Exceptions for DELTA_COUNTER metrics
DELTA_COUNTER metrics don't require the following fields for rollup rules:
- name
- aggregation
- keep
- discard
MEASUREMENT
A key feature of MEASUREMENT aggregations lies in how they treat individual samples. Unlike other types such as GAUGE and CUMULATIVE_COUNTER, MEASUREMENT metrics aggregate all at once, across all samples of your matching time series within the aggregated time interval. This enables calculation of accurate statistics server-side, within Observability Platform.
A typical use case for MEASUREMENT aggregations is calculating statistics over raw request latencies across all instances. To do this correctly, set metric_type=MEASUREMENT and aggregation=P95. Using metric_type=GAUGE in this scenario produces undesired results: it discards all samples except the per-instance maximum value, then computes the 95th percentile across those per-instance maximum values.
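The gap between the two metric types in this latency scenario can be made concrete with a small calculation. The Python sketch below uses made-up latency samples and a nearest-rank percentile, which is only one of several common percentile definitions and not necessarily the one Observability Platform uses. The percentile helper and the instance names are hypothetical.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), avoiding float rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up request latencies (ms) within one aggregation interval.
latencies = {
    "instance1": [10, 12, 11, 13, 250],
    "instance2": [9, 10, 11, 12, 14],
    "instance3": [8, 9, 10, 11, 12],
    "instance4": [10, 10, 11, 11, 12],
}

# MEASUREMENT-style: the percentile sees every sample from every instance.
all_samples = [v for samples in latencies.values() for v in samples]
measurement_p95 = percentile(all_samples, 95)

# GAUGE-style: each instance is first reduced to its maximum value.
per_instance_max = [max(samples) for samples in latencies.values()]
gauge_p95 = percentile(per_instance_max, 95)

print(measurement_p95)  # 14
print(gauge_p95)        # 250
```

With these samples, a single slow request dominates the gauge-style result, while the measurement-style result reflects the distribution of all 20 samples.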
Measurements support all aggregation methods:
- SUM: Sums all values of all series matched by the configured label policy. All values must be nonnegative. The output metric type is a CUMULATIVE_COUNTER.
- COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality). The output metric type is a CUMULATIVE_COUNTER.
- COUNT_SAMPLES: Counts the number of input samples matched by the configured label policy. The output metric type is a CUMULATIVE_COUNTER.
- LAST: Takes the last value of all samples matched by the configured label policy. The output metric type is a GAUGE.
- MIN: Takes the minimum value of all samples matched by the configured label policy. The output metric type is a GAUGE.
- MAX: Takes the maximum value of all samples matched by the configured label policy. The output metric type is a GAUGE.
- PXX, MEAN, MEDIAN, STDEV, SUMSQ: Computes the desired value distribution across all samples matched by the configured label policy. The output metric type is a GAUGE.
- HISTOGRAM: Summarizes the distribution of values as an exponential histogram with a scale of 3. The output type is a CUMULATIVE_EXPONENTIAL_HISTOGRAM.
Supported histogram aggregation operations
If either the input histogram or resulting aggregation exceeds the 160-bucket limit, Observability Platform decreases the exponential histogram scale until the bucket count is within the limit. Downscaling reduces the exponential histogram's resolution.
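To see why lowering the scale brings a histogram back under the bucket limit, consider the following Python sketch. It relies on the general property of OpenTelemetry-style exponential histograms that reducing the scale by one merges each pair of adjacent buckets. The downscale function is hypothetical, only one sign of the histogram is modeled, and measuring the bucket count as the span between the lowest and highest populated index is an assumption about how the limit is applied.

```python
def downscale(buckets, scale, max_buckets=160):
    """Lower the scale until the bucket span fits within max_buckets.

    buckets maps a bucket index to its count for one sign of an
    exponential histogram. Returns the merged buckets and the new scale.
    """
    while buckets and max(buckets) - min(buckets) + 1 > max_buckets:
        merged = {}
        for index, count in buckets.items():
            # Reducing the scale by one merges adjacent bucket pairs:
            # indexes 2k and 2k+1 both land in index k.
            merged[index >> 1] = merged.get(index >> 1, 0) + count
        buckets, scale = merged, scale - 1
    return buckets, scale


# 400 populated buckets at scale 3, one observation each.
wide = {i: 1 for i in range(400)}
narrow, new_scale = downscale(wide, scale=3)

print(new_scale)             # 1
print(len(narrow))           # 100
print(sum(narrow.values()))  # 400
```

Each step halves the number of buckets and keeps the total observation count, which is why the result stays complete but becomes coarser.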
CUMULATIVE_EXPONENTIAL_HISTOGRAM
Cumulative exponential histogram aggregations operate on OpenTelemetry exponential histograms with cumulative temporality, and on Prometheus native histograms with an exponential bucket layout.
Cumulative exponential histograms support this aggregation method:
- SUM: Merges input cumulative exponential histograms by the configured label policy. The output metric type is a CUMULATIVE_EXPONENTIAL_HISTOGRAM.
DELTA_EXPONENTIAL_HISTOGRAM
Delta exponential histogram aggregations operate on OpenTelemetry exponential histograms with delta temporality.
Delta exponential histograms support this aggregation method:
- SUM: Merges input delta exponential histograms by the configured label policy. The output metric type is a CUMULATIVE_EXPONENTIAL_HISTOGRAM.
View rollup rules
Select from the following methods to view rollup rules.
In Observability Platform, view rollup rules in the Aggregation rules UI.
For information about viewing, copying, or downloading rule configurations, see Rule configuration.
Create a rollup rule
Select from the following methods to apply rollup rules. Observability Platform doesn't limit the number of rollup rules a system can have.
If you define a rollup rule using the Observability Platform app, you must download the rule configuration and apply it with one of the supported methods.
Create rollup rule configurations in Observability Platform from the Aggregation rules UI.
When creating a rule configuration, the Visual Editor displays by default. When creating a rule in Metrics Analyzer, the dialog pre-populates fields based on the user's selected data.
To create a rule configuration:
- Enter or edit data for the following fields:
- Rule name: Add or edit the name of the rule.
- Rule mode: Either Rule Preview or Rule Enabled.
- Matching Time Series: Time series the rule applies to. Comma-separated, and supports glob syntax. Add a Label, a function (= or !=), and a Value. Click Add to add another time series.
- Labels to Roll Up: Discard Labels, or Keep Labels. Add labels to the Input Labels text box.
- Output Metric: The new metric's name and aggregation configuration.
- Output Metric Name: Edit the output metric name. Clear the checkbox for Include metric name to remove the original name.
- Input Metric Type: Select a metric type.
- Aggregation: Select an aggregation operation.
- Sample Interval: The length of time between samples.
- Raw Data: Select the toggle to drop the raw input data after aggregation.
- When finished, click Code Config.
- Choose your rule creation method from these options:
- Chronoctl
- Terraform
- API
- Apply the changes based on your selected method.
Rollup rules take effect immediately, but can require a full recording interval to show a change.
Best practices for rule creation
Following these guidelines helps ensure your rollup rules work as intended:
- Use Live Telemetry Analyzer to verify your glob syntax to ensure your query matches the correct metrics.
- Before using a rollup rule to group labels, be sure those labels aren't used in other places, such as dashboards, monitors, or the queries you use to debug issues.
- Filters using curly braces ({}) shouldn't use a dash (-) in the filter for label names, because the filter interprets the dash as a range. For example, service_cluster: !{my-label} fails. Rewrite the filter to service_cluster:!human-label instead.
- Metrics can match more than one rule. Matching multiple rules can affect data retention. If any matching rule sets drop_raw=true, raw metrics are dropped.
Chronoctl rollup rule example
Here's an example of a rollup rule that matches time series whose metric name is permits_blocked, while discarding the instance and job labels. It uses a counter type metric, and aggregates as a sum using a 30-second interval.
api_version: v1/config
kind: RollupRule
spec:
  slug: permits_blocked_without_instance
  name: permits blocked without instance
  filters:
    - name: __name__
      value_glob: permits_blocked
  expansive_match: false
  metric_name: '{{ .MetricName }}:without_instance'
  metric_type: COUNTER
  aggregation: SUM
  interval: 30s
  label_policy:
    discard:
      - instance
      - job
  mode: ENABLED
Terraform rollup rule example
Here's an example of a rollup rule that matches time series whose metric name is permits_blocked, while discarding the instance and job labels. It uses a counter type metric, and aggregates as a sum using a 30-second interval.
resource "chronosphere_rollup_rule" "permits_blocked_without_instance" {
  name            = "permits blocked without instance"
  slug            = "permits_blocked_without_instance"
  filter          = "__name__:permits_blocked"
  expansive_match = "false"
  metric_type     = "COUNTER"
  aggregation     = "SUM"
  interval        = "30s"
  exclude_by      = ["instance", "job"]
  mode            = "ENABLED"
  new_metric      = "{{ .MetricName }}:without_instance"
}
Delete a rollup rule
To delete a rollup rule with Chronoctl, use the chronoctl rollup-rules delete command and provide the slug of the rule to delete.
chronoctl rollup-rules delete SLUG
For example, to delete the http_request_duration_by_service_and_status rule, use this command:
chronoctl rollup-rules delete http_request_duration_by_service_and_status
If your slug starts with a dash (-), use double quotes (") around the slug name.
chronoctl rollup-rules delete "/-my-rollup-rule"