Rollup rules

To downsample and aggregate metrics after they’re sent by the client but before they’re stored, create rollup rules.

Rollup rules are a type of aggregation rule that help you reduce the cardinality footprint of your metrics by dropping raw data to eliminate unneeded labels. High cardinality footprints can cause slow dashboards and queries.

If you’re working with late-arriving data, rollup rules are well suited for ensuring all of your data aggregates the way you need it.

As an example, instance or pod labels don’t often add value on their own, but removing these labels from the client side isn’t always possible. You can use rollup rules to avoid storing these labels.

Rollup rules support both Prometheus and Graphite metrics.

Rollup rule attributes

To aggregate your data accurately, rollup rules require you to configure multiple fields and to understand aggregation operations.

Fields for rollup rules

See the CreateRollupRule API documentation for the complete list of fields that are part of the rollup_rule object that you define when creating a rollup rule with any of the supported methods.

Label policies

Use label policies to define which labels to preserve in the resulting metric. In the rollup rule definition, add the appropriate field to specify which labels to retain or discard.

You can set only one of group_by or exclude_by per rollup rule. Graphite metrics support only the exclude_by rule type.

Keep specified labels

To aggregate only metrics that contain all of the specified labels and discard all other labels, use group_by (Terraform) or keep. When using these rollup rules, you must specify the labels to aggregate the metrics by. If a metric doesn’t include all of the specified labels, the metric isn’t included in the rule.

If a rollup rule uses group_by or keep, the rule matches only metrics that contain all of the specified labels, even if the rule’s label filters would otherwise have matched additional metrics.
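The keep behavior can be illustrated with a small Python sketch. This is not Observability Platform code: the function name `rollup_keep`, the in-memory series layout, and the choice of a plain sum are assumptions made only to show how series that lack a required label fall outside the rule.

```python
from collections import defaultdict

def rollup_keep(series, keep):
    """Sum values per group, keeping only the labels listed in `keep`.

    `series` is a list of (labels, value) pairs, where labels is a dict.
    A series missing any of the `keep` labels is skipped entirely.
    """
    groups = defaultdict(float)
    for labels, value in series:
        if not all(name in labels for name in keep):
            continue  # missing a required label: not included in the rule
        key = tuple((name, labels[name]) for name in keep)
        groups[key] += value
    return dict(groups)

series = [
    ({"service": "api", "status": "200", "instance": "a"}, 3.0),
    ({"service": "api", "status": "200", "instance": "b"}, 4.0),
    ({"service": "api", "instance": "c"}, 5.0),  # no "status" label
]
keep_result = rollup_keep(series, keep=["service", "status"])
print(keep_result)
```

The third series has no status label, so it doesn't contribute to the aggregated value even though its other labels match.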

Remove specified labels

To target a group of metrics for a particular service, team, or other higher-level set of metrics, use exclude_by (Terraform) or discard. When using these rollup rules, you specify which labels to remove from the aggregated metric, while keeping all other labels.
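As a rough illustration of the discard behavior, the following Python sketch removes the listed labels and groups series by whatever labels remain. The function name `rollup_discard`, the data layout, and the use of a sum are hypothetical and chosen only for the example.

```python
from collections import defaultdict

def rollup_discard(series, discard):
    """Sum values per group after removing the labels listed in `discard`.

    All labels that aren't discarded are kept and form the grouping key.
    """
    groups = defaultdict(float)
    for labels, value in series:
        remaining = tuple(sorted(
            (name, val) for name, val in labels.items() if name not in discard
        ))
        groups[remaining] += value
    return dict(groups)

series = [
    ({"service": "api", "instance": "a", "job": "web"}, 1.0),
    ({"service": "api", "instance": "b", "job": "web"}, 2.0),
    ({"service": "db", "instance": "c", "job": "store"}, 5.0),
]
discard_result = rollup_discard(series, discard=["instance", "job"])
print(discard_result)
```

Unlike the keep policy, no series is skipped for lacking a label. Each series lands in the group defined by its remaining labels.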

Set a Graphite label policy

For Graphite metrics, you can use the graphite_label_policy parameter to also set a Graphite-specific label policy. This lets you define replacements for label values without changing their positions, which can reduce cardinality without breaking Graphite metrics’ preferred positional indexing.

For example, assume you have raw metric names that follow this pattern:

cluster.production.instance.instance1.requests_count
cluster.production.instance.instance2.requests_count
...

You can create a Graphite label policy with a replacement rule that replaces the value of the positional label __g3__ (instance1, instance2, and so on, because positions count from zero) with a new string value (INSTANCE).

This replacement aggregates these metrics as cluster.production.instance.INSTANCE.requests_count, without changing their positional indexing.

The output of the chronoctl rollup-rules scaffold command includes the graphite_label_policy parameter:

api_version: v1/config
kind: RollupRule
spec:
  ...
    graphite_label_policy:
      # Required list of labels to replace. Useful for discarding
      # high-cardinality values while still preserving the original positions of
      # the Graphite metric.
      replace:
        - # Required name of the label whose value should be replaced. Only
          # '__gX__' labels are allowed (aka positional Graphite labels).
          name: <string>
          # Required new value of the replaced label.
          new_value: <string>
  ...

To implement the rule from the example scenario as a Chronoctl YAML resource, define the name and new_value in the list of replace values:

api_version: v1/config
kind: RollupRule
spec:
  ...
    graphite_label_policy:
      replace:
        - name: "__g3__"
          new_value: "INSTANCE"
  ...

Define multiple replacements in a single rollup rule by adding more pairs of name and new_value to the replace list.
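To make the effect of a replace list concrete, here is a minimal Python sketch that applies positional replacements to a dotted Graphite name. The helper `apply_graphite_replacements` is hypothetical, and it assumes that `__gX__` positions count from zero, which matches the example output shown earlier.

```python
def apply_graphite_replacements(metric_name, replacements):
    """Replace the values at the given positional labels of a Graphite name.

    `replacements` maps positional label names such as "__g3__" to new values.
    Positions are assumed to be zero-indexed.
    """
    parts = metric_name.split(".")
    for label, new_value in replacements.items():
        position = int(label.strip("_")[1:])  # "__g3__" -> 3
        if position < len(parts):
            parts[position] = new_value
    return ".".join(parts)

raw_names = [
    "cluster.production.instance.instance1.requests_count",
    "cluster.production.instance.instance2.requests_count",
]
rolled_up = {apply_graphite_replacements(n, {"__g3__": "INSTANCE"}) for n in raw_names}
print(rolled_up)
```

Both raw names collapse to the same output name, which is what reduces cardinality while the number of positions stays the same.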

Aggregation operations

Some operations can change the type of the metric during aggregation. The resulting metric type of an aggregation is called the output metric type.

Even if you are ingesting data with the wrong metric type, configure your rollup rule with the metric type that the ingested data should be. For example, if Chronosphere Observability Platform ingests metrics with type GAUGE, but the values actually represent DELTA_COUNTER, use a metric_type=DELTA_COUNTER rollup rule to aggregate them.

Rollup rules support the following aggregation operations:

`CUMULATIVE_COUNTER`

Cumulative counters support these aggregations:

SUM: Takes the increase of each individual input series within the configured interval, then sums the increases together according to the configured label policy. The output is the cumulative summed increase across all input series.
COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality).

The output type of all cumulative counter aggregations is a CUMULATIVE_COUNTER.
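The SUM description for cumulative counters can be sketched in Python as follows. This is an illustration under stated assumptions, not the platform's implementation: the function name `series_increase` is hypothetical, and treating a drop in value as a counter reset is an assumption borrowed from common Prometheus-style counter handling.

```python
def series_increase(samples):
    """Increase of one cumulative counter across the samples in an interval.

    A drop in value is treated as a counter reset (assumption), so the new
    value counts as the increase since the reset.
    """
    increase = 0.0
    for previous, current in zip(samples, samples[1:]):
        increase += current - previous if current >= previous else current
    return increase

# Samples per input series within one configured interval.
interval_samples = {
    "instance1": [100, 105, 112],
    "instance2": [40, 41, 3],  # resets between the last two samples
}

interval_increase = sum(series_increase(s) for s in interval_samples.values())

# The output is cumulative, so each interval's increase adds to a running total.
running_total = 0.0
running_total += interval_increase
print(interval_increase, running_total)
```

Summing increases rather than raw values is what keeps the output meaningful when individual instances restart at different times.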

`GAUGE`

Gauges support the following aggregation methods:

SUM: Takes the max value of each individual input series within the configured interval, then sums all final values together by the configured label policy.
COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality).
MIN: Takes the minimum value of all data points within the configured interval across all series matched by the configured label policy.
MAX: Takes the maximum value of all data points within the configured interval across all series matched by the configured label policy.
PXX, MEAN, MEDIAN, STDEV, SUMSQ: Takes the maximum value of each individual input series within the configured interval, and then computes the value distribution.

The output type of all gauge aggregations is a GAUGE.

When you query a gauge metric with a range vector in the query, downsampling might impact the accuracy of the query result. Most use cases that fit these criteria can be converted to use counters instead, which avoids the issue.
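The gauge descriptions above can be read literally as the following Python sketch. The sample data and variable names are invented for illustration, and the code follows only the wording of this section: one value per input series for SUM, and all data points for MIN and MAX.

```python
# Data points per input series within one configured interval.
gauge_samples = {
    "instance1": [2.0, 5.0, 3.0],
    "instance2": [7.0, 1.0],
}

# SUM: reduce each series to a single value first, then sum across series.
per_series_value = {name: max(values) for name, values in gauge_samples.items()}
gauge_sum = sum(per_series_value.values())

# COUNT: number of unique input series.
gauge_count = len(gauge_samples)

# MIN and MAX: consider every data point across all matched series.
all_points = [value for values in gauge_samples.values() for value in values]
gauge_min = min(all_points)
gauge_max = max(all_points)

print(gauge_sum, gauge_count, gauge_min, gauge_max)
```

Because SUM reduces each series to one value before combining, samples other than that value don't influence the result.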

`DELTA_COUNTER`

Supported aggregations:

SUM: Sums all values of all series matched by the configured label policy. All values must be nonnegative.
COUNT_SAMPLES: Counts the number of input samples matched by the configured label policy.

The output type of all delta counter aggregations is a DELTA_COUNTER.

Exceptions for `DELTA_COUNTER` metrics

DELTA_COUNTER metrics don’t require the following fields for rollup rules:

- name
- aggregation
- keep
- discard

`MEASUREMENT`

A key feature of MEASUREMENT aggregations lies in how they treat individual samples. Unlike other types such as GAUGE and CUMULATIVE_COUNTER, MEASUREMENT metrics aggregate all at once, across all samples of your matching time series within the aggregated time interval. This enables calculation of accurate statistics server-side, within Observability Platform.

A typical use case for MEASUREMENT aggregations is calculating statistics across raw request latencies across all instances. This can be correctly performed through metric_type=MEASUREMENT and aggregation=P95. Using metric_type=GAUGE in this scenario produces results you don’t want, discarding all samples except the per-instance max value, then computing the ninety-fifth percentile across these per-instance max values.

Measurements support all aggregation methods:

SUM: Sums all values of all series matched by the configured label policy. All values must be nonnegative. The output metric type is a DELTA_COUNTER.
COUNT_SAMPLES: Counts the number of input samples matched by the configured label policy. The output metric type is a DELTA_COUNTER.
LAST: Takes the last value of all samples matched by the configured label policy. The output metric type is a GAUGE.
MIN: Takes the minimum value of all samples matched by the configured label policy. The output metric type is a GAUGE.
MAX: Takes the maximum value of all samples matched by the configured label policy. The output metric type is a GAUGE.
PXX, MEAN, MEDIAN, STDEV, SUMSQ: Computes the value distribution across all samples matched by the configured label policy. The output metric type is a GAUGE.
HISTOGRAM: Summarizes the distribution of values as an exponential histogram with a scale of 3. The output type is a DELTA_EXPONENTIAL_HISTOGRAM.
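The difference between a MEASUREMENT P95 and a GAUGE P95 for raw latencies can be shown with a short Python sketch. The latency values are invented, and the nearest-rank percentile used here is an assumption; this section doesn't state which percentile method Observability Platform uses.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

# Raw request latencies (ms) per instance within one interval.
latencies = {
    "instance1": [10, 12, 11, 13, 250],
    "instance2": [9, 10, 11, 12, 14],
    "instance3": [10, 10, 11, 12, 15],
    "instance4": [8, 9, 10, 11, 13],
}

# MEASUREMENT: the percentile is computed across every sample at once.
all_samples = [v for samples in latencies.values() for v in samples]
measurement_p95 = nearest_rank_percentile(all_samples, 95)

# GAUGE: each instance is first reduced to one value (the per-instance max,
# as described above), and the percentile is computed over those values.
per_instance_max = [max(samples) for samples in latencies.values()]
gauge_p95 = nearest_rank_percentile(per_instance_max, 95)

print(measurement_p95, gauge_p95)
```

With this data, the measurement-style result reflects the bulk of requests, while the gauge-style result is dominated by a single outlier on one instance.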

Histogram aggregation operations

If either the input histogram or resulting aggregation exceeds the 160-bucket limit, Observability Platform decreases the exponential histogram scale until the bucket count is within the limit. Downscaling reduces the exponential histogram’s resolution.
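The downscaling step can be sketched in Python using the OpenTelemetry exponential histogram convention, in which lowering the scale by one merges each pair of adjacent buckets (index `i` moves to index `i >> 1`). The function names, the sparse dict layout, and measuring the bucket count as the span between the lowest and highest index are assumptions for illustration, not a description of the platform's internals.

```python
MAX_BUCKETS = 160  # bucket limit stated above

def bucket_span(buckets):
    """Number of bucket slots between the lowest and highest populated index."""
    if not buckets:
        return 0
    return max(buckets) - min(buckets) + 1

def downscale_once(buckets):
    """Lower the scale by one: adjacent bucket pairs merge into one bucket."""
    merged = {}
    for index, count in buckets.items():
        merged[index >> 1] = merged.get(index >> 1, 0) + count
    return merged

def fit_to_limit(scale, buckets, max_buckets=MAX_BUCKETS):
    """Decrease the scale until the bucket span is within the limit."""
    while bucket_span(buckets) > max_buckets:
        buckets = downscale_once(buckets)
        scale -= 1
    return scale, buckets

# A scale-3 histogram whose populated buckets span 200 indexes.
wide = {index: 1 for index in range(200)}
new_scale, narrowed = fit_to_limit(3, wide)
print(new_scale, bucket_span(narrowed), sum(narrowed.values()))
```

The total count is preserved, but each remaining bucket covers a wider range of values, which is the loss of resolution the paragraph above describes.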

`CUMULATIVE_EXPONENTIAL_HISTOGRAM`

Cumulative exponential histogram aggregations operate on OpenTelemetry exponential histograms with cumulative temporality, and on Prometheus native histograms with an exponential bucket layout.

Cumulative exponential histograms support this aggregation method:

SUM: Merges input cumulative exponential histograms by the configured label policy. The output metric type is a CUMULATIVE_EXPONENTIAL_HISTOGRAM.

`DELTA_EXPONENTIAL_HISTOGRAM`

Delta exponential histogram aggregations operate on OpenTelemetry exponential histograms with delta temporality.

Delta exponential histograms support this aggregation method:

SUM: Merges input delta exponential histograms by the configured label policy. The output metric type is a DELTA_EXPONENTIAL_HISTOGRAM.

View rollup rules

Select from the following methods to view rollup rules.

In Observability Platform, view rollup rules in the Aggregation rules UI.

For information about viewing, copying, or downloading rule configurations, see Rule configuration.

Create a rollup rule

Select from the following methods to apply rollup rules. Observability Platform doesn’t limit the number of rollup rules a system can have.

If you define a rollup rule using the Observability Platform app, you must download the rule configuration and apply it with one of the supported methods.

Create rollup rule configurations in Observability Platform from the Aggregation rules UI.

When creating a rule configuration, the Visual Editor displays by default. When creating a rule in Metrics Analyzer, the dialog pre-populates fields based on the user’s selected data.

To create a rule configuration:

Enter or edit data for the following fields:
- Rule Name: Add or edit the name of the rule.
- Rule Details: Either Rule Preview or Rule Enabled.
- Matching Time Series: Time series the rule applies to. You must include a Label, operator (= or !=), and a Value. The value you enter maps to the filters section of the CreateRollupRule endpoint.
  
  For example, if you want the rollup rule to match on Prometheus gauge metrics, enter __m3_prom_type__ as the label to match on, and gauge as the value. The resulting filter looks like:
```
__m3_prom_type__ = gauge
```
  Separate multiple values with a comma. You can use glob syntax, including matching multiple patterns with an OR, such as service:{svc1,svc2}. Click Add to add another time series.
- Labels to Roll Up: Discard Labels, or Keep Labels. Add labels to the Input Labels text box.
- Output Metric: The new metric’s name and aggregation configuration.
  - Output Metric Name: Edit the output metric name. Clear the checkbox for Include metric name to remove the original name.
  - Input Metric Type: Select a metric type, which determines how the rollup rule interprets all matching data points. For example, if you select Gauge, the rollup rule interprets all matching data points as that data type, even if the original source isn’t a gauge metric. This behavior means that the metric type you choose doesn’t have to match the data type of the incoming data.
    
    If you want to match the incoming metric to a specific type, enter two matching time series in the rollup rule: one to match the metric, and another to match the metric type. Use __metric_type__ to define the type of metric you want to match on. For example, if you want to match a time series named agg_write_latency that’s a cumulative exponential histogram, define two series that look like:
```
__name__ = agg_write_latency AND __metric_type__ = cumulative_exponential_histogram
```
  - Aggregation: Select an aggregation operation.
  - Sample Interval: The length of time between samples.
- Raw Data: Select the toggle to drop the raw input data after aggregation.
When finished, click Code Config.
Choose your rule creation method from these options:
- Chronoctl
- Terraform
- API
Apply the changes based on your selected method.

Rollup rules take effect immediately, but can require a full recording interval to show a change.

Best practices for rule creation

Following these guidelines helps ensure your rollup rules work as intended:

Use Live Telemetry Analyzer to verify your glob syntax to ensure your query matches the correct metrics.
Before using a rollup rule to group labels, be sure those labels aren’t used in other places, such as dashboards, monitors, or the queries you use to debug issues.
Filters using curly braces ({}) shouldn’t use a dash (-) in the filter for label names, because the filter interprets the dash as a range. For example, service_cluster: !{my-label} fails. Rewrite the filter to service_cluster:!my-label instead.
Metrics can match more than one rule. Matching multiple rules can affect data retention. If any of the matching rules sets drop_raw=true, raw metrics are dropped.
If a single output series receives more than 10 million unique input series, Observability Platform might stop accepting new input series specified in the rollup rule, which could result in partially aggregated metrics. To avoid this behavior, choose a label policy that writes more output series by removing fewer labels.
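The interaction between multiple matching rules and drop_raw can be summarized in a few lines of Python. The `Rule` class and the slugs are hypothetical stand-ins for rollup rule definitions. The only behavior modeled is the one stated above: a single matching rule with drop_raw=true is enough to drop the raw metrics.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    slug: str
    drop_raw: bool

def raw_data_is_dropped(matching_rules):
    """Raw metrics are dropped if any matching rule sets drop_raw to true."""
    return any(rule.drop_raw for rule in matching_rules)

matching = [
    Rule(slug="requests_by_service", drop_raw=False),
    Rule(slug="requests_without_instance", drop_raw=True),
]
print(raw_data_is_dropped(matching))
```

Reviewing every rule that matches a metric, not only the one being edited, helps avoid dropping raw data unintentionally.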

Chronoctl rollup rule example

Here’s an example of a rollup rule that matches time series named permits_blocked and discards the instance and job labels. It treats the input as a counter metric and aggregates it as a sum over a 30-second interval.

api_version: v1/config
kind: RollupRule
spec:
  slug: permits_blocked_without_instance
  name: permits blocked without instance
  filters:
    - name: __name__
      value_glob: permits_blocked
  metric_name: '{{ .MetricName }}:without_instance'
  metric_type: COUNTER
  aggregation: SUM
  interval: 30s
  label_policy:
    discard:
    - instance
    - job
  mode: ENABLED

Terraform rollup rule example

resource "chronosphere_rollup_rule" "permits_blocked_without_instance" {
  name        = "permits blocked without instance"
  slug        = "permits_blocked_without_instance"
  filter      = "__name__:permits_blocked"
  permissive  = false
  metric_type = "COUNTER"
  aggregation = "SUM"
  interval    = "30s"
  exclude_by  = ["instance", "job"]
  mode        = "ENABLED"
  new_metric  = "{{ .MetricName }}:without_instance"
}

Delete a rollup rule

To delete a rollup rule with Chronoctl, use the chronoctl rollup-rules delete command and provide the slug of the rule to delete.

chronoctl rollup-rules delete SLUG

For example, to delete the http_request_duration_by_service_and_status rule, use this command:

chronoctl rollup-rules delete http_request_duration_by_service_and_status

If your slug starts with a dash (-), use double quotes (") around the slug name.

chronoctl rollup-rules delete "/-my-rollup-rule"
