Rollup rules

To downsample and aggregate metrics after they're sent by the client but before they're stored, create rollup rules.

Rollup rules are a type of aggregation rule that help you reduce the cardinality footprint of your metrics by dropping raw data to eliminate unneeded labels. High cardinality footprints can cause slow dashboards and queries.

If you're working with late-arriving data, rollup rules are well suited for ensuring all of your data aggregates the way you need it.

As an example, instance or pod labels don't often add value on their own, but removing these labels from the client side isn't always possible. You can use rollup rules to avoid storing these labels.

Rollup rules support both Prometheus and Graphite metrics.

Rollup rule attributes

To aggregate your data accurately, you must configure several rollup rule fields and understand how each aggregation operation works.

Fields for rollup rules

The following fields are part of the rollup_rule object that you define when creating a rollup rule with Terraform, Chronoctl, or the CreateRollupRule API.

This list isn't comprehensive. See the Rollup rules API documentation for the complete list of fields.

  • aggregation: Specifies how to combine the grouped metrics. See the supported aggregation operations for specific help.

  • drop_raw: Defaults to false. Set to true to remove raw metrics that match this rollup rule. For more information, see Mapping rules.

  • filter: Filters incoming metrics by label. If multiple label filters are specified, an incoming metric must match every label filter to match the rule. Label values support glob patterns, including matching multiple patterns with an OR, such as service:{svc1,svc2}.

    Several special filters are available for matching metrics by non-label request metadata:

    Filter | Description | Valid values
    __metric_type__ | Matches on the incoming metric's Observability Platform metric type. This is the recommended method for filtering on metric type. | cumulative_counter, delta_counter, gauge, or measurement
    __metric_source__ | Matches on the incoming metric's source format. | carbon, chrono_gcp, dogstatsd, open_metrics, open_telemetry, prometheus, signalfx, statsd, or wavefront
    __m3_prom_type__ | When ingesting with Prometheus, matches on the incoming metric's Prometheus metric type. | counter, gauge, histogram, gauge_histogram, summary, info, state_set, or quantile
    __otel_type__ | When ingesting with OpenTelemetry, matches on the incoming metric's OpenTelemetry metric type. | sum, monotonic_sum, gauge, histogram, exp_histogram, or summary
    __otel_temporality__ | When ingesting with OpenTelemetry, matches on the incoming metric's OpenTelemetry temporality. | delta or cumulative
    __m3_type__ | DEPRECATED. Matches on the incoming metric's legacy M3 type. | counter, gauge, or timer

    Example:

    __metric_type__:cumulative_counter service:gateway __name__:http_requests_*

    This filter matches any cumulative counter metric with a service=gateway label whose metric name starts with http_requests_.

  • expansive_match: Defaults to false. If false, a series matches and aggregates only if each label defined by the provided filters and by the label_policy.keep or graphite_label_policy.replace settings exists in the series. Set expansive_match to true to remove this restriction.

  • interval: The distance in time between aggregated data points. Intervals are based on your retention policy. Use this optional field to set a custom interval. (Known as storage_policies in version 0.286.0-2023-01-06-release.1 and earlier.)

  • A label policy: Label policies act as a filter, defining which labels to preserve in the resulting metric. Use group_by to keep one or more labels, or exclude_by to ignore one or more labels. (These are known as keep and discard in the Aggregation Rules UI.)

  • metric_type: The metric type aggregated. Valid options vary depending on your data. See the supported aggregation operations for specific help.

    Each rollup rule must declare the type of metric it aggregates by setting the metric_type field, because each metric type aggregates differently.

    ⚠️ Choosing the wrong metric_type for your rule can produce unexpected results.

  • new_metric (Terraform) | metric_name (Chronoctl, API): The name of the new metric to create and persist to the database. You can use the template string {{ .MetricName }} to create a new metric name that references the original metric name. For instance, new_metric: '{{ .MetricName }}:by_instance' outputs a metric with the name my_metric:by_instance if the matched metric is my_metric.

    This field is optional for Graphite rollup rules.
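The filter and new_metric semantics described in the list above can be sketched in Python. This is only an illustration, not the platform's implementation: the function names are hypothetical, the special filters such as __metric_type__ are modeled as ordinary dictionary keys, and Python's fnmatch stands in for the platform's glob matcher, whose exact syntax might differ.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b} alternation into a list of plain glob patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def matches_filter(filter_string, labels):
    """Return True only if every key:glob clause matches the series labels."""
    for clause in filter_string.split():
        key, _, glob = clause.partition(":")
        value = labels.get(key)
        if value is None:
            return False  # the series lacks a label that the filter requires
        if not any(fnmatch.fnmatchcase(value, p) for p in expand_braces(glob)):
            return False
    return True


def render_metric_name(template, metric_name):
    """Substitute the {{ .MetricName }} placeholder, tolerating inner spaces."""
    return re.sub(r"\{\{\s*\.MetricName\s*\}\}", lambda _: metric_name, template)


# Hypothetical incoming series, with request metadata modeled as labels.
series = {
    "__metric_type__": "cumulative_counter",
    "service": "gateway",
    "__name__": "http_requests_total",
}
rule_filter = "__metric_type__:cumulative_counter service:gateway __name__:http_requests_*"
print(matches_filter(rule_filter, series))
print(render_metric_name("{{ .MetricName }}:by_instance", "my_metric"))
```

Because every clause must match, adding a clause can only narrow the set of series that a rule selects.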

Label policies

You define which labels to preserve in the resulting metric through the use of label policies. To do this, add the appropriate field to the rollup rule definition.

You can set only one of group_by or exclude_by per rollup rule. Graphite metrics support only the exclude_by rule type.

  • group_by (keep in the Aggregation Rules UI)

    When using group_by rollup rules, you must specify the labels by which to aggregate the metrics. group_by retains only the selected labels and discards all others. The rule aggregates only metrics that contain every label specified by group_by; a metric that is missing any of those labels isn't included in the rule.

    Use a group_by rule when there are individual metrics you can filter with the __name__ label that you want to aggregate.

  • exclude_by (discard in the Aggregation Rules UI)

    With an exclude_by rollup rule, you specify which labels to remove from the aggregated metric, while keeping all other labels.

    Use an exclude_by rule when you want to target a group of metrics for a particular service, team, or other higher level set of metrics.
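As a rough illustration of how the two label policies differ, the following Python sketch computes the labels of the output series for a single input series. The function name and the dictionary representation of labels are assumptions made for this example, and the metric name is treated as handled separately.

```python
def apply_label_policy(labels, group_by=None, exclude_by=None):
    """Return the labels of the aggregated series, or None if the series is skipped."""
    if (group_by is None) == (exclude_by is None):
        raise ValueError("set exactly one of group_by or exclude_by")
    if group_by is not None:
        # group_by: the series must carry every listed label; keep only those.
        if any(name not in labels for name in group_by):
            return None
        return {name: labels[name] for name in group_by}
    # exclude_by: drop the listed labels and keep everything else.
    return {name: value for name, value in labels.items() if name not in exclude_by}


labels = {"service": "gateway", "instance": "i-1", "job": "api"}
print(apply_label_policy(labels, group_by=["service"]))
print(apply_label_policy(labels, group_by=["service", "region"]))
print(apply_label_policy(labels, exclude_by=["instance"]))
```

Input series that produce the same output labels are aggregated together, which is where the cardinality reduction comes from.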

Set a Graphite label policy

For Graphite metrics, you can use the graphite_label_policy parameter to also set a Graphite-specific label policy. This lets you define replacements for label values without changing their positions, which can reduce cardinality without breaking Graphite metrics' preferred positional indexing.

For example, assume you have raw metric names that follow this pattern:

cluster.production.instance.instance1.requests_count
cluster.production.instance.instance2.requests_count
...

You can create a Graphite label policy that defines a replacement rule that replaces the value of the positional label __g3__ (the fourth dot-separated segment, because positions count from __g0__) with a new string value (INSTANCE).

This replacement aggregates these metrics as cluster.production.instance.INSTANCE.requests_count, without changing their positional indexing.

The output of the chronoctl rollup-rules scaffold command includes the graphite_label_policy parameter:

api_version: v1/config
kind: RollupRule
spec:
  ...
    graphite_label_policy:
      # Required list of labels to replace. Useful for discarding
      # high-cardinality values while still preserving the original positions of
      # the Graphite metric.
      replace:
        - # Required name of the label whose value should be replaced. Only
          # '__gX__' labels are allowed (aka positional Graphite labels).
          name: <string>
          # Required new value of the replaced label.
          new_value: <string>
  ...

To implement the rule from the example scenario as a Chronoctl YAML resource, define the name and new_value in the list of replace values:

api_version: v1/config
kind: RollupRule
spec:
  ...
    graphite_label_policy:
      replace:
        - name: "__g3__"
          new_value: "INSTANCE"
  ...

Define multiple replacements in a single rollup rule by adding more pairs of name and new_value to the replace list.
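The effect of a replace list on a Graphite metric name can be sketched in Python as follows. The function name is hypothetical, positions are assumed to count from __g0__ as the example in this section implies, and returning None for a name that lacks a referenced position is an assumption that mirrors the default expansive_match: false behavior.

```python
import re


def apply_graphite_replacements(metric_name, replacements):
    """Replace positional segments (__gX__) of a dot-separated Graphite name."""
    segments = metric_name.split(".")
    for label, new_value in replacements.items():
        match = re.fullmatch(r"__g(\d+)__", label)
        if match is None:
            raise ValueError(f"only positional __gX__ labels are allowed: {label}")
        position = int(match.group(1))
        if position >= len(segments):
            return None  # assumed: the series doesn't match the rule
        segments[position] = new_value
    return ".".join(segments)


rule = {"__g3__": "INSTANCE"}
for name in (
    "cluster.production.instance.instance1.requests_count",
    "cluster.production.instance.instance2.requests_count",
):
    print(apply_graphite_replacements(name, rule))
```

Both inputs map to the same output name, so they aggregate into one series while every other segment keeps its position.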

Supported aggregation operations

Some operations can change the type of the metric during aggregation. The resulting metric type of an aggregation is called the output metric type.

Even if you are ingesting data with the wrong metric type, configure your rollup rule with the metric type that the ingested data should be. For example, if Chronosphere Observability Platform ingests metrics with type GAUGE, but the values actually represent DELTA_COUNTER, use a metric_type=DELTA_COUNTER rollup rule to aggregate them.

Rollup rules support the following aggregation operations:

CUMULATIVE_COUNTER

Cumulative counters support these aggregations:

  • SUM: Takes the increase of each individual input series within the configured interval, then sums the increases together according to the configured label policy. The output is the cumulative summed increase across all input series.

  • COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality).

The output type of all cumulative counter aggregations is a CUMULATIVE_COUNTER.
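To make the SUM behavior for cumulative counters concrete, here is a simplified Python sketch of one aggregation interval. It assumes an exclude_by policy, no counter resets inside the interval, and that the first sample in the interval serves as the baseline; the function name and data layout are hypothetical, and the platform's handling of resets and interval boundaries isn't described in this document.

```python
from collections import defaultdict


def rollup_counter_sum(series_samples, exclude_by):
    """Sum per-series increases for one interval, grouped by the remaining labels."""
    totals = defaultdict(int)
    for labels, values in series_samples:
        # Increase of this one series within the interval (no resets assumed).
        increase = values[-1] - values[0]
        group = tuple(sorted(
            (name, value) for name, value in labels.items() if name not in exclude_by
        ))
        totals[group] += increase
    return dict(totals)


interval_samples = [
    ({"service": "gateway", "instance": "i-1"}, [100, 110, 130]),
    ({"service": "gateway", "instance": "i-2"}, [40, 42, 45]),
]
print(rollup_counter_sum(interval_samples, exclude_by={"instance"}))
```

Summing increases rather than raw values keeps the result meaningful even though each instance's counter started from a different offset.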

GAUGE

Gauges support the following aggregation methods:

  • SUM: Takes the maximum value of each individual input series within the configured interval, then sums those per-series values together according to the configured label policy.

  • COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality).

  • MIN: Takes the minimum value of all data points within the configured interval across all series matched by the configured label policy.

  • MAX: Takes the maximum value of all data points within the configured interval across all series matched by the configured label policy.

  • PXX, MEAN, MEDIAN, STDEV, SUMSQ: Takes the maximum value of each individual input series within the configured interval, then computes the desired value distribution.

The output type of all gauge aggregations is a GAUGE.

When you query a gauge metric with a range vector, downsampling might reduce the accuracy of the query result. Most use cases that fit these criteria can be converted to use counters instead, which avoids the issue.

DELTA_COUNTER

Supported aggregations:

  • SUM: Sums all values of all series matched by the configured label policy. All values must be nonnegative.

  • COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality).

  • COUNT_SAMPLES: Counts the number of input samples matched by the configured label policy.

The output type of all delta counter aggregations is a CUMULATIVE_COUNTER.

Exceptions for DELTA_COUNTER metrics

DELTA_COUNTER metrics don't require the following fields for rollup rules:

  • name
  • aggregation
  • keep
  • discard

MEASUREMENT

A key feature of MEASUREMENT aggregations lies in how they treat individual samples. Unlike other types such as GAUGE and CUMULATIVE_COUNTER, MEASUREMENT metrics aggregate all at once, across all samples of your matching time series within the aggregated time interval. This enables calculation of accurate statistics server-side, within Observability Platform.

A typical use case for MEASUREMENT aggregations is calculating statistics over raw request latencies from all instances. You can do this correctly with metric_type=MEASUREMENT and aggregation=P95. Using metric_type=GAUGE in this scenario produces undesired results, because it discards all samples except the per-instance maximum value, then computes the ninety-fifth percentile across these per-instance maximum values.

Measurements support all aggregation methods:

  • SUM: Sums all values of all series matched by the configured label policy. All values must be nonnegative. The output metric type is a CUMULATIVE_COUNTER.

  • COUNT: Counts the number of unique input series matched by the configured label policy (that is, the cardinality). The output metric type is a CUMULATIVE_COUNTER.

  • COUNT_SAMPLES: Counts the number of input samples matched by the configured label policy. The output metric type is a CUMULATIVE_COUNTER.

  • LAST: Takes the last value of all samples matched by the configured label policy. The output metric type is a GAUGE.

  • MIN: Takes the minimum value of all samples matched by the configured label policy. The output metric type is a GAUGE.

  • MAX: Takes the maximum value of all samples matched by the configured label policy. The output metric type is a GAUGE.

  • PXX, MEAN, MEDIAN, STDEV, SUMSQ: Computes the desired value distribution across all samples matched by the configured label policy. The output metric type is a GAUGE.

  • HISTOGRAM: Summarizes the distribution of values as an exponential histogram with a scale of 3. The output type is a CUMULATIVE_EXPONENTIAL_HISTOGRAM.
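The difference between MEASUREMENT and GAUGE in the request latency use case described above can be shown with a small Python sketch. The latency values are made up, the function names are hypothetical, and the nearest-rank percentile method is an assumption; the platform's exact percentile calculation isn't specified here.

```python
def percentile(values, percent):
    """Nearest-rank percentile for an integer percent between 1 and 100."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # ceiling without floats
    return ordered[max(rank, 1) - 1]


def measurement_percentile(samples_by_instance, percent):
    """MEASUREMENT-style: pool every sample from every series, then compute."""
    pooled = [v for samples in samples_by_instance.values() for v in samples]
    return percentile(pooled, percent)


def gauge_percentile(samples_by_instance, percent):
    """GAUGE-style as described above: keep one value per series, then compute."""
    per_instance_max = [max(samples) for samples in samples_by_instance.values()]
    return percentile(per_instance_max, percent)


# Made-up latencies in milliseconds: mostly fast, one slow request per instance.
latencies = {
    "instance-a": [10] * 19 + [900],
    "instance-b": [12] * 19 + [950],
    "instance-c": [11] * 19 + [1000],
}
print(measurement_percentile(latencies, 95))  # 12
print(gauge_percentile(latencies, 95))        # 1000
```

With this data, the pooled calculation reports a ninety-fifth percentile of 12 ms, while the per-instance-maximum calculation reports 1000 ms, because each instance contributes only its single slowest request.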

Supported histogram aggregation operations

If either the input histogram or resulting aggregation exceeds the 160-bucket limit, Observability Platform decreases the exponential histogram scale until the bucket count is within the limit. Downscaling reduces the exponential histogram's resolution.
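The following Python sketch illustrates what decreasing the scale means for an exponential histogram. It relies on the OpenTelemetry convention that lowering the scale by one merges each adjacent pair of buckets (bucket index i moves to i >> 1). The function names, the dictionary layout, and measuring the bucket count as the span between the lowest and highest populated index are assumptions for illustration.

```python
MAX_BUCKETS = 160  # bucket limit stated above


def bucket_span(buckets):
    """Number of bucket positions between the lowest and highest index."""
    return max(buckets) - min(buckets) + 1 if buckets else 0


def downscale_to_limit(buckets, scale, max_buckets=MAX_BUCKETS):
    """Halve the resolution until the histogram fits within the bucket limit."""
    while bucket_span(buckets) > max_buckets:
        merged = {}
        for index, count in buckets.items():
            # Dropping one scale step merges buckets 2k and 2k+1 into bucket k.
            merged[index >> 1] = merged.get(index >> 1, 0) + count
        buckets, scale = merged, scale - 1
    return buckets, scale


# A scale-3 histogram that spans 320 buckets, one observation in each.
wide = {index: 1 for index in range(320)}
narrowed, new_scale = downscale_to_limit(wide, scale=3)
print(new_scale, bucket_span(narrowed), sum(narrowed.values()))
```

The total observation count is preserved, but each output bucket now covers twice the value range, which is the loss of resolution that downscaling causes.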

CUMULATIVE_EXPONENTIAL_HISTOGRAM

Cumulative exponential histogram aggregations operate on OpenTelemetry exponential histograms with cumulative temporality, and on Prometheus native histograms with an exponential bucket layout.

Cumulative exponential histograms support this aggregation method:

DELTA_EXPONENTIAL_HISTOGRAM

Delta exponential histogram aggregations operate on OpenTelemetry exponential histograms with delta temporality.

Delta exponential histograms support this aggregation method:

View rollup rules

Select from the following methods to view rollup rules.

In Observability Platform, view rollup rules in the Aggregation rules UI.

For information about viewing, copying, or downloading rule configurations, see Rule configuration.

Create a rollup rule

Select from the following methods to apply rollup rules. Observability Platform doesn't limit the number of rollup rules a system can have.

If you define a rollup rule using the Observability Platform app, you must download the rule configuration and apply it with one of the supported methods.

Create rollup rule configurations in Observability Platform from the Aggregation rules UI.

When creating a rule configuration, the Visual Editor displays by default. When creating a rule in Metrics Analyzer, the dialog pre-populates fields based on the user's selected data.

To create a rule configuration:

  1. Enter or edit data for the following fields:
    • Rule name: Add or edit the name of the rule.
    • Rule mode: Either Rule Preview or Rule Enabled.
    • Matching Time Series: Time series the rule applies to. Comma-separated, and supports glob syntax. Add a Label, a function (= or !=), and a Value. Click Add to add another time series.
    • Labels to Roll Up: Discard Labels, or Keep Labels. Add labels to the Input Labels text box.
    • Output Metric: The new metric's name and aggregation configuration.
      • Output Metric Name: Edit the output metric name. Clear the checkbox for Include metric name to remove the original name.
      • Input Metric Type: Select a metric type.
      • Aggregation: Select an aggregation operation.
      • Sample Interval: The length of time between samples.
    • Raw Data: Select the toggle to drop the raw input data after aggregation.
  2. When finished, click Code Config.
  3. Choose your rule creation method from these options:
    • Chronoctl
    • Terraform
    • API
  4. Apply the changes based on your selected method.

Rollup rules take effect immediately, but can require a full recording interval to show a change.

Best practices for rule creation

Following these guidelines helps ensure your rollup rules work as intended:

  • Use Live Telemetry Analyzer to verify your glob syntax to ensure your query matches the correct metrics.
  • Before using a rollup rule to group labels, be sure those labels aren't used in other places, such as dashboards, monitors, or the queries you use to debug issues.
  • Filters that use curly braces ({}) shouldn't contain a dash (-), because a dash inside the braces is interpreted as a range. For example, service_cluster:!{my-label} fails. Rewrite the filter as service_cluster:!my-label instead.
  • Metrics can match more than one rule. Matching multiple rules can affect data retention. If any matching rule sets drop_raw=true, the raw metrics are dropped.

Chronoctl rollup rule example

Here's an example of a rollup rule that matches time series with the metric name permits_blocked and discards the instance and job labels. It treats the input as a counter metric and aggregates it as a sum at a 30-second interval.

api_version: v1/config
kind: RollupRule
spec:
  slug: permits_blocked_without_instance
  name: permits blocked without instance
  filters:
    - name: __name__
      value_glob: permits_blocked
  expansive_match: false
  metric_name: '{{ .MetricName }}:without_instance'
  metric_type: COUNTER
  aggregation: SUM
  interval: 30s
  label_policy:
    discard:
    - instance
    - job
  mode: ENABLED

Terraform rollup rule example

Here's an example of a rollup rule that matches time series with the metric name permits_blocked and discards the instance and job labels. It treats the input as a counter metric and aggregates it as a sum at a 30-second interval.

resource "chronosphere_rollup_rule" "permits_blocked_without_instance" {
  name            = "permits blocked without instance"
  slug            = "permits_blocked_without_instance"
  filter          = "__name__:permits_blocked"
  expansive_match = "false"
  metric_type     = "COUNTER"
  aggregation     = "SUM"
  interval        = "30s"
  exclude_by      = ["instance", "job"]
  mode            = "ENABLED"
  new_metric      = "{{ .MetricName }}:without_instance"
}

Delete a rollup rule

To delete a rollup rule, use the chronoctl rollup-rules delete command and provide the slug of the rule to delete.

chronoctl rollup-rules delete SLUG

For example, to delete the http_request_duration_by_service_and_status rule, use this command:

chronoctl rollup-rules delete http_request_duration_by_service_and_status

If your slug starts with a dash (-), use double quotes (") around the slug name.

chronoctl rollup-rules delete "/-my-rollup-rule"