Rollup rules

Create rollup rules to downsample and aggregate metrics before they're stored. Rollup rules are a type of aggregation rule that helps you reduce the cardinality footprint of your metrics by aggregating away unneeded labels and, optionally, dropping the raw data. High cardinality footprints can cause slow dashboards and queries.

If you're working with late-arriving data, rollup rules are well suited for ensuring all of your data aggregates the way you need it.

As an example, instance or pod labels don't often add value on their own, but removing these labels from the client side isn't always possible. You can use rollup rules to avoid storing these labels.

Rollup rules support both Prometheus and Graphite metrics.

View rollup rules

In the Chronosphere app, view rollup rules in the Aggregation rules UI.

To list only rollup rules on the command line, use Chronoctl. The command chronoctl rollup-rules list returns all rollup rules.

For information about viewing, copying, or downloading rule configurations, see Rule configuration.

Create rollup rules

You can apply rollup rules by using either Chronoctl or Terraform. Chronosphere doesn't limit the number of rollup rules a system can have.

Fields for rollup rules

See the Rollup rules API documentation for the full list of fields.

Each rollup rule must declare the type of metric it aggregates by setting the metric_type field, because each metric type aggregates differently. Choosing the wrong metric_type for your rule can produce unexpected results.

  • aggregation: Specifies how to combine the grouped metrics. See the supported aggregation operations for specific help.

  • metric_type: The metric type aggregated. Valid options vary depending on your data. See the supported aggregation operations for specific help.

  • filter: Specifies the names of the metrics the rule matches. Filters can include both Prometheus and Graphite metrics.

    For example:

    filter: __name__:coordinator_http_handler_http_handler_latency_count chronosphere_k8s_namespace:test

    Label filters can include multiple labels. Metrics must match all specified labels for the filter to apply. Label values support glob patterns, including matching multiple patterns with an OR, such as k8s_pod:{name1,name2}.

    The __m3_prom_type__ label prefix gives you access to Prometheus and StatsD metric types.

  • A label policy: Label policies act as a filter, defining which labels to preserve in the resulting metric. Use group_by to keep one or more labels, or exclude_by to ignore one or more labels. (Known as keep and discard in the Aggregation Rules UI.)

  • label_replace (Graphite only, required): Specifies label and value pairs where the replacement label's current value replaces the existing value. Optional regular expressions and templating are available for label replacement.

    type LabelReplace struct {
      // The source label name
      SrcLabelName string
      // The new value for the label with name SrcLabelName. This supports regex expansion
      // if a LabelValueRegex is supplied, and is used as a literal otherwise.
      NewLabelValue string
      // The optional regex that should match label value. If this regex is provided, it
      // needs to match the value of the label SrcLabelName to apply the replace. This
      // supports capture groups to extract parts of the label value.
      // If omitted, NewLabelValue is used as a literal to replace any SrcLabelName label
      // value.
      LabelValueRegex string
    }
    • SrcLabelName: The source label name.
    • NewLabelValue: The new value for the label specified by SrcLabelName. This supports regex expansion if LabelValueRegex is supplied. Otherwise, it's used as a literal replacement.
    • LabelValueRegex: An optional regular expression, supporting capture groups.
  • new_metric: Optional for Graphite. The name of the new metric to create and persist to the database. You can use the template string {{ .MetricName }} in the new_metric to create a new metric name that references the original metric name. For instance, new_metric: '{{ .MetricName }}:by_instance' would output a metric with the name my_metric:by_instance if the matched metric was my_metric.

  • interval: The distance in time between aggregated data points. Intervals are based on your retention policy. Use this optional field to set a custom interval. (Known as storage_policies in version 0.286.0-2023-01-06-release.1 and earlier.)

  • drop_raw: Defaults to false. Set to true to remove raw metrics that match this rollup rule. For more information, see Mapping rules.
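
The behavior described for new_metric and label_replace can be sketched in Go. This is an illustration under stated assumptions, not Chronosphere's implementation: the helper names applyLabelReplace and renderNewMetric, the host label, and its value are hypothetical. The sketch also assumes that capture groups are referenced with Go's $1 syntax and that a regex which doesn't match leaves the label unchanged.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"text/template"
)

// LabelReplace mirrors the struct shown in the field list.
type LabelReplace struct {
	SrcLabelName    string
	NewLabelValue   string
	LabelValueRegex string
}

// applyLabelReplace returns a copy of labels with the rule applied.
// Assumptions: $1-style capture references, and no change on a non-match.
func applyLabelReplace(labels map[string]string, r LabelReplace) (map[string]string, error) {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[k] = v
	}
	current, ok := out[r.SrcLabelName]
	if !ok {
		return out, nil // the source label is absent, so there is nothing to replace
	}
	if r.LabelValueRegex == "" {
		out[r.SrcLabelName] = r.NewLabelValue // literal replacement
		return out, nil
	}
	re, err := regexp.Compile(r.LabelValueRegex)
	if err != nil {
		return nil, err
	}
	match := re.FindStringSubmatchIndex(current)
	if match == nil {
		return out, nil
	}
	out[r.SrcLabelName] = string(re.ExpandString(nil, r.NewLabelValue, current, match))
	return out, nil
}

// renderNewMetric expands a new_metric template against the matched metric name.
func renderNewMetric(tmpl, metricName string) (string, error) {
	t, err := template.New("new_metric").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, struct{ MetricName string }{metricName}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	name, err := renderNewMetric("{{ .MetricName }}:by_instance", "my_metric")
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // my_metric:by_instance

	// Hypothetical label and pattern: keep only the first dotted segment.
	out, err := applyLabelReplace(
		map[string]string{"host": "web-01.example.com"},
		LabelReplace{SrcLabelName: "host", NewLabelValue: "$1", LabelValueRegex: `^([^.]+)\.`},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out["host"]) // web-01
}
```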

Label policies

You define which labels to preserve in the resulting metric through the use of label policies. To do this, add the appropriate field to the rollup rule definition:

Only one of group_by or exclude_by can be set per rollup rule.

Graphite metrics support only the exclude_by rule type.

  • group_by (keep in the Aggregation Rules UI)

    When using group_by rollup rules, you must specify the labels by which to aggregate the metrics. The rule aggregates only metrics that contain all of the keep labels. group_by retains only the selected labels and discards any other labels. If a metric doesn't include all of the labels specified by group_by, the metric isn't included in the rule.

    Use a group_by rule when there are individual metrics you can filter with the __name__ label that you want to aggregate.

  • exclude_by (discard in the Aggregation Rules UI)

    With an exclude_by rollup rule, you specify which labels to remove from the aggregated metric, while keeping all other labels.

    Use an exclude_by rule when you want to target a group of metrics for a particular service, team, or other higher level set of metrics.
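
To make the difference between the two label policies concrete, the following Go sketch computes the label set that each policy would leave on an aggregated series. The function names groupBy and excludeBy and the sample labels are hypothetical, chosen only for illustration. The sketch models only which labels survive, not the aggregation itself.

```go
package main

import "fmt"

// groupBy keeps only the listed labels. ok is false when the series lacks
// any of them, in which case the rule doesn't include the series.
func groupBy(labels map[string]string, keep []string) (map[string]string, bool) {
	out := make(map[string]string, len(keep))
	for _, name := range keep {
		value, present := labels[name]
		if !present {
			return nil, false
		}
		out[name] = value
	}
	return out, true
}

// excludeBy drops the listed labels and keeps every other label.
func excludeBy(labels map[string]string, drop []string) map[string]string {
	dropped := make(map[string]bool, len(drop))
	for _, name := range drop {
		dropped[name] = true
	}
	out := make(map[string]string, len(labels))
	for name, value := range labels {
		if !dropped[name] {
			out[name] = value
		}
	}
	return out
}

func main() {
	series := map[string]string{
		"service":  "api",
		"status":   "200",
		"instance": "10.0.0.1",
		"pod":      "api-7f9c",
	}

	kept, ok := groupBy(series, []string{"service", "status"})
	fmt.Println(kept, ok) // map[service:api status:200] true

	fmt.Println(excludeBy(series, []string{"instance", "pod"})) // map[service:api status:200]

	// The series has no region label, so a group_by on it skips the series.
	_, ok = groupBy(series, []string{"service", "region"})
	fmt.Println(ok) // false
}
```

Both policies produce the same output here. They differ when new labels appear on the input: group_by continues to emit only the listed labels, while exclude_by passes any new label through.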

Best practices for rule creation

Following these guidelines helps ensure your rollup rules work as intended:

  • Before using a rollup rule to group labels, be sure those labels aren't used in other places, such as dashboards, monitors, or the queries you use to debug issues.
  • Filters that use curly braces ({}) shouldn't include a dash (-) in the label value, because the filter interprets the dash inside braces as a range. For example, service_cluster: !{my-label} fails. Rewrite the filter as service_cluster:!my-label instead.

Create rollup rule configurations in the Chronosphere app from either the Aggregation rules UI or the Metrics usage analyzer.

When creating a rule configuration, the Visual Editor displays by default. When creating a rule in the Metrics usage analyzer, the dialog pre-populates fields based on your selected data.

To create a rule configuration:

  1. Enter or edit data for the following fields:
    • Rule name: Add or edit the name of the rule.
    • Rule mode: Either Rule Preview or Rule Enabled.
    • Matching Time Series: Time series the rule applies to. Comma separated, and supports glob syntax. Add a Label, a function (= or !=), and a Value. Click Add to add another time series.
    • Labels to Roll Up: Discard Labels, or Keep Labels. Add labels to the Input Labels text box.
    • Output Metric: The new metric's name and aggregation configuration.
      • Output Metric Name: Edit the output metric name. Clear the checkbox for Include metric name to remove the original name.
      • Output Metric Type: Select a metric type.
      • Aggregation: Select an aggregation operation.
      • Sample Interval: The length of time between samples.
    • Raw Data: Select the toggle to drop the raw input data after aggregation.
  2. When finished, click Code Config.
  3. Choose your rule creation method from these options:
    • Chronoctl
    • Terraform
    • API
  4. Apply the changes based on your selected method.

Rollup rules take effect immediately, but can require a full recording interval to show a change.

Supported aggregation operations

Some operations can change the type of the metric during aggregation. The resulting metric type of an aggregation is called the output metric type.

Even if Chronosphere ingests data with the wrong metric type, configure your rollup rule with the metric type that the data actually represents. For example, if Chronosphere ingests metrics with type GAUGE, but the values actually represent DELTA_COUNTER, use a metric_type=DELTA_COUNTER rollup rule to aggregate them.

Rollup rules support the following aggregation operations:

CUMULATIVE_COUNTER

Cumulative counters support these aggregations:

  • SUM

    Takes the increase of each individual input series within the configured interval, then sums the increases together by the configured label policy. The output is the cumulative summed increase across all input series.

  • COUNT

    Counts the number of unique input series by the configured label policy (for example, cardinality).

The output type of all cumulative counter aggregations is a CUMULATIVE_COUNTER.
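
The SUM and COUNT descriptions above can be sketched in Go for a single interval. This is a simplified model rather than the actual aggregator: the names increase, rollupSum, and countSeries are hypothetical, each series is reduced to the samples seen in one interval, and counter resets are assumed not to occur.

```go
package main

import "fmt"

// increase returns how much one cumulative series grew across the samples
// observed in a single interval. Assumption: the counter never resets.
func increase(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	return samples[len(samples)-1] - samples[0]
}

// rollupSum adds the summed per-series increases for one interval to the
// running output total, so the output stays a cumulative counter.
func rollupSum(previousTotal float64, interval map[string][]float64) float64 {
	total := previousTotal
	for _, samples := range interval {
		total += increase(samples)
	}
	return total
}

// countSeries returns the number of unique input series in the interval.
func countSeries(interval map[string][]float64) int {
	return len(interval)
}

func main() {
	// Two hypothetical input series that roll up into one output series.
	interval := map[string][]float64{
		"pod-a": {100, 104, 110}, // increase of 10
		"pod-b": {40, 45},        // increase of 5
	}
	fmt.Println(rollupSum(200, interval)) // 215
	fmt.Println(countSeries(interval))    // 2
}
```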

Refer to Metric types - Counters for best practices when using counters.

GAUGE

Gauges support the following aggregation methods:

  • SUM

    Takes the last value of each individual input series within the configured interval, then sums all final values together by the configured label policy.

  • COUNT

    Counts the number of unique input series by the configured label policy (for example, cardinality).

  • MIN

    Takes the minimum value of all data points within the configured interval across all series matched by the configured label policy.

    The output gauge is downsampled in long-term storage by taking the minimum value of each downsample window, preserving the minimum value.

  • MAX

    Takes the maximum value of all data points within the configured interval across all series matched by the configured label policy.

    The output gauge is downsampled in long-term storage by taking the maximum value of each downsample window, preserving the maximum value.

  • PXX, MEAN, MEDIAN, STDEV, SUMSQ

    Takes the last value of each individual input series within the configured interval, then computes the desired value distribution.

The output type of all gauge aggregations is a GAUGE.
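
As a rough illustration of how gauge SUM differs from MIN and MAX, the Go sketch below aggregates one interval of data. The function names gaugeSum and gaugeMinMax and the sample values are hypothetical. SUM uses only the last value of each series, while MIN and MAX consider every data point.

```go
package main

import (
	"fmt"
	"math"
)

// gaugeSum takes the last value of each series in the interval and sums them.
func gaugeSum(interval map[string][]float64) float64 {
	var total float64
	for _, points := range interval {
		if len(points) > 0 {
			total += points[len(points)-1]
		}
	}
	return total
}

// gaugeMinMax scans every data point of every series in the interval.
func gaugeMinMax(interval map[string][]float64) (lo, hi float64) {
	lo, hi = math.Inf(1), math.Inf(-1)
	for _, points := range interval {
		for _, p := range points {
			lo = math.Min(lo, p)
			hi = math.Max(hi, p)
		}
	}
	return lo, hi
}

func main() {
	// Two hypothetical series matched by the same label policy.
	interval := map[string][]float64{
		"pod-a": {3, 9, 4},
		"pod-b": {7, 2, 6},
	}
	fmt.Println(gaugeSum(interval)) // 10 (4 + 6)

	lo, hi := gaugeMinMax(interval)
	fmt.Println(lo, hi) // 2 9
}
```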

When querying a gauge metric with a range vector included in the query, downsampling might impact the accuracy of the query result. Most use cases that fit this criteria can be converted to use counters instead, which avoids the issue.

DELTA_COUNTER

Supported aggregations:

  • SUM

    Sums all values of all series by the configured label policy. All values must be nonnegative.

  • COUNT

    Counts the number of unique input series by the configured label policy (for example, cardinality).

  • COUNT_SAMPLES

    Counts the number of input samples by the configured label policy.

The output type of all delta counter aggregations is a cumulative counter.
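
The three delta counter aggregations can be contrasted with a short Go sketch. The Sample type and the function names are hypothetical. Returning an error for a negative value is an assumption made for the sketch, because the text states only that values must be nonnegative.

```go
package main

import (
	"errors"
	"fmt"
)

// Sample is one delta value reported by one input series.
type Sample struct {
	Series string
	Value  float64
}

// deltaSum sums all values of all series.
// Assumption: a negative value is rejected with an error.
func deltaSum(samples []Sample) (float64, error) {
	var total float64
	for _, s := range samples {
		if s.Value < 0 {
			return 0, errors.New("delta counter values must be nonnegative")
		}
		total += s.Value
	}
	return total, nil
}

// countSeries counts the unique input series.
func countSeries(samples []Sample) int {
	seen := make(map[string]struct{})
	for _, s := range samples {
		seen[s.Series] = struct{}{}
	}
	return len(seen)
}

// countSamples counts every input sample, regardless of series.
func countSamples(samples []Sample) int {
	return len(samples)
}

func main() {
	samples := []Sample{
		{Series: "pod-a", Value: 3},
		{Series: "pod-b", Value: 2},
		{Series: "pod-a", Value: 5},
	}
	sum, err := deltaSum(samples)
	if err != nil {
		panic(err)
	}
	fmt.Println(sum)                   // 10
	fmt.Println(countSeries(samples))  // 2
	fmt.Println(countSamples(samples)) // 3
}
```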

Exceptions for DELTA_COUNTER metrics

DELTA_COUNTER metrics don't require the following fields for rollup rules:

  • name
  • aggregation
  • keep
  • discard

MEASUREMENT

Measurements support the following aggregation methods:

  • SUM

    Sums all values of all series by the configured label policy. The output metric type is a CUMULATIVE_COUNTER.

  • COUNT

    Counts the number of unique input series by the configured label policy (for example, cardinality). The output metric type is a CUMULATIVE_COUNTER.

  • COUNT_SAMPLES

    Counts the number of input samples by the configured label policy. The output metric type is a CUMULATIVE_COUNTER.

  • LAST

    Takes the last value of all samples by the configured label policy. The output metric type is a GAUGE.

  • MIN

    Takes the minimum value of all samples by the configured label policy. The output metric type is a GAUGE.

    The output gauge is downsampled in long-term storage by taking the minimum value of each downsample window, which preserves the minimum value.

  • MAX

    Takes the maximum value of all samples by the configured label policy. The output metric type is a GAUGE.

    The output gauge is downsampled in long-term storage by taking the maximum value of each downsample window, preserving the maximum value.

  • PXX, MEAN, MEDIAN, STDEV, SUMSQ

    Computes the desired value distribution by label policy. The output metric type is a GAUGE.
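
The output metric types listed in this section can be collected into one lookup, which might be useful for checking a rule before applying it. The Go sketch below is an illustration only: the outputType function is hypothetical, LAST refers to the last-value aggregation described for measurements, and PXX stands in for the percentile aggregations.

```go
package main

import "fmt"

// outputType returns the output metric type for a metric_type and
// aggregation pair, following the lists in this section. ok is false when
// the pair isn't listed as supported.
func outputType(metricType, aggregation string) (string, bool) {
	switch metricType {
	case "CUMULATIVE_COUNTER":
		switch aggregation {
		case "SUM", "COUNT":
			return "CUMULATIVE_COUNTER", true
		}
	case "GAUGE":
		switch aggregation {
		case "SUM", "COUNT", "MIN", "MAX", "PXX", "MEAN", "MEDIAN", "STDEV", "SUMSQ":
			return "GAUGE", true
		}
	case "DELTA_COUNTER":
		switch aggregation {
		case "SUM", "COUNT", "COUNT_SAMPLES":
			return "CUMULATIVE_COUNTER", true
		}
	case "MEASUREMENT":
		switch aggregation {
		case "SUM", "COUNT", "COUNT_SAMPLES":
			return "CUMULATIVE_COUNTER", true
		case "LAST", "MIN", "MAX", "PXX", "MEAN", "MEDIAN", "STDEV", "SUMSQ":
			return "GAUGE", true
		}
	}
	return "", false
}

func main() {
	out, ok := outputType("MEASUREMENT", "SUM")
	fmt.Println(out, ok) // CUMULATIVE_COUNTER true

	out, ok = outputType("MEASUREMENT", "MAX")
	fmt.Println(out, ok) // GAUGE true

	// COUNT_SAMPLES isn't listed for cumulative counters.
	_, ok = outputType("CUMULATIVE_COUNTER", "COUNT_SAMPLES")
	fmt.Println(ok) // false
}
```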

Delete rollup rules

To delete rollup rules, use the Chronoctl command chronoctl rollup-rules delete. Provide the slugs of the rules to delete by using the slug option.

For example, to delete the http_request_duration_by_service_and_status rule, use this command:

chronoctl rollup-rules delete slug http_request_duration_by_service_and_status

If your slug starts with a dash (-), use double quotes (") around the slug name.

chronoctl rollup-rules delete "/-my-rollup-rule"