Derived metrics

Although rollup rules reduce cardinality, they don't solve the problem of repeatedly computing complex expressions. Recording rules solve this problem by evaluating complex expressions on a schedule and saving the results as their own time series. However, recording rules require both computation resources and storage space. To address this, Chronosphere provides derived metrics.

Derived metrics let you create aliases for queries, effectively giving queries user-friendly names. For example, you can map the alias global:http_server_request_by_path to the query sum by (status, path) (irate(http_server_requests_count{service=~".+"}[1m])) > 0.

Derived metrics execute at query time, which means they incur computing overhead only when someone runs the query. Your query might be complex and time-consuming, but when you're viewing the results, you're almost always working with a subset of data. For example, consider a review of CPU usage for one or a few services out of thousands. By filtering the results to a small number of services, you allow the query to finish sooner.

Uses for derived metrics

Derived metrics help reduce overhead because they can:

  • Reduce alert and dashboard complexity
  • Replace recording rules
  • Provide frequently used aliases for queries

Reducing alert and dashboard complexity

You can create canonical queries to standardize dashboards and alerts. For example, if there is a common error rate query that several dashboards share, you can create a derived metric for use with all dashboards.

If you have complex queries you use only in dashboards and alerts with significant filtering, use derived metrics to remove the need to create and store new time series. Derived metrics are executed at query time and don't require extra storage.

Most recording rules can be implemented using derived metrics, but it's important that the results are properly filtered. For example, a query might natively return hundreds of thousands of time series, but in the context in which it's being used, it's probably being filtered by cluster, service, or namespace, which can significantly reduce the number of time series. If the filtering doesn't sufficiently reduce the number of resultant time series, keeping the recording rule can improve performance.

Replacing recording rules

A query outage or failure to execute the recording rule due to a timeout creates a gap in the recording rule's results, because the query must execute to generate the recorded metric.

With derived metrics, Chronosphere reads the underlying data at query time, preventing gaps.

Typically, users query recording rules with specified filters, such as some:recording:rule{label_1="value_1"}. Instead of executing the recording rule at a set interval, it's more performant to query this data only as you need it.

For example, with the following recording rule definition, Chronosphere executes the expression defined in expr every 30 seconds across all services (.+):

- bucket_slug: global
  name: global:http_server_request_by_path:irate1m
  slug: http-server-request-by-path
  interval: 30s
  expr: sum by (status, path) (irate(http_server_requests_count{service=~".+"}[1m])) > 0

There's little value in a query against all services, especially when plotted on a graph. Instead, scope the query to a specific service with filters such as global:http_server_request_by_path:irate1m{service="myservice"} to return a smaller and more focused result.

Using derived metrics, you can remove the need for this expensive recording rule and instead map the query global:http_server_request_by_path:irate1m{service="$my_service"} to the query sum by (status, path) (irate(http_server_requests_count{service="$my_service"}[1m])) > 0, which respects all filters.
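As a rough sketch, the mapping described above might look like the following Chronoctl definition. It assumes the same fields as the example in the Create a derived metric section. The name, slug, and description values are illustrative, and the behavior described in the comments is inferred from that example rather than confirmed.

```yaml
api_version: v1/config
kind: DerivedMetric
spec:
  # name, slug, and description are illustrative values.
  name: HTTP server requests by path
  slug: http-server-request-by-path
  metric_name: global:http_server_request_by_path:irate1m
  description: Request rate by status and path, computed at query time
  queries:
  - query:
      # $service stands in for the service filter on the incoming query.
      prometheus_expr: sum by (status, path) (irate(http_server_requests_count{$service}[1m])) > 0
      variables:
        - name: service
          # Assumed fallback when the query supplies no service filter.
          default_prometheus_selector: service=~".+"
```

With a definition like this, a query such as global:http_server_request_by_path:irate1m{service="myservice"} would evaluate the underlying expression for that service only, and nothing is computed or stored between queries.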

Frequently used aliases

If you have frequently accessed queries, derived metrics can simplify the creation of dashboards, alerts, and manual queries.

Additionally, many recording rules generated by third-party tools, such as Sloth, generate metrics that don't warrant persisting a new time series. Although these time series are negligible in storage and compute capacity, creating a derived metric is more efficient if you need such a metric.

View your derived metrics

To return a list of all derived metrics, use the Chronoctl command chronoctl derived-metrics list. To return only specific derived metrics, filter by slug with the --slugs flag.

For example, to list all derived metrics:

chronoctl derived-metrics list

To list derived metrics with slugs slug_name_1 and slug_name_2:

chronoctl derived-metrics list --slugs slug_name_1,slug_name_2

Create a derived metric

Here's a Chronoctl example of a derived metric with two underlying expressions:

api_version: v1/config
kind: DerivedMetric
spec:
  name: my derived metric
  slug: my-derived-metric
  metric_name: test_metric
  description: This is a test derived metric
  queries:
  - query:
      prometheus_expr: scrape_duration_seconds{$job}
      variables:
        - name: job
          default_prometheus_selector: job=~".*"
    selector:
      labels:
      - name: id
        type: EXACT
        value: abc
  - query:
      prometheus_expr: scrape_series_added{$job}
      variables:
        - name: job
          default_prometheus_selector: job=~".*"

Use a selector in your derived metric definition when you want a metric name that's used by different queries based on the selector, or when you want the same derived metric to map to different underlying metrics. This can cause performance issues if you have a large number of id items in use. Chronosphere doesn't support sums across id items. If you have many selectors, a recording rule is often a better option.

If you want to map only a derived metric name to a query, you don't need a selector.
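For comparison, a selector-free definition can be sketched by trimming the preceding example down to a single query. The name, slug, metric_name, and description values here are hypothetical. The remaining fields are copied from the example above.

```yaml
api_version: v1/config
kind: DerivedMetric
spec:
  # Hypothetical identifiers for illustration only.
  name: my simple derived metric
  slug: my-simple-derived-metric
  metric_name: simple_test_metric
  description: Maps one metric name to one query, without a selector
  queries:
  - query:
      prometheus_expr: scrape_duration_seconds{$job}
      variables:
        - name: job
          default_prometheus_selector: job=~".*"
```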

Delete a derived metric

Chronosphere prevents users from modifying Terraform-managed resources in the user interface, with Chronoctl, or by using the API. For details, see the Terraform provider documentation.

To delete a derived metric with Chronoctl, use the chronoctl derived-metrics delete command with the slug of the derived metric you want to delete:

chronoctl derived-metrics delete SLUG_NAME

You can delete more than one derived metric at a time by providing a comma-separated list of slugs. For example:

chronoctl derived-metrics delete SLUG_NAME_1,SLUG_NAME_2

Replace a recording rule

When you have a complex or slow recording rule, in some cases you can replace the rule with a derived metric.

For example, this recording rule computes the ratio of requests that return a 4xx or 5xx HTTP status code to all requests over a five-minute window:

- record: slo:sli_error:ratio_rate5m
  expr: |
    (sum(rate(flask_http_request_duration_seconds_count{job="default/productservice-servicemonitor/0", status=~"(5..|4..)"}[5m])))
    /
    (sum(rate(flask_http_request_duration_seconds_count{job="default/productservice-servicemonitor/0"}[5m])))
  labels:
    owner: customersuccess
    repo: chronosphereio/productservice
    sloth_id: productservice-requests-availability
    sloth_service: productservice
    sloth_slo: requests-availability
    sloth_window: 5m
    tier: "2"

Replacing the recording rule with this derived metric can reduce query load:

resource "chronosphere_derived_metric" "slo-error-rate-5m" {
  name        = "slo-error-rate-5m"
  slug        = "slo-error-rate-5m"
  description = "Service Error Rate - 5m"
  metric_name = "slo:sli_error:ratio_rate5m"

  queries {
    query {
      # $job expands to the job filter on the incoming query, or to default_selector.
      # The 5m window matches the recording rule being replaced.
      expr = <<-EOT
        sum(rate(flask_http_request_duration_seconds_count{
            $job,
            status=~"(5..|4..)"
        }[5m]))
        /
        sum(rate(flask_http_request_duration_seconds_count{
            $job
        }[5m]))
      EOT

      variables {
        name             = "job"
        default_selector = "job=~\".*\""
      }
    }
  }
}
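If you manage derived metrics with Chronoctl instead of Terraform, a roughly equivalent definition might look like the following sketch. It assumes the field names from the Chronoctl example in the Create a derived metric section (prometheus_expr and default_prometheus_selector) and keeps the 5m window of the recording rule it replaces.

```yaml
api_version: v1/config
kind: DerivedMetric
spec:
  name: slo-error-rate-5m
  slug: slo-error-rate-5m
  metric_name: slo:sli_error:ratio_rate5m
  description: Service Error Rate - 5m
  queries:
  - query:
      # Ratio of 4xx and 5xx requests to all requests over five minutes.
      prometheus_expr: |
        sum(rate(flask_http_request_duration_seconds_count{$job, status=~"(5..|4..)"}[5m]))
        /
        sum(rate(flask_http_request_duration_seconds_count{$job}[5m]))
      variables:
        - name: job
          # Assumed fallback when the query supplies no job filter.
          default_prometheus_selector: job=~".*"
```

Querying slo:sli_error:ratio_rate5m{job="default/productservice-servicemonitor/0"} against this definition would be expected to reproduce the scope of the original recording rule, without the static labels that the rule attached.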