v2 Alerts

Manage v2 alerts with Chronoctl

If you use the deprecated v2 version of alerts, you can create and manage those alerts only with Chronoctl v0.x.x. For installation instructions, see Download and install a v0.x.x version.

⚠️ Alerts created with v2 definitions are deprecated, and Chronoctl v1.0.0 removes support for creating and managing them. For a supported alerting mechanism, create and use monitors.

Manage alerts and alert groups

All alerts must belong to an alert group. An alert group includes the name of the group, an escalation policy, and a list of alert rules. Each alert rule includes the alert's name, the PromQL expression to evaluate, the thresholds that trigger it, and other settings such as annotations.

# bucket_slug assigns the list of alert rules to a specific bucket.
bucket_slug: edge-proxy
# alerts is a list of alert rules.
alerts:
  # rules contains the list of individual alerts.
  rules:
    # expr is the PromQL expression used in the query.
    - expr: sum(rate(fetch_errors{code="4XX"}[1m]))
      name: Errors per second
      # sustain is how long the condition must hold before
      # the alert enters a firing state.
      sustain: 180s
      # thresholds define the warning and critical thresholds for the alert.
      thresholds:
        # name can be either `warn` or `critical`. Examples of each follow this line.
        - name: warn
          # op can be either GEQ, GT, LEQ, or LT.
          op: GEQ
          value: 0.001
          # sustain is how long the condition must hold before
          # the alert enters a firing state.
          sustain: 60s
        - name: critical
          op: GEQ
          value: 1
          sustain: 180s
      # annotations specify a set of informational labels that can be
      # used to store longer additional information such as alert descriptions
      # or runbook links. The annotation values can be templated.
      annotations:
        - name: description
          # $value and $labels come from the metric returned by the alerting
          # engine.
          value: |-
            Fetch error threshold is being exceeded
              VALUE = {{ $value }}
              LABELS: {{ $labels }}
        - name: summary
          value: Fetch errors (instance {{ $labels.instance }})

Define escalation policies and thresholds

Each alert's policy includes the escalation policy definitions used to determine which notifiers to trigger when the alert fires. An escalation policy can have multiple rules, and each rule contains a set of notifiers and a list of matchers to compare against a given alert. Define the notifiers themselves in the notifiers section; a sketch of that stanza follows the policy example below.

Use matchers to determine which notifiers to trigger based on the thresholds defined on the alert. For example, the previous alert defines warn and critical thresholds. In the following policy snippet, two rules, one for critical and one for warn, map to those thresholds.

# policy defines the alerting policies.
# Each policy can have multiple notifiers.
policy:
  - rules:
      # notifiers must be one of the notifiers defined in the `notifiers` stanza.
      - notifiers:
          - test-pagerduty
        # matchers define what labels to associate with the notifier.
        matchers:
          # type is always EXACT_MATCHER_TYPE.
          - type: EXACT_MATCHER_TYPE
            # name is always `severity`.
            name: severity
            # value is either `warn` or `critical`.
            value: critical
      - notifiers:
          - test-email
        matchers:
          - type: EXACT_MATCHER_TYPE
            name: severity
            value: warn
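
The policy rules above reference notifiers named test-pagerduty and test-email. The following is a minimal sketch of what the corresponding notifiers stanza might look like; the nested fields (pagerduty, email, service_key, to) are illustrative assumptions rather than confirmed v2 syntax, so check the notifiers documentation for the exact schema.

# notifiers define the notification destinations that policy rules reference by name.
notifiers:
  - name: test-pagerduty
    # The pagerduty block and its service_key field are placeholders for illustration.
    pagerduty:
      service_key: <your-pagerduty-integration-key>
  - name: test-email
    # The email block and its to field are placeholders for illustration.
    email:
      to: oncall@example.com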

Define downtime periods

You can set a downtime to silence alerts for a specified period, for example during scheduled maintenance windows or when deploying new services. To define a downtime, provide a list of labels and values that correspond to either alert or metric labels.

For example:

  • Downtime all alerts for customer-a in a particular region:

    customer: customer-a
    region: us-east-1
  • Downtime a specific alert in a particular namespace:

    alertname: Container Memory Utilization High
    namespace: kube-system
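
The following is a minimal sketch of how these label matchers might fit into a downtime definition; the surrounding fields (name, starts_at, ends_at) and the matchers key are illustrative assumptions, not confirmed v2 syntax.

# A sketch of a downtime entry. Field names other than the label matchers are assumptions.
name: customer-a-maintenance
# starts_at and ends_at bound the silence window (placeholder timestamps).
starts_at: "2024-01-01T00:00:00Z"
ends_at: "2024-01-01T04:00:00Z"
# Any alert whose labels match all of these values is silenced for the window.
matchers:
  customer: customer-a
  region: us-east-1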

Schedule alert evaluation periods

You can specify the hours or days in which Chronosphere evaluates an alert by setting an alert schedule. You can set schedules only by using Chronoctl or the Terraform provider. If you don't set a schedule, the alert is always active.

This Chronoctl alert definition example includes a schedule that evaluates the alert only between 12:00 and 24:00 UTC on weekdays:

  - name: Infra example alert
    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) by (app, grpc_service, grpc_method)
    schedule:
      timezone: UTC
      weekly_schedule:
        monday:
          ranges:
            - end: "24:00"
              start: "12:00"
        tuesday:
          ranges:
            - end: "24:00"
              start: "12:00"
        wednesday:
          ranges:
            - end: "24:00"
              start: "12:00"
        thursday:
          ranges:
            - end: "24:00"
              start: "12:00"
        friday:
          ranges:
            - end: "24:00"
              start: "12:00"