Manage v2 alerts with Chronoctl
If you use deprecated v2 alerts, you can create and manage them only with Chronoctl v0.x.x. See Download and install a v0.x.x version.
Alerts created with v2 definitions are deprecated, and Chronoctl v1.0.0 removes support for creating and managing them. For a supported alerting mechanism, create and use monitors.
Manage alerts and alert groups
All alerts must belong to an alert group. An alert group includes the name of the group, an escalation policy, and a list of alert rules. Each alert includes its name, the PromQL expression to evaluate, its thresholds, and additional items such as a sustain period and annotations.
# bucket_slug assigns the list of alert rules to a specific bucket.
bucket_slug: edge-proxy
# alerts is a list of alert rules.
alerts:
  # rules contain the list of individual alerts.
  rules:
    # expr is the PromQL expression used in the query.
    - expr: sum(rate(fetch_errors{code="4XX"}[1m]))
      name: Errors per second
      # sustain is the period that needs to be met before
      # the alert gets into a firing state.
      sustain: 180s
      # thresholds define the thresholds for the alerts.
      thresholds:
        # name can be either `warn` or `critical`. Examples of each follow this line.
        - name: warn
          # op can be either GEQ, GT, LEQ, or LT.
          op: GEQ
          value: 0.001
          # sustain is the period that needs to be met before
          # the alert gets into a firing state.
          sustain: 60s
        - name: critical
          op: GEQ
          value: 1
          sustain: 180s
      # annotations specify a set of informational labels that can be
      # used to store longer additional information such as alert descriptions
      # or runbook links. The annotation values can be templated.
      annotations:
        - name: description
          # $value and $labels come from the metric that is returned in the alerting
          # engine.
          value: |-
            Fetch errors is being exceeded
            VALUE = {{ $value }}
            LABELS: {{ $labels }}
        - name: summary
          value: Fetch errors (instance {{ $labels.instance }})
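To make the threshold operators concrete, the following Python sketch evaluates an observed value against the warn and critical thresholds from the example alert. It is an illustration only, not Chronosphere's implementation: the names OPS, THRESHOLDS, and breached are hypothetical, the sketch assumes each op compares the observed value to the threshold value (for example, GEQ means observed >= value), and it ignores the sustain period entirely.

```python
import operator

# Comparison operators named in the `op` field of a threshold.
# Assumption: each op compares the observed value against the threshold value.
OPS = {
    "GEQ": operator.ge,
    "GT": operator.gt,
    "LEQ": operator.le,
    "LT": operator.lt,
}

# Thresholds copied from the example alert definition (sustain omitted).
THRESHOLDS = [
    {"name": "warn", "op": "GEQ", "value": 0.001},
    {"name": "critical", "op": "GEQ", "value": 1},
]


def breached(thresholds, observed):
    """Return the names of all thresholds whose condition holds for `observed`."""
    return [t["name"] for t in thresholds if OPS[t["op"]](observed, t["value"])]


print(breached(THRESHOLDS, 0.0005))  # no threshold is met
print(breached(THRESHOLDS, 0.5))     # only warn is met
print(breached(THRESHOLDS, 2))       # both warn and critical are met
```

With these example values, an error rate of 0.5 per second meets only the warn threshold, while a rate of 2 meets both. How the alerting engine chooses between several met thresholds is not described here, so the sketch simply reports all of them.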
Define escalation policies and thresholds
Each alert's policy includes escalation policy definitions to use when determining the notifiers to trigger if an alert fires. Escalation policies can have multiple rules, and each rule contains a set of notifiers and a list of matchers against which to compare a given alert. Use the notifiers section to define your notifiers.
Use matchers to determine which notifiers to trigger based on the thresholds defined on the alert. For example, the previous alert defines a warn threshold and a critical threshold. The following policy snippet contains two rules, one for critical and one for warn, that map to those thresholds.
# policy defines the different alerting policies.
# Each policy can have multiple notifiers.
policy:
  - rules:
      # notifiers must be one of the notifiers defined in the `notifiers` stanza.
      - notifiers:
          - test-pagerduty
        # matchers define what labels to associate with the notifier.
        matchers:
          # type is always EXACT_MATCHER_TYPE.
          - type: EXACT_MATCHER_TYPE
            # name is always `severity`.
            name: severity
            # value is either `warn` or `critical`.
            value: critical
      - notifiers:
          - test-email
        matchers:
          - type: EXACT_MATCHER_TYPE
            name: severity
            value: warn
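The mapping from a fired threshold to notifiers can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the product's routing code: RULES and notifiers_for are hypothetical names, and the sketch assumes a rule applies when every one of its exact matchers equals the corresponding label on the alert.

```python
# Rules copied from the example policy: critical pages, warn sends email.
RULES = [
    {
        "notifiers": ["test-pagerduty"],
        "matchers": [
            {"type": "EXACT_MATCHER_TYPE", "name": "severity", "value": "critical"}
        ],
    },
    {
        "notifiers": ["test-email"],
        "matchers": [
            {"type": "EXACT_MATCHER_TYPE", "name": "severity", "value": "warn"}
        ],
    },
]


def notifiers_for(rules, labels):
    """Collect notifiers from each rule whose matchers all equal the alert labels."""
    selected = []
    for rule in rules:
        # Assumption: every matcher in a rule must match exactly for the rule to apply.
        if all(labels.get(m["name"]) == m["value"] for m in rule["matchers"]):
            selected.extend(rule["notifiers"])
    return selected


print(notifiers_for(RULES, {"severity": "critical"}))
print(notifiers_for(RULES, {"severity": "warn"}))
```

In this sketch, an alert that fires at the critical threshold selects test-pagerduty, and one that fires at warn selects test-email.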
Define downtime periods
You can set a downtime to silence alerts for a specified period of time, for example during scheduled maintenance windows or when deploying new services. To define a downtime, provide a list of labels and values that correspond to either alert or metric labels.
For example:
- Set a downtime for all alerts for customer-a in a particular region:
    customer: customer-a
    region: us-east-1
- Set a downtime for a specific alert in a particular namespace:
    alertname: Container Memory Utilization High
    namespace: kube-system
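As a rough mental model of how a downtime's labels select alerts, the following Python sketch treats a downtime as a set of label and value pairs that must all be present on the alert. The function name is_silenced is hypothetical, and the all-labels-must-match behavior is an assumption made for illustration; the exact matching rules are not spelled out here.

```python
def is_silenced(downtime_labels, alert_labels):
    """Return True when every downtime label is present on the alert with the same value.

    Assumption: labels are combined with AND, and an empty downtime matches nothing.
    """
    if not downtime_labels:
        return False
    return all(alert_labels.get(k) == v for k, v in downtime_labels.items())


# Downtime from the first example: all alerts for customer-a in us-east-1.
downtime = {"customer": "customer-a", "region": "us-east-1"}

in_scope = {"alertname": "Errors per second", "customer": "customer-a", "region": "us-east-1"}
out_of_scope = {"alertname": "Errors per second", "customer": "customer-a", "region": "eu-west-1"}

print(is_silenced(downtime, in_scope))      # True: both labels match
print(is_silenced(downtime, out_of_scope))  # False: region differs
```

The alertname and eu-west-1 values in the sample alerts are made up for the demonstration.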
Schedule alert evaluation periods
You can specify the hours or days in which Chronosphere evaluates an alert by setting an alert schedule. You can set schedules only by using Chronoctl or the Terraform provider. If you don't set a schedule, the alert monitor is always active.
This Chronoctl alert definition example includes a schedule that runs the alert monitor only between 12:00 and 24:00 UTC every weekday:
- name: Infra example alert
  expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) by (app, grpc_service, grpc_method)
  schedule:
    timezone: UTC
    weekly_schedule:
      monday:
        ranges:
          - end: "24:00"
            start: "12:00"
      tuesday:
        ranges:
          - end: "24:00"
            start: "12:00"
      wednesday:
        ranges:
          - end: "24:00"
            start: "12:00"
      thursday:
        ranges:
          - end: "24:00"
            start: "12:00"
      friday:
        ranges:
          - end: "24:00"
            start: "12:00"
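To check your reading of a weekly schedule, the following Python sketch tests whether a given moment falls inside the weekday 12:00 to 24:00 UTC ranges from the example. It is an illustration, not how Chronosphere evaluates schedules: SCHEDULE, DAYS, to_minutes, and is_active are hypothetical names, the sketch handles only the UTC timezone used in the example, and it assumes each range includes its start time and excludes its end time.

```python
from datetime import datetime, timezone

# Day names indexed by datetime.weekday(), where Monday is 0.
DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

# Ranges copied from the example schedule: weekdays only, 12:00 to 24:00.
SCHEDULE = {day: [("12:00", "24:00")] for day in DAYS[:5]}


def to_minutes(hhmm):
    """Convert an "HH:MM" string to minutes since midnight ("24:00" becomes 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def is_active(schedule, moment):
    """Return True when the timezone-aware `moment` falls in a range for its UTC weekday."""
    utc = moment.astimezone(timezone.utc)
    now = utc.hour * 60 + utc.minute
    ranges = schedule.get(DAYS[utc.weekday()], [])
    # Assumption: start is inclusive and end is exclusive.
    return any(to_minutes(start) <= now < to_minutes(end) for start, end in ranges)


# 2024-01-01 is a Monday; 2024-01-06 is a Saturday.
print(is_active(SCHEDULE, datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)))   # True
print(is_active(SCHEDULE, datetime(2024, 1, 1, 11, 59, tzinfo=timezone.utc)))  # False
print(is_active(SCHEDULE, datetime(2024, 1, 6, 13, 0, tzinfo=timezone.utc)))   # False
```

Days that are absent from the schedule, such as Saturday and Sunday in this example, have no ranges, so the sketch treats the monitor as inactive on those days.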