View and create monitors
One of the reasons to ingest and store time series data is to know when data meets or doesn't meet certain criteria. Use Chronosphere Observability Platform alerting to generate alerts and notifications from data, whether it's about your system or about your usage of Observability Platform itself. Compare your monitor configurations to historical data to ensure your thresholds meet your needs.
For an overview of the Observability Platform approach to alerting, see the Chronosphere blog article Introducing Monitors: A better way to manage and interact with alerts.
View available monitors
Select from the following methods to view and filter monitors.
To query and get detailed information about monitors, see Monitor details.
To display a list of defined monitors, in the navigation menu select Alerts > Monitors.
The list of monitors displays the status for each monitor next to its title:
Status | Description |
---|---|
Critical | Currently alerting monitor that exceeds the defined critical conditions. |
Warning | Currently alerting monitor that exceeds the defined warning conditions. |
Muted | Monitor that's currently muted by an active muting rule. |
Passing | Monitor that's passing and not generating alerts. |
You can filter your monitors using the following methods:
- Using the Search monitors search box (an OR filter).
- By team, using the Select a team dropdown.
- By owner, using the Select an owner dropdown. An icon next to each owner indicates whether the monitor is part of a collection or a service.
- By notification policy, using the Select a notification policy dropdown.
- By error status.
Monitors with defined signals display the file tree icon. To view the signals from a displayed monitor, click the name of the monitor from the list.
From a monitor's detail page, you can click the name of a signal from the Signals section to filter the query results to alerts only from that signal.
To search for a specific monitor:
- Click the search bar to focus on it, or use the keyboard shortcut Control+K (Command+K on macOS).
- Begin typing any part of the monitor's name.
- Optional: Click the filters for all other listed resource types at the top of the search results to remove them and display only monitors.
- Click the desired search result, or use the arrow keys to select it and press enter, to go to that monitor.
Create a monitor
Select from the following methods to create monitors. Most monitors alert when a value matches a specific condition, such as when an error condition defined by the query lasts longer than one minute.
You can also choose to alert when a value doesn't exist, such as when a host stops sending metrics and is likely unavailable. This condition triggers only if the entire monitor query returns no results. For example, to alert on missing or no data, add a `NOT_EXISTS` series condition in the `series_conditions` section of the monitor definition:

```yaml
series_conditions:
  defaults:
    critical:
      conditions:
        - op: NOT_EXISTS
          sustain: 60s
```
To receive alerts when a host stops sending metrics, create a separate monitor for each host and scope the monitor query to that host.
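If you manage monitors as code, a per-host monitor of this kind is short. The following Terraform sketch uses the `chronosphere_monitor` schema shown in the Terraform example later on this page; the `up` metric, the `host` label, and the collection reference are examples only, so substitute names that exist in your own data:

```hcl
# Sketch of a host-down monitor. The metric, label, and collection reference
# are placeholders; adjust them to match your environment.
resource "chronosphere_monitor" "host_web_01_down" {
  name          = "Host web-01 stopped reporting"
  collection_id = chronosphere_collection.infra.id

  query {
    # Scope the query to a single host so the monitor returns no results
    # only when that host stops sending metrics.
    prometheus_expr = "up{host=\"web-01\"}"
  }

  series_conditions {
    condition {
      severity = "critical"
      # NOT_EXISTS triggers when the entire query returns no data.
      # For NOT_EXISTS, the value can be omitted.
      op      = "NOT_EXISTS"
      sustain = "60s"
    }
  }
}
```

Repeat the resource for each host you want to watch, for example with Terraform `for_each` over a list of hosts.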
Prerequisites
Before creating a monitor, complete the following tasks:
- Create a notifier to define where to deliver alerts and who to notify.
- Create a notification policy to determine how to route notifications to notifiers based on signals that trigger from your monitor. You select the notifier you created for the critical or warning conditions on the notification policy.
You can then use any of the following methods to create a new monitor.
When creating or editing a monitor in the Web app, you can simulate and test alerts to see how an alert would have performed against historical data. Use backtesting to review how your alert would have performed if it had been defined in the past.
To add a new monitor:
- In the navigation menu select one of these locations:
  - Alerts > Monitors.
  - Platform > Collections, and then select the collection you want to create a monitor for. This can be a standard collection or a service.
- Create the monitor:
  - From the Monitors page, click Create monitor.
  - From the Collections page, in the Monitors panel, click + Add.
- Enter the information for the monitor based on its data model.
- Select an Owner to organize and filter your monitor. You can select a collection or a service.
- Enter a Monitor Name.
- Choose a Notification Policy to determine which notification policy to use at a particular alert severity.
- Enter Labels as key/value pairs to categorize and filter monitors.
- In the Query section, enter a valid Prometheus or Graphite query.
- Click Check Query to validate your query and preview query results.
- Click Open in Explorer to open your query in Metrics Explorer, where you can review your query for syntax errors and make necessary changes.
- For Prometheus queries, click Edit in Query Builder to open your query in the Query Builder, where you can construct, optimize, and debug your query before saving it. After modifying your query, click Done to return to the Add Monitor page.
- In the preview, toggle Show thresholds to display the monitor's defined thresholds.

  This feature isn't available to all Chronosphere Observability Platform users and might not be visible in your app. For information about enabling this feature in your environment, contact Chronosphere Support.

  Prometheus users can test monitor conditions by reviewing when a monitor would have triggered based on historical data. The preview reflects existing monitor schedules, signal grouping, and overrides.

  You must define at least one condition for alert simulations to work. Toggle Simulate alerts to backtest your condition against existing data.

  Use the Show alert duration toggle to display the time period over which the alert would have been active.

  If your selected time period has too many alerts, or the entire graph appears to display in alerted status, reduce the selected time period. If multiple alerts would have fired simultaneously, only one threshold marker displays. The banner shows the correct number of alerts. For example, if a critical and a warning would fire at the same time, only one alert displays on the graph. The banner shows two alerts would have fired.

  If your selected query returns too much data, the graph displays an error. Chronosphere recommends selecting shorter time periods for testing, when possible. Alert simulation isn't available outside the raw data retention period.

  Select a time range up to the present in the time picker. Alert simulations use existing data, and can't project future alerts.

- Optional: Group alerts based on the results returned from the query by choosing an option in the Signals section.

  If you select per signal (multiple alerts) to generate multiple alerts, enter a label key that differs in name and casing from the label you enter in the Key field in the Labels section. For example, if you enter `environment` in the Key field, you might use `Environments` as the Label Key to match on. Pinned scopes can be used as a Label Key.

- Define a condition and sustain period (duration of time) in the Conditions section, and assign the resulting alert a severity (warning or critical). In the Sustain field, enter a value followed by an abbreviated unit such as `60s`. Valid units are `s` (seconds), `m` (minutes), `h` (hours), or `d` (days). The dialog also displays the notifiers associated with the monitor for reference.

  To alert on missing or no data, select not exists in the Alert when value dropdown.

- In the Resolve field, enter a time period for the resolve window as a value followed by an abbreviated unit such as `30s`. Valid units are `s` (seconds), `m` (minutes), `h` (hours), or `d` (days).
- Add notes for the monitor in the Annotations section, such as runbooks, links to related dashboards, data links to related traces, and documentation links.
- Click Save.
Chronosphere recommends a query interval of at least 15 seconds. There can be a ten-second delay between an alert triggering and the notifier activating.
Chronoctl monitor example
The following YAML definition consists of one monitor named `Disk Getting Full`. The `series_conditions` trigger a warning notification when the disk is 80% full for more than 300 seconds, and a critical notification when the disk is 90% full for more than 300 seconds. It groups series into signals based on the `source` and `service_environment` label keys.

The `schedule` section indicates that this monitor runs each week on Mondays from 7:00 to 10:10 and 15:00 to 22:30, and Thursdays from 21:15 through the end of the day. All times are in UTC.

If you define `label_names` in the `signal_grouping` section, enter a label name that differs in name and casing from the label you enter in the `labels` section. For example, if you enter `environment` as a key in the `labels` section, you might use `Environments` in the `label_names` section.
```yaml
api_version: v1/config
kind: Monitor
spec:
  # Required name of the monitor. Can be modified after the monitor is created.
  name: Disk Getting Full
  # PromQL query. If set, you can't set graphite_query.
  prometheus_query: max(disk:last{measurement="used_percent"}) by (source, service_environment, region)
  # Annotations are visible in notifications generated by this monitor.
  # You can template annotations with labels from notifications.
  annotations:
    key_1: "{{ $labels.job }}"
  # Slug of the collection the monitor belongs to.
  collection_slug: loadgen
  # Optional setting for configuring how often alerts are evaluated.
  # Defaults to 60 seconds.
  interval_secs: 60
  # Labels are visible in notifications generated by this monitor,
  # and can be used to route alerts with notification overrides.
  labels:
    key_1: kubernetes_cluster
  # Optional notification policy used to route alerts generated by the monitor.
  notification_policy_slug: custom-notification-policy
  schedule:
    # The timezone of the time ranges.
    timezone: UTC
    weekly_schedule:
      monday:
        active: ONLY_DURING_RANGES
        # The time ranges that the monitor is active on this day. Required if
        # active is set to ONLY_DURING_RANGES.
        ranges:
          - # Start time in the format "<hour>:<minute>", such as "15:30".
            start_hh_mm: "07:00"
            # End time in the format "<hour>:<minute>", such as "15:30".
            end_hh_mm: "10:10"
          - start_hh_mm: "15:00"
            end_hh_mm: "22:30"
      tuesday:
        active: NEVER
      wednesday:
        active: NEVER
      thursday:
        active: ONLY_DURING_RANGES
        # The time ranges that the monitor is active on this day. Required if
        # active is set to ONLY_DURING_RANGES.
        ranges:
          - # Start time in the format "<hour>:<minute>", such as "15:30".
            start_hh_mm: "21:15"
            # End time in the format "<hour>:<minute>", such as "15:30".
            end_hh_mm: "24:00"
      friday:
        active: NEVER
      saturday:
        active: NEVER
      sunday:
        active: NEVER
  # Conditions evaluated against each queried series to determine the severity of each series.
  series_conditions:
    defaults:
      critical:
        # List of conditions to evaluate against a series.
        # Only one condition needs to match to assign a severity to a signal.
        conditions:
          # To alert on missing or no data, change the value for `op` to `NOT_EXISTS`.
          - op: GT
            # How long the op operation needs to evaluate for the condition
            # to evaluate to true.
            sustain_secs: 300
            # The value to compare to the metric value using the op operation.
            value: 90
            # How long the operation needs to evaluate false to resolve.
            resolve_sustain: 60
      warn:
        # List of conditions to evaluate against a series.
        # Only one condition needs to match to assign a severity to a signal.
        conditions:
          - op: GT
            # How long the op operation needs to evaluate for the condition
            # to evaluate to true.
            sustain_secs: 300
            # The value to compare to the metric value using the op operation.
            value: 80
            # How long the operation needs to evaluate false to resolve.
            resolve_sustain: 60
  # Defines how the set of series from the query are split into signals.
  signal_grouping:
    label_names:
      - source
      - service_environment
    # If true, each series will have its own signal and label_names can't be set.
    signal_per_series: false
```
Terraform monitor example
The following Terraform resource creates a monitor that Terraform refers to by `infra`, with a human-readable name of `Infra Example monitor`.

The `schedule` section runs this monitor each week on Mondays from 7:00 to 10:10 and 15:00 to 22:30, and Thursdays from 21:15 through the end of the day. All times are UTC, and Observability Platform won't run this monitor during the rest of the week.

If you define `label_names` in the `signal_grouping` section, enter a label name that differs in name and casing from the label you enter in the `labels` section. For example, if you enter `environment` as a key in the `labels` section, you might use `Environments` in the `label_names` section.
resource "chronosphere_monitor" "infra" {
name = "Infra Example monitor"
# Reference to the collection the alert belongs to.
collection_id = chronosphere_collection.infra.id
# Override the notification policy.
# By default, uses the policy from the collection_id.
notification_policy_id = chronosphere_collection.infra_testing.id
# Arbitrary set of labels to assign to the alert.
labels = {
"priority" = "sev-1"
}
# Arbitrary set of annotations to include in alert notifications.
annotations = {
"runbook" = "http://default-runbook"
}
# Interval at which to evaluate the monitor, for example 15s, 30s, or 60s.
# Defaults to 60s.
interval = "30s"
query {
# PromQL query to evaluate for the alert.
# Alternatively, you can use graphite_expr instead.
prometheus_expr = "sum (rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) by (app, grpc_service, grpc_method)"
}
# The remaining examples are optional signals specifying how to group the
# series returned from the query.
# No signal_grouping clause = Per monitor
# signal_grouping with label_names set = Per signal for labels set
# signal_grouping with signal_per_series set to true = Per series
signal_grouping {
# Set of labels names used to split series into signals.
# Each unique combination of labels results in its own signal.
label_names = ["app", "grpc_service"]
# As an alternative to label_names, signal_per_series creates an alert for
# every resulting series from the query.
# signal_per_series = true
}
# Container for the conditions determining the severity of each series from the query.
# The highest severity series of a signal determines that signal's severity.
series_conditions {
# Condition assigning a warn threshold for series above a certain threshold.
condition {
# Severity of the condition, which can be "warn" or "critical".
severity = "warn"
# Value to compare against each series from the query result.
# For EXISTS or NOT_EXISTS operators, value must be set to zero or may be omitted.
value = 5.0
# Operator to use when comparing the query result versus the threshold.
# Valid values can be one of GT, LT, LEQ, GEQ, EQ, NEQ, EXISTS, NOT_EXISTS.
op = "GT"
# Amount of time the query needs to fail the condition check before
# an alert is triggered. Must be an integer. Accepts one of s (seconds), m
# (minutes), or h (hours) as units. Optional.
sustain = "240s"
# Amount of time the query needs to no longer fire before resolving. Must be
# an integer. Accepts one of s (seconds), m (minutes), or h (hours) as units.
resolve_sustain = "60s"
}
condition {
severity = "critical"
value = 10.0
op = "GT"
sustain = "120s"
resolve_sustain = "60s"
}
# Multiple optional overrides can be defined for different sets of conditions
# to series with matching labels.
override {
# One or more matchers for labels on a series.
label_matcher {
# Name of the label
name = "app"
# How to match the label, which can be "EXACT_MATCHER_TYPE" or
# "REGEXP_MATCHER_TYPE".
type = "EXACT_MATCHER_TYPE"
# Value of the label.
value = "dbmon"
}
condition {
severity = "critical"
value = 1.0
op = "GT"
sustain = "60s"
}
}
}
# If you define a schedule, Observability Platform evaluates the monitor only during
# the specified time ranges. The monitor is inactive during all unspecified
# time ranges.
# If you define an empty schedule, Observability Platform never evaluates the monitor.
schedule {
# Valid values: Any IANA timezone string
timezone = "UTC"
range {
# Time range for the monitor schedule. Valid values for day can be full
# day names, such as "Sunday" or "Monday".
# Valid time values must be specified in the range of 00:00 to 24:00.
day = "Monday"
start = "07:00"
end = "10:10"
}
range {
day = "Monday"
start = "15:00"
end = "22:30"
}
range {
day = "Thursday"
start = "21:15"
end = "24:00"
}
}
}
Edit a monitor
Select from the following methods to edit monitors.
Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.
To edit a monitor:
- In the navigation menu select Alerts > Monitors.
- Click the name of the monitor you want to edit.
- To the right of the monitor's name, click the three vertical dots icon and select Edit monitor. This opens a sidebar where you can edit the monitor's properties.
- Make your edits, and then click Save. Refer to the monitor data model for specific definitions.
Use the Code Config tool
When adding or editing a monitor, you can click the Code Config tab to view code representations of the monitor for Terraform, Chronoctl, and the Chronosphere API. The displayed code responds to changes you make in the Visual Editor tab: modify monitor properties there, then click the Code Config tab to immediately see the updated code expressed as Terraform resources, Chronoctl YAML, or API-compatible JSON.
Changes you make in the Visual Editor tab don't take effect until you click Save or apply the code representations using their corresponding tools.
If you manage a monitor using Terraform, you must use Terraform to apply any changes.
Change code representation
To change the code representation format:
- Click the Code Config tab.
- Click the format dropdown. The dropdown defaults to the format of the tool that manages the resource, such as Terraform.
- Select the desired format. You can then take several additional actions:
  - To copy the viewed code representation to your clipboard, click Copy.
  - To save the viewed code representation to a file, click Download.
  - To view a diff of unsaved changes you've made to a monitor, click View Diff. This button is available only if you've changed the monitor in the Visual Editor tab but haven't saved your changes.

    This Git-style diff of changes replaces the Copy and Download buttons with a toggle between Unified and Split diff styles, and the View Diff button with a Hide Diff button that returns you to the code representation view. You can also view unchanged lines in the context of the diff by clicking Expand X lines... inside the diff.
Override a monitor alert
You can override the default conditions that define when an alert triggers for a monitor. This override is similar to overriding a notification policy that routes a notification to a notifier other than the specified default.
On a monitor, you can specify a condition override to use a separate threshold for certain series. For example, a monitor might have a default threshold of `>100`, but you specify an override threshold of `>50` where the label key/value pair is `cluster=production`.
You can specify any label as a matcher for a monitor condition override. If no override matches the defined conditions, Observability Platform applies the default conditions. Additionally:
- Overrides must specify at least one matcher, and a series must match every matcher for the override to apply.
- Observability Platform evaluates overrides in the listed order. When an override matches, the remaining overrides and defaults are ignored.
- Overrides don't inherit any properties from the default conditions. For example, if the default policy route specifies `warn` and `critical` notifiers but the override specifies only `critical` notifiers, the notifier doesn't send `warn` notifications.
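In configuration as code, the same behavior uses the `override` block shown in the Terraform example earlier on this page. The following sketch mirrors the `>100` default and `>50` override for `cluster=production` described above; the monitor name, query, and collection reference are placeholders:

```hcl
# Sketch only: substitute your own query and collection.
resource "chronosphere_monitor" "queue_depth" {
  name          = "Queue depth too high"
  collection_id = chronosphere_collection.infra.id

  query {
    # Hypothetical gauge metric, grouped by the cluster label used below.
    prometheus_expr = "max(queue_depth) by (cluster)"
  }

  series_conditions {
    # Default condition: alert when the value exceeds 100.
    condition {
      severity = "critical"
      op       = "GT"
      value    = 100.0
      sustain  = "60s"
    }

    # Override: series labeled cluster="production" use the stricter threshold of 50.
    # Overrides are evaluated in order, and the first matching override applies
    # instead of the defaults.
    override {
      label_matcher {
        name  = "cluster"
        type  = "EXACT_MATCHER_TYPE"
        value = "production"
      }
      condition {
        severity = "critical"
        op       = "GT"
        value    = 50.0
        sustain  = "60s"
      }
    }
  }
}
```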
Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.
To specify a monitor alert override:
- In the navigation menu select Alerts > Monitors.
- Click the name of the monitor you want to specify an override for.
- To the right of the monitor's name, click the three vertical dots icon and select Edit monitor. This opens a sidebar where you can edit the monitor's properties.
- In the Condition Override section, click the plus icon to display the override fields.
- Select Exact or Regex as the matcher type, and enter the key/value pair to match on for the override.
- Select Critical or Warn as the override severity.
- Define the match condition, and enter a value and sustain duration.
- Click Save to apply the override changes.
Delete a monitor
Select from the following methods to delete monitors.
Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.
To delete a monitor:
- In the navigation menu select Alerts > Monitors.
- Click the name of the monitor you want to delete.
- To the right of the monitor's name, click the three vertical dots icon and select Edit monitor.
- In the Edit Monitor dialog, click the three vertical dots icon and select Delete.
Use annotations with monitors
Create annotations for monitors that link to dashboards, runbooks, related documents, and trace metrics, which lets you provide direct links for your on-call engineers to help diagnose issues.
You can reference Prometheus Alertmanager variables in annotations with the `{{ .VARIABLE_NAME }}` syntax. Annotations can access monitor labels by using variables with the `{{ .CommonLabels.LABEL }}` pattern, and labels from the alerting metric with the `{{ .Labels.LABEL }}` pattern. In both patterns, replace `LABEL` with the label's name.
To reference labels in Alertmanager variables, you must include those labels in the alerting time series. Otherwise, the resulting notifier won't display any information for the variables you specify.
The following examples include annotations with variables based on a template. See the Alertmanager documentation for a reference list of alerting variables and templating functions.
To add annotations to a monitor:
- In the Annotations section, add a description for your annotation in the Key field, and text or links in the Value field. For example, you might add the following key/value pairs as annotations:

  Key | Value |
  ---|---|
  summary | Instance `{{ $labels.instance }}` is down |
  description | Container `{{ $labels.namespace }}`/`{{ $labels.pod }}`/`{{ $labels.container }}` terminated with `{{ $labels.reason }}`. |
  runbook | http://default-runbook |
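If you manage the monitor with Terraform, the same key/value pairs go in the `annotations` map on the monitor resource. The following sketch reproduces the annotations from the preceding table; the monitor name, query, condition, and collection reference are examples only:

```hcl
# Sketch only: the query and threshold are placeholders.
resource "chronosphere_monitor" "terminated_containers" {
  name          = "Containers terminating"
  collection_id = chronosphere_collection.infra.id

  # Annotations appear in the notifications this monitor generates and can
  # template values from the alerting series' labels.
  annotations = {
    "summary"     = "Instance {{ $labels.instance }} is down"
    "description" = "Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} terminated with {{ $labels.reason }}."
    "runbook"     = "http://default-runbook"
  }

  query {
    # Example kube-state-metrics query; the labels it returns must match the
    # labels referenced in the annotations above.
    prometheus_expr = "sum(kube_pod_container_status_terminated_reason) by (namespace, pod, container, reason, instance)"
  }

  series_conditions {
    condition {
      severity = "warn"
      op       = "GT"
      value    = 0.0
      sustain  = "60s"
    }
  }
}
```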
Connect monitors to services or collections
Services and collections can own monitors, but you can also connect a monitor to any service or collection.
- In the navigation menu select Alerts > Monitors.
- Click the name of the monitor you want to edit.
- To the right of the monitor's name, click the three vertical dots icon and select Manage connections.
- Add or remove a connection.