Monitors

View and create monitors

One of the reasons to ingest and store time series data is to know when data meets or doesn't meet certain criteria. Use Chronosphere Observability Platform alerting to generate alerts and notifications from data, whether it's about your system or about your usage of Observability Platform itself.

For an overview of the Observability Platform approach to alerting, see the Chronosphere blog article Introducing Monitors: A better way to manage and interact with alerts.

View available monitors

You can view and filter monitors using Observability Platform, Chronoctl, or the Code Config tool.

To query and get detailed information about monitors, see Monitor details.

To display a list of defined monitors, in the navigation menu select Alerts > Monitors.

The list of monitors displays the status for each monitor next to its title:

  • Critical: the monitor is currently alerting because it exceeds the defined critical conditions.
  • Warning: the monitor is currently alerting because it exceeds the defined warning conditions.
  • Muted: the monitor is currently muted by an active muting rule.
  • Passing: the monitor isn't generating alerts.

You can filter your monitors using the following methods:

  • Using the Search monitors search box (an OR filter).
  • By team, using the Select a team dropdown.
  • By owner, using the Select an owner dropdown. An icon next to the owner indicates whether the monitor is part of a collection or a service.
  • By notification policy, using the Select a notification policy dropdown.
  • By error status.

Monitors with defined signals display the file tree icon. To view the signals from a displayed monitor, click the name of the monitor from the list.

From a monitor's detail page, you can click the name of a signal from the Signals section to filter the query results to alerts only from that signal.

To search for a specific monitor:

  1. Click the search bar to focus on it, or use the keyboard shortcut Control+K (Command+K on macOS).
  2. Begin typing any part of the monitor's name.
  3. Optional: Click the filters for all other listed resource types at the top of the search results to remove them and display only monitors.
  4. Click the desired search result, or use the arrow keys to select it and press Enter, to go to that monitor.

Create a monitor

You can create monitors using Observability Platform, Chronoctl, or Terraform. Most monitors alert when a value matches a specific condition, such as when an error condition defined by the query lasts longer than one minute.

You can also choose to alert when a value doesn't exist, such as when a host stops sending metrics and is likely unavailable. This condition triggers only if the entire monitor query returns no results. For example, to alert on missing or no data, add a NOT_EXISTS series condition in the series_conditions section of the monitor definition:

    series_conditions:
      defaults:
        critical:
          conditions:
            - op: NOT_EXISTS
              sustain_secs: 60

To receive alerts when a host stops sending metrics, create a separate monitor for each host and scope the monitor query to that host.
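For example, a per-host availability monitor might pair a host-scoped query with a NOT_EXISTS condition. The following sketch is illustrative only; the metric name and host label are assumptions, not part of the example above:

```promql
# Scoped to a single host. When this host stops reporting, the query
# returns no results and a NOT_EXISTS condition triggers the alert.
up{host="web-01"}
```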

Prerequisites

Before creating a monitor, complete the following tasks:

  1. Create a notifier to define where to deliver alerts and who to notify.
  2. Create a notification policy to determine how to route notifications to notifiers based on signals that trigger from your monitor. You select the notifier you created for the critical or warning conditions on the notification policy.

You can then use any of the following methods to create a new monitor.

To add a new monitor:

  1. In the navigation menu select one of these locations:

    • Alerts > Monitors.
    • Platform > Collections, and then select the collection you want to create a monitor for. This can be a standard collection or a service.
  2. Create the monitor:

    • From the Monitors page, click Create monitor.
    • From the Collections page, in the Monitors panel, click + Add.
  3. Enter the information for the monitor based on its data model.

  4. Select an Owner to organize and filter your monitor. You can select a collection or a service.

  5. Enter a Monitor Name.

  6. Choose a Notification Policy to determine which policy routes notifications at each alert severity.

  7. Enter Labels as key/value pairs to categorize and filter monitors.

  8. Enter a valid Prometheus or Graphite query in the Query section to generate a preview of query results. Use the preview to review your query for syntax errors, and open the query in the Query Builder to construct, optimize, and debug your query before saving it.

  9. Optional: Group alerts based on the results returned from the query by choosing an option in the Signals section.

    If you select per signal (multiple alerts) to generate multiple alerts, enter a label key that differs in name and casing from the label you enter in the Key field in the Labels section. For example, if you enter environment in the Key field, you might use Environments as the Label Key to match on. Global filters can be used as a Label Key.

  10. Define a condition and sustain period (duration of time) in the Conditions section, and assign the resulting alert a severity (warning or critical). In the Sustain field, enter a value followed by an abbreviated unit such as 60s. Valid units are s (seconds), m (minutes), h (hours), or d (days). The dialog also displays the notifiers associated with the monitor for reference.

    To alert on missing or no data, select not exists in the Alert when value dropdown.

  11. In the Resolve field, enter a time period for the resolve window as a value followed by an abbreviated unit such as 30s. Valid units are s (seconds), m (minutes), h (hours), or d (days).

  12. Add notes for the monitor in the Annotations section, such as runbooks, links to related dashboards, data links to related traces, and documentation links.

  13. Click Save.

Chronoctl monitor example

The following YAML definition consists of one monitor named Disk Getting Full. The series_conditions trigger a warning notification when disk usage exceeds 30% for more than 300 seconds, and a critical notification when it exceeds 60% for more than 300 seconds. It groups series into signals based on the source and service_environment label keys.

The schedule section indicates that this monitor runs each week on Mondays from 10:10 to 15:00, and Thursdays from 21:15 through the end of the day. All times are in UTC.

If you define label_names in the signal_grouping section, enter a label name that differs in name and casing from the label you enter in the labels section. For example, if you enter environment as a key in the labels section, you might use Environments in the label_names section.

api_version: v1/config
kind: Monitor
spec:
    # Required name of the monitor. Can be modified after the monitor is created.
    name: Disk Getting Full
    # PromQL query. If set, you can't set graphite_query.
    prometheus_query: max(disk:last{measurement="used_percent"}) by (source, service_environment, region)
    # Annotations are visible in notifications generated by this monitor.
    # You can template annotations with labels from notifications.
    annotations:
        key_1: "{{ $labels.job }}"
    # Slug of the collection the monitor belongs to.
    collection_slug: loadgen
    # Optional setting for configuring how often alerts are evaluated.
    # Defaults to 60 seconds.
    interval_secs: 60
    # Labels are visible in notifications generated by this monitor,
    # and can be used to route alerts with notification overrides.
    labels:
        key_1: kubernetes_cluster
    # Optional notification policy used to route alerts generated by the monitor.
    notification_policy_slug: custom-notification-policy
    schedule:
        # The timezone of the time ranges.
        timezone: UTC
        weekly_schedule:
            monday:
                active: ONLY_DURING_RANGES
                # The time ranges that the monitor is active on this day. Required if
                # active is set to ONLY_DURING_RANGES.
                ranges:
                    - # End time in the format "<hour>:<minute>", such as "15:30".
                      end_hh_mm: "15:00"
                      # Start time in the format "<hour>:<minute>", such as "15:30".
                      start_hh_mm: "10:10"
            tuesday:
                active: NEVER
            wednesday:
                active: NEVER
            thursday:
                active: ONLY_DURING_RANGES
                # The time ranges that the monitor is active on this day. Required if
                # active is set to ONLY_DURING_RANGES.
                ranges:
                    - # End time in the format "<hour>:<minute>", such as "15:30".
                      end_hh_mm: "24:00"
                      # Start time in the format "<hour>:<minute>", such as "15:30".
                      start_hh_mm: "21:15"
            friday:
                active: NEVER
            saturday:
                active: NEVER
            sunday:
                active: NEVER
    # Conditions evaluated against each queried series to determine the severity of each series.
    series_conditions:
        defaults:
            critical:
                # List of conditions to evaluate against a series.
                # Only one condition must match to assign a severity to a signal.
                conditions:
                    # To alert on missing or no data, change the value for `op` to `NOT_EXISTS`.
                    - op: GT
                      # How long the op operation needs to evaluate for the condition
                      # to evaluate to true.
                      sustain_secs: 300
                      # The value to compare to the metric value using the op operation.
                      value: 60
                      # How long the operation needs to evaluate false to resolve.
                      resolve_sustain: 60
            warn:
                # List of conditions to evaluate against a series.
                # Only one condition must match to assign a severity to a signal.
                conditions:
                    - op: GT
                      # How long the op operation needs to evaluate for the condition
                      # to evaluate to true.
                      sustain_secs: 300
                      # The value to compare to the metric value using the op operation.
                      value: 30
                      # How long the operation needs to evaluate false to resolve.
                      resolve_sustain: 60
    # Defines how the set of series from the query are split into signals.
    signal_grouping:
        label_names:
            - source
            - service_environment
        # If true, each series will have its own signal and label_names can't be set.
        signal_per_series: false

Terraform monitor example

The following Terraform resource creates a monitor with the Terraform resource name infra and a human-readable name of Infra Example monitor.

The schedule section runs this monitor each week on Mondays from 7:00 to 10:10 and 15:00 to 22:30, and Thursdays from 21:15 through the end of the day. All times are UTC, and Observability Platform won't run this monitor during the rest of the week.

If you define label_names in the signal_grouping section, enter a label name that differs in name and casing from the label you enter in the labels section. For example, if you enter environment as a key in the labels section, you might use Environments in the label_names section.

resource "chronosphere_monitor" "infra" {
  name = "Infra Example monitor"
 
  # Reference to the collection the alert belongs to.
  collection_id = chronosphere_collection.infra.id
 
  # Override the notification policy.
  # By default, uses the policy from the collection_id.
  notification_policy_id = chronosphere_collection.infra_testing.id
 
  # Arbitrary set of labels to assign to the alert.
  labels = {
    "priority" = "sev-1"
  }
 
  # Arbitrary set of annotations to include in alert notifications.
  annotations = {
    "runbook" = "http://default-runbook"
  }
 
  # Interval at which to evaluate the monitor, for example 15s, 30s, or 60s.
  # Defaults to 60s.
  interval = "30s"
 
  query {
    # PromQL query to evaluate for the alert.
    # Alternatively, you can use graphite_expr instead.
    prometheus_expr = "sum(rate(grpc_server_handled_total{grpc_code!=\"OK\"}[1m])) by (app, grpc_service, grpc_method)"
  }
 
  # The remaining examples are optional signals specifying how to group the
  # series returned from the query.
 
  # No signal_grouping clause = Per monitor
  # signal_grouping with label_names set = Per signal for labels set
  # signal_grouping with signal_per_series set to true = Per series
 
  signal_grouping {
    # Set of labels names used to split series into signals.
    # Each unique combination of labels results in its own signal.
    label_names = ["app", "grpc_service"]
 
    # As an alternative to label_names, signal_per_series creates an alert for
    # every resulting series from the query.
    # signal_per_series = true
  }
 
  # Container for the conditions determining the severity of each series from the query.
  # The highest severity series of a signal determines that signal's severity.
  series_conditions {
    # Condition assigning a warn threshold for series above a certain threshold.
    condition {
      # Severity of the condition, which can be "warn" or "critical".
      severity = "warn"
 
      # Value to compare against each series from the query result.
      # For EXISTS or NOT_EXISTS operators, value must be set to zero or may be omitted.
      value = 5.0
 
      # Operator to use when comparing the query result versus the threshold.
      # Valid values can be one of GT, LT, LEQ, GEQ, EQ, NEQ, EXISTS, NOT_EXISTS.
      op = "GT"
 
      # Amount of time the query needs to fail the condition check before
      # an alert is triggered. Must be an integer. Accepts one of s (seconds), m
      # (minutes), or h (hours) as units. Optional.
      sustain = "240s"
 
      # Amount of time the query needs to no longer fire before resolving. Must be
      # an integer. Accepts one of s (seconds), m (minutes), or h (hours) as units.
      resolve_sustain = "60s"
 
    }
 
    condition {
      severity = "critical"
      value    = 10.0
      op       = "GT"
      sustain  = "120s"
      resolve_sustain = "60s"
    }
 
    # Multiple optional overrides can be defined for different sets of conditions
    # to series with matching labels.
    override {
      # One or more matchers for labels on a series.
      label_matcher {
        # Name of the label
        name = "app"
 
        # How to match the label, which can be "EXACT_MATCHER_TYPE" or
        # "REGEXP_MATCHER_TYPE".
        type = "EXACT_MATCHER_TYPE"
 
        # Value of the label.
        value = "dbmon"
      }
 
      condition {
        severity = "critical"
        value    = 1.0
        op       = "GT"
        sustain  = "60s"
      }
    }
  }
 
# If you define a schedule, Observability Platform evaluates the monitor only during
# the specified time ranges. The monitor is inactive during all unspecified
# time ranges.
# If you define an empty schedule, Observability Platform never evaluates the monitor.
  schedule {
    # Valid values: any IANA timezone string
    timezone = "UTC"
 
    range {
      # Time range for the monitor schedule. Valid values for day can be full
      # day names, such as "Sunday" or "Monday".
      # Valid time values must be specified in the range of 00:00 to 24:00.
      day   = "Monday"
      start = "07:00"
      end   = "10:10"
    }
 
    range {
      day   = "Monday"
      start = "15:00"
      end   = "22:30"
    }
 
    range {
      day   = "Thursday"
      start = "21:15"
      end   = "24:00"
    }
  }
}

Edit a monitor

You can edit monitors using Observability Platform, Chronoctl, or Terraform.

Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.

To edit a monitor:

  1. In the navigation menu select Alerts > Monitors.
  2. Click the name of the monitor you want to edit.
  3. To the right of the monitor's name, click the three vertical dots icon and select Edit monitor. This opens a sidebar where you can edit the monitor's properties.
  4. Make your edits, and then click Save. Refer to the monitor data model for specific definitions.

Use the Code Config tool

When adding or editing a monitor, you can click the Code Config tab to view code representations of the monitor for Terraform, Chronoctl, and the Chronosphere API. When you modify monitor properties in the Visual Editor tab, the Code Config tab immediately reflects those changes, letting you view the monitor expressed as Terraform resources, Chronoctl YAML, or API-compatible JSON.

Changes you make in the Visual Editor tab don't take effect until you click Save or apply the code representations using their corresponding tools.

If you manage a monitor using Terraform, you must use Terraform to apply any changes.

Change code representation

To change the code representation format:

  1. Click the Code Config tab.

  2. Click the format dropdown. The dropdown defaults to the format of the tool that manages the resource, such as Terraform.

  3. Select the desired format. You can then take several additional actions:

    • To copy the viewed code representation to your clipboard, click Copy.

    • To save the viewed code representation to a file, click Download.

    • To view a diff of unsaved changes you've made to a monitor, click View Diff. This button is available only if you've changed the monitor in the Visual Editor tab but haven't saved your changes.

      This Git-style diff of changes replaces the Copy and Download buttons with a toggle between Unified and Split diff styles, and the View Diff button with a Hide Diff button that returns you to the code representation view.

      You can also view unchanged lines in the context of the diff by clicking Expand X lines... inside the diff.

Override a monitor alert

You can override the default conditions that define when an alert triggers for a monitor. This override is similar to overriding a notification policy that routes a notification to a notifier other than the specified default.

On a monitor, you can specify a condition override to use a separate threshold for certain series. For example, a monitor might have a default threshold of >100 but you specify an override threshold of >50 where the label key/value pair is cluster=production.

You can specify any label as a matcher for a monitor condition override. If no override matches the defined conditions, Observability Platform applies the default conditions. Additionally:

  • Overrides must specify at least one matcher, and a series must match every matcher for the override to apply.
  • Observability Platform evaluates overrides in the listed order. When an override matches, the remaining overrides and defaults are ignored.
  • Overrides don't inherit any properties from the default conditions. For example, if the default policy route specifies warn and critical notifiers but the override specifies only critical notifiers, the notifier doesn't send warn notifications.

Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.

To specify a monitor alert override:

  1. In the navigation menu select Alerts > Monitors.
  2. Click the name of the monitor you want to specify an override for.
  3. To the right of the monitor's name, click the three vertical dots icon and select Edit monitor. This opens a sidebar where you can edit the monitor's properties.
  4. In the Condition Override section, click the plus icon to display the override fields.
  5. Select Exact or Regex as the matcher type, and enter the key/value pair to match on for the override.
  6. Select Critical or Warn as the override severity.
  7. Define the match condition, and enter a value and sustain duration.
  8. Click Save to apply the override changes.
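For a Terraform-managed monitor, the same kind of override is expressed as an override block inside series_conditions. This sketch reuses the schema from the Terraform example earlier on this page, with the cluster=production matcher and the >100 default and >50 override thresholds from the scenario described above:

```hcl
series_conditions {
  # Default threshold: alert when the value exceeds 100.
  condition {
    severity = "critical"
    value    = 100.0
    op       = "GT"
    sustain  = "60s"
  }

  # Tighter threshold applied only to series labeled cluster=production.
  override {
    label_matcher {
      name  = "cluster"
      type  = "EXACT_MATCHER_TYPE"
      value = "production"
    }

    condition {
      severity = "critical"
      value    = 50.0
      op       = "GT"
      sustain  = "60s"
    }
  }
}
```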

Delete a monitor

You can delete monitors using Observability Platform, Chronoctl, or Terraform.

Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.

To delete a monitor:

  1. In the navigation menu select Alerts > Monitors.
  2. Click the name of the monitor you want to delete.
  3. To the right of the monitor's name, click the three vertical dots icon and select Edit monitor.
  4. In the Edit Monitor dialog, click the three vertical dots icon and select Delete.

Use annotations with monitors

Create annotations for monitors that link to dashboards, runbooks, related documents, and trace metrics, which lets you provide direct links for your on-call engineers to help diagnose issues.

You can reference Prometheus Alertmanager variables in annotations with the {{ .VARIABLE_NAME }} syntax. Annotations can access monitor labels by using variables with the {{ .CommonLabels.LABEL }} pattern, and from the alerting metric with the {{ .Labels.LABEL }} pattern. In both patterns, replace LABEL with the label's name.

To reference labels in Alertmanager variables, you must include those labels in the alerting time series. Otherwise, the resulting notifier won't display any information for the variables you specify.

The following examples include annotations with variables based on a template. See the Alertmanager documentation for a reference list of alerting variables and templating functions.

To add annotations to a monitor:

  1. Create a monitor.

  2. In the Annotations section, add a description for your annotation in the Key field, and text or links in the Value field. For example, you might add the following key/value pairs as annotations:

    Key         Value
    summary     Instance {{$labels.instance}} is down
    description Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} terminated with {{ $labels.reason }}.
    runbook     http://default-runbook
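In Chronoctl YAML, an equivalent annotations section using the Alertmanager variable patterns might look like the following sketch. The service and instance labels are assumptions for illustration, and must exist on the alerting series for the variables to render:

```yaml
annotations:
    # Rendered from a label shared by all alerts in the notification.
    summary: "High error rate for {{ .CommonLabels.service }}"
    # Rendered from a label on the individual alerting metric.
    description: "Instance {{ .Labels.instance }} is failing health checks."
```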

Troubleshoot monitors and alerts

Use this information to help troubleshoot monitors and alerts.

Notifier doesn't trigger after a change, but alert is firing

Notifications are sent to the notifier after an alert triggers. Therefore, any change to the notifier takes effect only after the next time the alert triggers. To resolve this issue, either:

  • Wait until the alert fires again. The default repeat interval is 1 hour.
  • Recreate the alert.

Resolve unexpected alerting behavior

Use monitors to alert individuals or teams when data from a metric meets certain conditions. If monitors aren't configured correctly, they might send unexpected alerts, or might not send alerts when they should. Use the following methods to investigate and resolve unexpected behavior.

Check alerting thresholds

When creating a monitor, you define a condition and sustain period. If a time series triggers that condition for the sustain period, Observability Platform generates an alert.

To investigate an alert that's not notifying as intended, review the alerting threshold:

  1. Open the monitor you want to investigate.

  2. In the Query Results section, click the Show Thresholds toggle on the selected monitor to display the alerting thresholds for the monitor.

    A threshold line displays on the line graph so you can see whether your query broke the threshold, and for how long.

If your monitor is consistently breaking the defined threshold, consider modifying the defined conditions.

Review monitor alert metrics

After examining alerting thresholds, view the ALERTS and ALERTS_VALUE metrics:

  • ALERTS is a metric that shows the status of all monitors in Observability Platform. An ALERTS metric exists with a value of 1 for a monitor when its status is pending or firing, and doesn't exist when the alert threshold isn't met.
  • ALERTS_VALUE is a metric that shows the results of a monitor's evaluation. This metric can help determine whether the value of the monitor's evaluations exceeded the threshold.
To review these metrics:

  1. Open the monitor you want to investigate.

  2. Copy the name of the monitor from the monitor header.

  3. Click Open in Explorer to open the monitor query in Metrics Explorer.

  4. In the Metrics field, enter the following query:

    ALERTS{alertname="ALERT-NAME"}

    Replace ALERT-NAME with the name of the alert you copied previously.

  5. Click Run Query.

  6. In the table, the alertstate is either pending or firing:

    • pending indicates that the monitor met the defined criteria, but not the sustain period.
    • firing indicates that the monitor met both the defined criteria and the sustain period.
  7. In the Metrics field, enter the following query:

    ALERTS_VALUE{alertname="ALERT-NAME"}
  8. Click Run Query.

  9. Review the line graph to determine when the monitor starts alerting, and to identify any gaps in the data.

Pairing the ALERTS{alertname="ALERT-NAME"} query with your monitor query in the same graph can help determine the exact time when a monitor begins to alert.

The ALERTS_VALUE{alertname="ALERT-NAME"} query can identify gaps that can occur from latent data that's not included in the evaluation set.

Add offsets to your query

Not all metric data is ingested and available in near real-time when evaluating a monitor query. This latency can affect your monitor's results, causing false positive or false negative alerts if not handled properly.

When querying for different metric data types, it's important to understand where Observability Platform ingests the data from. Some exporters that rely on third-party APIs experience throttling and polling delays, which impacts the data you want to alert on in your monitor query.

For example, the Prometheus CloudWatch Exporter has an average polling delay of 10 minutes, which results in metric ingestion that lags the current time by that amount. See the Prometheus CloudWatch Exporter documentation for details.

To address this behavior in your monitors, add an offset modifier to your monitor query that equals or exceeds any metric polling delay. This setting forces the monitor to poll older data, but ensures that all delayed data is available when evaluating the query. Based on the Prometheus CloudWatch Exporter example, set offset 10m in your monitor query to account for the polling delay.
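For instance, a monitor query over CloudWatch-exported metrics might shift its evaluation window by the full polling delay. The metric name and grouping label below follow the CloudWatch Exporter naming convention but are illustrative assumptions:

```promql
# offset 10m matches the exporter's average polling delay, so the
# evaluated window contains only fully ingested data.
sum(rate(aws_elb_request_count_sum[5m] offset 10m)) by (load_balancer_name)
```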

The following query uses an offset of one minute to look back and ensure that the rollup results are fully calculated:

histogram_quantile(0.99, sum(rate(graphql_request_duration_seconds_bucket{namespace=~"consumer-client-api-gateway",operationType!="unknown",sub_environment=~"production",operationName=~"setStorefrontUserLocalePreference"}[2m] offset 1m)) by (le,operationName,operationType))