Monitors

View and create monitors

One of the reasons to ingest and store time series data is to know when data meets or doesn't meet certain criteria. Use Chronosphere Observability Platform alerting to generate alerts and notifications from data, whether it's about your system or about your usage of Observability Platform itself.

For an overview of the Observability Platform approach to alerting, see the Chronosphere blog article Introducing Monitors: A better way to manage and interact with alerts.

View available monitors

You can view and filter monitors using Observability Platform, Chronoctl, or the Code Config tool.

To query and get detailed information about monitors, see Monitor details.

To display a list of defined monitors, in the navigation menu select Alerts > Monitors.

The list of monitors displays the status for each monitor next to its title:

  • Critical: the monitor is currently alerting because it exceeds the defined critical conditions.
  • Warning: the monitor is currently alerting because it exceeds the defined warning conditions.
  • Muted: the monitor is currently muted by an active muting rule.
  • Passing: the monitor isn't generating alerts.

You can filter your monitors using the following methods:

  • Using the Search monitors search box (an OR filter).
  • By team, using the Select a team dropdown.
  • By owner, using the Select an owner dropdown. An icon next to the owner indicates whether the monitor is part of a collection or a service.
  • By notification policy, using the Select a notification policy dropdown.
  • By error status.

Monitors with defined signals display the file tree icon. To view the signals from a displayed monitor, click the name of the monitor from the list.

From a monitor's detail page, you can click the name of a signal from the Signals section to filter the query results to alerts only from that signal.

To search for a specific monitor:

  1. Click the search bar to focus on it, or use the keyboard shortcut Control+K (Command+K on macOS).
  2. Begin typing any part of the monitor's name.
  3. Optional: Click the filters for all other listed resource types at the top of the search results to remove them and display only monitors.
  4. Click the desired search result, or use the arrow keys to select it and press Enter, to go to that monitor.

Create a monitor

You can create monitors using Observability Platform, Chronoctl, or Terraform. Most monitors alert when a value matches a specific condition, such as when an error condition defined by the query lasts longer than one minute.

You can also choose to alert when a value doesn't exist, such as when a host stops sending metrics and is likely unavailable. This condition triggers only if the entire monitor query returns no results. For example, to alert on missing or no data, add a NOT_EXISTS series condition in the series_conditions section of the monitor definition:

    series_conditions:
      defaults:
        critical:
          conditions:
            - op: NOT_EXISTS
              sustain_secs: 60

To receive alerts when a host stops sending metrics, create a separate monitor for each host and scope the monitor query to that host.
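For example, a per-host availability monitor might pair a host-scoped query with a NOT_EXISTS condition. The following sketch is illustrative only; the metric name and host label are assumptions, not part of the example above:

```promql
# Scoped to a single host. When this host stops reporting, the query
# returns no results and a NOT_EXISTS condition triggers the alert.
up{host="web-01"}
```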

Prerequisites

Before creating a monitor, complete the following tasks:

  1. Create a notifier to define where to deliver alerts and who to notify.
  2. Create a notification policy to determine how to route notifications to notifiers based on signals that trigger from your monitor. You select the notifier you created for the critical or warning conditions on the notification policy.

You can then use any of the following methods to create a new monitor.

To add a new monitor:

  1. In the navigation menu select one of these locations:

    • Alerts > Monitors.
    • Platform > Collections, and then select the collection you want to create a monitor for. This can be a standard collection or a service.
  2. Create the monitor:

    • From the Monitors page, click Create monitor.
    • From the Collections page, in the Monitors panel, click + Add.
  3. Enter the information for the monitor based on its data model.

  4. Select an Owner to organize and filter your monitor. You can select a collection or a service.

  5. Enter a Monitor Name.

  6. Choose a Notification Policy to determine which policy routes notifications at each alert severity.

  7. Enter Labels as key/value pairs to categorize and filter monitors.

  8. Enter a valid Prometheus or Graphite query in the Query section to generate a preview of query results. Use the preview to review your query for syntax errors, and open the query in the Query Builder to construct, optimize, and debug your query before saving it.

  9. Optional: Group alerts based on the results returned from the query by choosing an option in the Signals section.

    If you select per signal (multiple alerts) to generate multiple alerts, enter a label key that differs in name and casing from the label you enter in the Key field in the Labels section. For example, if you enter environment in the Key field, you might use Environments as the Label Key to match on. Global filters can be used as a Label Key.

  10. Define a condition and sustain period (duration of time) in the Conditions section, and assign the resulting alert a severity (warning or critical). In the Sustain field, enter a value followed by an abbreviated unit such as 60s. Valid units are s (seconds), m (minutes), h (hours), or d (days). The dialog also displays the notifiers associated with the monitor for reference.

    To alert on missing or no data, select not exists in the Alert when value dropdown.

  11. In the Resolve field, enter a time period for the resolve window as a value followed by an abbreviated unit such as 30s. Valid units are s (seconds), m (minutes), h (hours), or d (days).

  12. Add notes for the monitor in the Annotations section, such as runbooks, links to related dashboards, data links to related traces, and documentation links.

  13. Click Save.

Chronoctl monitor example

The following YAML definition consists of one monitor named Disk Getting Full. The series_conditions trigger a warning notification when disk usage exceeds 30% for more than 300 seconds, and a critical notification when it exceeds 60% for more than 300 seconds. It groups series into signals based on the source and service_environment label keys.

The schedule section indicates that this monitor runs each week on Mondays from 10:10 to 15:00, and Thursdays from 21:15 through the end of the day. All times are in UTC.

If you define label_names in the signal_grouping section, enter a label name that differs in name and casing from the label you enter in the labels section. For example, if you enter environment as a key in the labels section, you might use Environments in the label_names section.

api_version: v1/config
kind: Monitor
spec:
    # Required name of the monitor. Can be modified after the monitor is created.
    name: Disk Getting Full
    # PromQL query. If set, you can't set graphite_query.
    prometheus_query: max(disk:last{measurement="used_percent"}) by (source, service_environment, region)
    # Annotations are visible in notifications generated by this monitor.
    # You can template annotations with labels from notifications.
    annotations:
        key_1: "{{ $labels.job }}"
    # Slug of the collection the monitor belongs to.
    collection_slug: loadgen
    # Optional setting for configuring how often alerts are evaluated.
    # Defaults to 60 seconds.
    interval_secs: 60
    # Labels are visible in notifications generated by this monitor,
    # and can be used to route alerts with notification overrides.
    labels:
        key_1: kubernetes_cluster
    # Optional notification policy used to route alerts generated by the monitor.
    notification_policy_slug: custom-notification-policy
    schedule:
        # The timezone of the time ranges.
        timezone: UTC
        weekly_schedule:
            monday:
                active: ONLY_DURING_RANGES
                # The time ranges that the monitor is active on this day. Required if
                # active is set to ONLY_DURING_RANGES.
                ranges:
                    - # End time in the format "<hour>:<minute>", such as "15:30".
                      end_hh_mm: "15:00"
                      # Start time in the format "<hour>:<minute>", such as "15:30".
                      start_hh_mm: "10:10"
            tuesday:
                active: NEVER
            wednesday:
                active: NEVER
            thursday:
                active: ONLY_DURING_RANGES
                # The time ranges that the monitor is active on this day. Required if
                # active is set to ONLY_DURING_RANGES.
                ranges:
                    - # End time in the format "<hour>:<minute>", such as "15:30".
                      end_hh_mm: "24:00"
                      # Start time in the format "<hour>:<minute>", such as "15:30".
                      start_hh_mm: "21:15"
            friday:
                active: NEVER
            saturday:
                active: NEVER
            sunday:
                active: NEVER
    # Conditions evaluated against each queried series to determine the severity of each series.
    series_conditions:
        defaults:
            critical:
                # List of conditions to evaluate against a series.
                # Only one condition must match to assign a severity to a signal.
                conditions:
                    # To alert on missing or no data, change the value for `op` to `NOT_EXISTS`.
                    - op: GT
                      # How long the op operation needs to evaluate for the condition
                      # to evaluate to true.
                      sustain_secs: 300
                      # The value to compare to the metric value using the op operation.
                      value: 60
                      # How long the operation needs to evaluate false to resolve.
                      resolve_sustain: 60
            warn:
                # List of conditions to evaluate against a series.
                # Only one condition must match to assign a severity to a signal.
                conditions:
                    - op: GT
                      # How long the op operation needs to evaluate for the condition
                      # to evaluate to true.
                      sustain_secs: 300
                      # The value to compare to the metric value using the op operation.
                      value: 30
                      # How long the operation needs to evaluate false to resolve.
                      resolve_sustain: 60
    # Defines how the set of series from the query are split into signals.
    signal_grouping:
        label_names:
            - source
            - service_environment
        # If true, each series will have its own signal and label_names can't be set.
        signal_per_series: false

Terraform monitor example

The following Terraform resource creates a monitor with the Terraform resource name infra and a human-readable name of Infra Example monitor.

The schedule section runs this monitor each week on Mondays from 7:00 to 10:10 and 15:00 to 22:30, and Thursdays from 21:15 through the end of the day. All times are UTC, and Observability Platform won't run this monitor during the rest of the week.

If you define label_names in the signal_grouping section, enter a label name that differs in name and casing from the label you enter in the labels section. For example, if you enter environment as a key in the labels section, you might use Environments in the label_names section.

resource "chronosphere_monitor" "infra" {
  name = "Infra Example monitor"
 
  # Reference to the collection the alert belongs to.
  collection_id = chronosphere_collection.infra.id
 
  # Override the notification policy.
  # By default, uses the policy from the collection_id.
  notification_policy_id = chronosphere_collection.infra_testing.id
 
  # Arbitrary set of labels to assign to the alert.
  labels = {
    "priority" = "sev-1"
  }
 
  # Arbitrary set of annotations to include in alert notifications.
  annotations = {
    "runbook" = "http://default-runbook"
  }
 
  # Interval at which to evaluate the monitor, for example 15s, 30s, or 60s.
  # Defaults to 60s.
  interval = "30s"
 
  query {
    # PromQL query to evaluate for the alert.
    # Alternatively, you can use graphite_expr instead.
    prometheus_expr = "sum(rate(grpc_server_handled_total{grpc_code!=\"OK\"}[1m])) by (app, grpc_service, grpc_method)"
  }
 
  # The remaining examples are optional signals specifying how to group the
  # series returned from the query.
 
  # No signal_grouping clause = Per monitor
  # signal_grouping with label_names set = Per signal for labels set
  # signal_grouping with signal_per_series set to true = Per series
 
  signal_grouping {
    # Set of labels names used to split series into signals.
    # Each unique combination of labels results in its own signal.
    label_names = ["app", "grpc_service"]
 
    # As an alternative to label_names, signal_per_series creates an alert for
    # every resulting series from the query.
    # signal_per_series = true
  }
 
  # Container for the conditions determining the severity of each series from the query.
  # The highest severity series of a signal determines that signal's severity.
  series_conditions {
    # Condition assigning a warn threshold for series above a certain threshold.
    condition {
      # Severity of the condition, which can be "warn" or "critical".
      severity = "warn"
 
      # Value to compare against each series from the query result.
      # For EXISTS or NOT_EXISTS operators, value must be set to zero or may be omitted.
      value = 5.0
 
      # Operator to use when comparing the query result versus the threshold.
      # Valid values can be one of GT, LT, LEQ, GEQ, EQ, NEQ, EXISTS, NOT_EXISTS.
      op = "GT"
 
      # Amount of time the query needs to fail the condition check before
      # an alert is triggered. Must be an integer. Accepts one of s (seconds), m
      # (minutes), or h (hours) as units. Optional.
      sustain = "240s"
 
      # Amount of time the query needs to no longer fire before resolving. Must be
      # an integer. Accepts one of s (seconds), m (minutes), or h (hours) as units.
      resolve_sustain = "60s"
 
    }
 
    condition {
      severity = "critical"
      value    = 10.0
      op       = "GT"
      sustain  = "120s"
      resolve_sustain = "60s"
    }
 
    # Multiple optional overrides can be defined for different sets of conditions
    # to series with matching labels.
    override {
      # One or more matchers for labels on a series.
      label_matcher {
        # Name of the label
        name = "app"
 
        # How to match the label, which can be "EXACT_MATCHER_TYPE" or
        # "REGEXP_MATCHER_TYPE".
        type = "EXACT_MATCHER_TYPE"
 
        # Value of the label.
        value = "dbmon"
      }
 
      condition {
        severity = "critical"
        value    = 1.0
        op       = "GT"
        sustain  = "60s"
      }
    }
  }
 
# If you define a schedule, Observability Platform evaluates the monitor only during
# the specified time ranges. The monitor is inactive during all unspecified
# time ranges.
# If you define an empty schedule, Observability Platform never evaluates the monitor.
  schedule {
    # Valid values: any IANA timezone string
    timezone = "UTC"
 
    range {
      # Time range for the monitor schedule. Valid values for day can be full
      # day names, such as "Sunday" or "Monday".
      # Valid time values must be specified in the range of 00:00 to 24:00.
      day   = "Monday"
      start = "07:00"
      end   = "10:10"
    }
 
    range {
      day   = "Monday"
      start = "15:00"
      end   = "22:30"
    }
 
    range {
      day   = "Thursday"
      start = "21:15"
      end   = "24:00"
    }
  }
}

Edit a monitor

You can edit monitors using Observability Platform, Chronoctl, or Terraform.

Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.

To edit a monitor:

  1. In the navigation menu select Alerts > Monitors.
  2. Click the name of the monitor you want to edit.
  3. To the right of the monitor's name, click the three vertical dots icon and select Edit monitor. This opens a sidebar where you can edit the monitor's properties.
  4. Make your edits, and then click Save. Refer to the monitor data model for specific definitions.

Use the Code Config tool

When adding or editing a monitor, you can click the Code Config tab to view code representations of the monitor for Terraform, Chronoctl, and the Chronosphere API. When you modify monitor properties in the Visual Editor tab, the Code Config tab immediately reflects those changes, letting you view the monitor expressed as Terraform resources, Chronoctl YAML, or API-compatible JSON.

Changes you make in the Visual Editor tab don't take effect until you click Save or apply the code representations using their corresponding tools.

If you manage a monitor using Terraform, you must use Terraform to apply any changes.

Change code representation

To change the code representation format:

  1. Click the Code Config tab.

  2. Click the format dropdown. The dropdown defaults to the format of the tool that manages the resource, such as Terraform.

  3. Select the desired format. You can then take several additional actions:

    • To copy the viewed code representation to your clipboard, click Copy.

    • To save the viewed code representation to a file, click Download.

    • To view a diff of unsaved changes you've made to a monitor, click View Diff. This button is available only if you've changed the monitor in the Visual Editor tab but haven't saved your changes.

      This Git-style diff of changes replaces the Copy and Download buttons with a toggle between Unified and Split diff styles, and the View Diff button with a Hide Diff button that returns you to the code representation view.

      You can also view unchanged lines in the context of the diff by clicking Expand X lines... inside the diff.

Override a monitor alert

You can override the default conditions that define when an alert triggers for a monitor. This override is similar to overriding a notification policy that routes a notification to a notifier other than the specified default.

On a monitor, you can specify a condition override to use a separate threshold for certain series. For example, a monitor might have a default threshold of >100 but you specify an override threshold of >50 where the label key/value pair is cluster=production.

You can specify any label as a matcher for a monitor condition override. If no override matches the defined conditions, Observability Platform applies the default conditions. Additionally:

  • Overrides must specify at least one matcher, and a series must match every matcher for the override to apply.
  • Observability Platform evaluates overrides in the listed order. When an override matches, the remaining overrides and defaults are ignored.
  • Overrides don't inherit any properties from the default conditions. For example, if the default policy route specifies warn and critical notifiers but the override specifies only critical notifiers, the notifier doesn't send warn notifications.

Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.

To specify a monitor alert override:

  1. In the navigation menu select Alerts > Monitors.
  2. Click the name of the monitor you want to specify an override for.
  3. To the right of the monitor's name, click the three vertical dots icon and select Edit monitor. This opens a sidebar where you can edit the monitor's properties.
  4. In the Condition Override section, click the plus icon to display the override fields.
  5. Select Exact or Regex as the matcher type, and enter the key/value pair to match on for the override.
  6. Select Critical or Warn as the override severity.
  7. Define the match condition, and enter a value and sustain duration.
  8. Click Save to apply the override changes.
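For a Terraform-managed monitor, the same kind of override is expressed as an override block inside series_conditions. This sketch reuses the schema from the Terraform example earlier on this page, with the cluster=production matcher and the >100 default and >50 override thresholds from the scenario described above:

```hcl
series_conditions {
  # Default threshold: alert when the value exceeds 100.
  condition {
    severity = "critical"
    value    = 100.0
    op       = "GT"
    sustain  = "60s"
  }

  # Tighter threshold applied only to series labeled cluster=production.
  override {
    label_matcher {
      name  = "cluster"
      type  = "EXACT_MATCHER_TYPE"
      value = "production"
    }

    condition {
      severity = "critical"
      value    = 50.0
      op       = "GT"
      sustain  = "60s"
    }
  }
}
```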

Delete a monitor

You can delete monitors using Observability Platform, Chronoctl, or Terraform.

Users cannot modify Terraform-managed resources in the user interface, with Chronoctl, or by using the API. Learn more.

To delete a monitor:

  1. In the navigation menu select Alerts > Monitors.
  2. Click the name of the monitor you want to delete.
  3. To the right of the monitor's name, click the three vertical dots icon and select Edit monitor.
  4. In the Edit Monitor dialog, click the three vertical dots icon and select Delete.

Use annotations with monitors

Create annotations for monitors that link to dashboards, runbooks, related documents, and trace metrics, which lets you provide direct links for your on-call engineers to help diagnose issues.

You can reference Prometheus Alertmanager variables in annotations with the {{ .VARIABLE_NAME }} syntax. Annotations can access monitor labels by using variables with the {{ .CommonLabels.LABEL }} pattern, and from the alerting metric with the {{ .Labels.LABEL }} pattern. In both patterns, replace LABEL with the label's name.

To reference labels in Alertmanager variables, you must include those labels in the alerting time series. Otherwise, the resulting notifier won't display any information for the variables you specify.

The following examples include annotations with variables based on a template. See the Alertmanager documentation for a reference list of alerting variables and templating functions.

To add annotations to a monitor:

  1. Create a monitor.

  2. In the Annotations section, add a description for your annotation in the Key field, and text or links in the Value field. For example, you might add the following key/value pairs as annotations:

    Key         Value
    summary     Instance {{$labels.instance}} is down
    description Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} terminated with {{ $labels.reason }}.
    runbook     http://default-runbook
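In Chronoctl YAML, an equivalent annotations section using the Alertmanager variable patterns might look like the following sketch. The service and instance labels are assumptions for illustration, and must exist on the alerting series for the variables to render:

```yaml
annotations:
    # Rendered from a label shared by all alerts in the notification.
    summary: "High error rate for {{ .CommonLabels.service }}"
    # Rendered from a label on the individual alerting metric.
    description: "Instance {{ .Labels.instance }} is failing health checks."
```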

Troubleshoot monitors and alerts

Use this information to help troubleshoot monitors and alerts.

Notifier doesn't trigger after a change, but alert is firing

Notifications are sent to the notifier after an alert triggers. Therefore, any change to the notifier takes effect only after the next time the alert triggers. To resolve this issue, either:

  • Wait until the alert fires again. The default repeat interval is 1 hour.
  • Recreate the alert.

Resolve unexpected alerting behavior

Use monitors to alert individuals or teams when data from a metric meets certain conditions. If monitors aren't configured correctly, they might send unexpected alerts, or might not send alerts when they should. Use the following methods to investigate and resolve unexpected behavior.

Check alerting thresholds

When creating a monitor, you define a condition and sustain period. If a time series triggers that condition for the sustain period, Observability Platform generates an alert.

To investigate an alert that's not notifying as intended, review the alerting threshold:

  1. Open the monitor you want to investigate.

  2. In the Query Results section, click the Show Thresholds toggle on the selected monitor to display the alerting thresholds for the monitor.

    A threshold line displays on the line graph so you can see whether your query broke the threshold, and for how long.

If your monitor is consistently breaking the defined threshold, consider modifying the defined conditions.

Review monitor alert metrics

After examining alerting thresholds, view the ALERTS and ALERTS_VALUE metrics:

  • ALERTS is a metric that shows the status of all monitors in Observability Platform. An ALERTS metric exists with a value of 1 for a monitor when its status is pending or firing, and doesn't exist when the alert threshold isn't met.
  • ALERTS_VALUE is a metric that shows the results of a monitor's evaluation. This metric can help determine whether the value of the monitor's evaluations exceeded the threshold.
To review these metrics:

  1. Open the monitor you want to investigate.

  2. Copy the name of the monitor from the monitor header.

  3. Click Open in Explorer to open the monitor query in Metrics Explorer.

  4. In the Metrics field, enter the following query:

    ALERTS{alertname="ALERT-NAME"}

    Replace ALERT-NAME with the name of the alert you copied previously.

  5. Click Run Query.

  6. In the table, the alertstate is either pending or firing:

    • pending indicates that the monitor met the defined criteria, but not the sustain period.
    • firing indicates that the monitor met both the defined criteria and the sustain period.
  7. In the Metrics field, enter the following query:

    ALERTS_VALUE{alertname="ALERT-NAME"}
  8. Click Run Query.

  9. Review the line graph to determine when the monitor starts alerting, and to identify any gaps in the data.

Pairing the ALERTS{alertname="ALERT-NAME"} query with your monitor query in the same graph can help determine the exact time when a monitor begins to alert.

The ALERTS_VALUE{alertname="ALERT-NAME"} query can identify gaps that can occur from latent data that's not included in the evaluation set.

Add offsets to your query

Not all metric data is ingested and available in near real-time when evaluating a monitor query. This latency can affect your monitor's results, causing false positive or false negative alerts if not handled properly.

When querying for different metric data types, it's important to understand where Observability Platform ingests the data from. Some exporters that rely on third-party APIs experience throttling and polling delays, which impacts the data you want to alert on in your monitor query.

For example, the Prometheus CloudWatch Exporter has an average polling delay of 10 minutes, which results in metric ingestion that lags the current time by that amount. See the Prometheus CloudWatch Exporter documentation for details.

To address this behavior in your monitors, add an offset modifier to your monitor query that equals or exceeds any metric polling delay. This setting forces the monitor to poll older data, but ensures that all delayed data is available when evaluating the query. Based on the Prometheus CloudWatch Exporter example, set offset 10m in your monitor query to account for the polling delay.
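For instance, a monitor query over CloudWatch-exported metrics might shift its evaluation window by the full polling delay. The metric name and grouping label below follow the CloudWatch Exporter naming convention but are illustrative assumptions:

```promql
# offset 10m matches the exporter's average polling delay, so the
# evaluated window contains only fully ingested data.
sum(rate(aws_elb_request_count_sum[5m] offset 10m)) by (load_balancer_name)
```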

The following query uses an offset of one minute to look back and ensure that the rollup results are fully calculated:

histogram_quantile(0.99, sum(rate(graphql_request_duration_seconds_bucket{namespace=~"consumer-client-api-gateway",operationType!="unknown",sub_environment=~"production",operationName=~"setStorefrontUserLocalePreference"}[2m] offset 1m)) by (le,operationName,operationType))