Monitor migration

Convert Datadog monitors

Datadog monitors actively check infrastructure metrics and manage alerts on alerting platforms. Chronosphere uses monitors and alerts for the same purposes.

Datadog defines monitoring and notification in a single, longer file, while Chronosphere separates monitors and notifications into smaller logical configuration files. These smaller files let users make targeted changes without risking the entire configuration.

Compare configurations

These are examples of matching configurations for Datadog and Chronosphere.

This is an example of a Datadog monitor definition.

{
    "id": 1234567,
    "org_id": 12345,
    "type": "metric alert",
    "name": "IOWAIT is high ({{value}})",
    "message": "{{#is_alert}}\nLoad is too high, check and lower load immediately (use AWS console for {{pod.name}} to scale tasks to 1 and investigate)\n@slack-ops-bots \n{{/is_alert}} \n\n{{#is_alert_recovery}}\n@slack-ops-bots \n@pagerduty-resolve \n{{/is_alert_recovery}}{{#is_warning}}\nLoad is reaching the limit.\n@slack-ops-warning-bots\n@pagerduty{{/is_warning}}",
    "tags": [
        "high-load",
        "team:platform"
    ],
    "query": "min(last_30m):max:system.cpu.iowait{function:cassandraevents} by {pod,name} > 20",
    "options": {
        "notify_audit": false,
        "locked": false,
        "include_tags": true,
        "no_data_timeframe": 30,
        "require_full_window": true,
        "notify_by": ["pod"],
        "notify_no_data": true,
        "new_group_delay": 60,
        "renotify_interval": 30,
        "renotify_occurrences": 1,
        "renotify_statuses": [
            "alert",
            "no data"
        ],
        "scheduling_options": {
            "evaluation_window": {
                "hour_starts": 30
            }
        },
        "thresholds": {
            "critical": 20,
            "critical_recovery": 10,
            "warning": 15
        },
        "timeout_h": 12,
        "escalation_message": "{{#is_alert}}\nEscalated to pagerduty - \nLoad is too high, check and lower load immediately (use AWS console for {{pod.name}} to scale tasks to 1 and investigate)\n@slack-ops-bots \n@pagerduty \n{{/is_alert}}",
        "evaluation_delay": 300,
        "min_failure_duration": 120,
        "silenced": {}
    },
    "multi": true,
    "created_at": 1479858941000,
    "created": "2016-11-22T15:55:41.80188-08:00",
    "modified": "2021-10-14T09:23:36.750186-07:00",
    "deleted": null,
    "restricted_roles": null,
    "priority": 1,
    "overall_state_modified": "2022-07-05T06:13:14-07:00",
    "overall_state": "OK",
    "creator": {
        "name": "Jane Smith",
        "handle": "janesmith@example.com",
        "email": "janesmith@example.com",
        "id": 18219
    },
    "matching_downtimes": []
}

Field mapping

Many Chronosphere and Datadog fields serve equivalent functions. Use the following tables to map fields between the two products.

Names of Chronosphere equivalents are subject to change as the conversion process improves.

Configuration mapping

This table matches Datadog fields to their Chronosphere equivalents for monitor specification.

| Datadog field | Chronosphere equivalent |
| --- | --- |
| created | N/A |
| creator | N/A |
| id | Add to Monitor.labels. |
| message | Add to Monitor.annotations and create notification routes. See Message and route for details. |
| modified | N/A |
| multi | Monitor.spec.signal_grouping.signal_per_series |
| name | Monitor.name - This can also contain variables. |
| options | See Monitor options. |
| threshold_windows | N/A - Used only for anomalies. |
| thresholds | Monitor.spec.series_conditions.severity_conditions.conditions |
| timeout_h | N/A |
| overall_state | For monitors in an Ignored, Skipped, or Unknown state, still create the monitor, but either send it to a black hole route or create it as muted. |
| priority | Can be supported as a message annotation. |
| query | Monitor.spec.query.expr |
| restricted_roles | N/A |
| state | N/A |
| matching_downtimes | Equivalent to schedules. |
| tags | An arbitrary list of strings in tag format, which can include single-word tags. Chronosphere can support this using Monitor.labels. Because labels require a key/value format, single-word tags are used as label names with the value set to true. |
| type | The type of monitor. Chronosphere supports the query alert and metric alert types. |
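
As a rough illustration of the id and tags rows, the following Python sketch turns an exported Datadog monitor into a list of label name/value pairs. The function name monitor_labels, the datadog_id label name (borrowed from the notification policy example later on this page), the splitting of key:value tags, and the replacement of unsupported characters with underscores are all assumptions for illustration, not confirmed converter behavior.

```python
import re


def monitor_labels(monitor: dict) -> list:
    """Build label name/value pairs from a Datadog monitor export (hypothetical helper)."""
    # The Datadog id is kept as a label so the monitor can be traced back.
    labels = [{"name": "datadog_id", "value": str(monitor["id"])}]
    for tag in monitor.get("tags", []):
        key, sep, value = tag.partition(":")
        # Assumption: label names allow only letters, digits, and underscores.
        name = re.sub(r"[^A-Za-z0-9_]", "_", key)
        # Single-word tags have no value, so they get the value "true".
        labels.append({"name": name, "value": value if sep else "true"})
    return labels


example = {"id": 1234567, "tags": ["high-load", "team:platform"]}
print(monitor_labels(example))
```

With the tags from the sample monitor, high-load becomes a label named high_load with the value true, and team:platform becomes a label named team with the value platform.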

Monitor options

Use these values in the specification's options field.

| Datadog field | How to map |
| --- | --- |
| aggregation | N/A - For log alerts only. |
| enable_logs_sample | N/A - For log alerts only. |
| enable_samples | N/A - Per the Datadog docs, this is used only by CI Test and Pipeline monitors. |
| escalation_message | No separate message exists for renotify notifications. This can be appended to the generic alert message. |
| evaluation_delay | Can be supported by using offset in the query. |
| group_retention_duration | N/A - Not for metrics monitors. |
| groupby_simple_monitor | N/A - For log alerts only. |
| include_tags | Use Prometheus {{ $value }} template. |
| min_failure_duration | Monitor.spec.series_conditions.severity_conditions.conditions.sustain |
| min_location_failed | Can be supported by adding thresholds to the PromQL expression. |
| new_group_delay | N/A |
| new_host_delay | N/A - Deprecated, use new_group_delay instead. |
| no_data_timeframe | Threshold for a no data alert. See Severity for details. |
| notification_preset_name | N/A - See the Datadog docs. |
| notify_audit | N/A |
| notify_by | Equivalent to Monitor.spec.signal_grouping, except the inverse. Note: This can be set to *, which is the same as setting Monitor.spec.signal_grouping.signal_per_series. |
| notify_no_data | Add a NOT_EXISTS series condition in the MonitorSpec. See Severity for details. |
| on_missing_data | N/A - Not for metrics alerts. |
| renotify_interval | NotificationPolicy.routes.overrides.notifiers.repeat_interval |
| renotify_occurrences | N/A |
| renotify_statuses | Only renotify on status X. Create overrides using NotificationPolicy.routes.overrides.notifiers.repeat_interval for each severity listed here. |
| require_full_window | Only evaluate if there's a full window of data. Datadog recommends setting this to false. Can be supported using the count_over_time function. |
| scheduling_evaluation_window | Cumulative time windows. For example, "evaluate this alert every hour on the :00 mark". |
| silenced | Dictionary of muted tags to end timestamp. Create MutingRule objects for each tag. |
| thresholds | Thresholds for severity. Can map to MonitorSpec.series_conditions.severity_conditions for warning and critical. No support for separate thresholds for recovery. |
| variables | N/A |
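
To make the evaluation_delay row more concrete, here is a minimal Python sketch that converts a delay in seconds into a PromQL duration and appends it to a selector with the offset modifier. It assumes evaluation_delay is expressed in seconds, and the helper names prom_duration and with_offset, as well as the translated metric name in the usage line, are hypothetical.

```python
def prom_duration(seconds: int) -> str:
    """Format whole seconds as a single-unit PromQL duration such as 5m or 90s."""
    if seconds <= 0:
        raise ValueError("duration must be a positive number of seconds")
    # Prefer the largest unit that divides the value evenly.
    for unit, size in (("d", 86400), ("h", 3600), ("m", 60)):
        if seconds % size == 0:
            return f"{seconds // size}{unit}"
    return f"{seconds}s"


def with_offset(selector: str, evaluation_delay: int) -> str:
    """Append a PromQL offset modifier to an instant or range vector selector."""
    return f"{selector} offset {prom_duration(evaluation_delay)}"


# Hypothetical PromQL selector for the sample monitor's metric.
print(with_offset('system_cpu_iowait{function="cassandraevents"}', 300))
```

The offset modifier has to follow the selector it applies to, so in a larger expression the helper would be applied to each selector rather than to the expression as a whole.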

Severity

Chronosphere supports both critical and warning severities by applying different thresholds to the metric values. Datadog additionally supports alerting on no data for a particular metric as a distinct severity. Although this state isn't a true severity, it's treated the same as critical and warning alerts for configuration.

Chronosphere supports alerting on no data conditions using a series condition in the MonitorSpec:

api_version: v2
kind: Monitor
spec:
  spec:
    query:
      prometheus:
        expr: <promql query>
    series_conditions:
      defaults:
        critical:
          conditions:
            - op: NOT_EXISTS
              sustain: 60s

Message and route

Datadog allows different messages and routing endpoints for the different severity levels (critical, warning, no data). Chronosphere can support different messages by using separate annotations:

 
api_version: v2
kind: Monitor
spec:
  annotations:
    - name: message_critical
      value: This is the critical threshold message
    - name: message_warning
      value: This is the warning threshold message
    - name: message_no_data
      value: This is the message for no data
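
One way to derive these annotations from a Datadog message is sketched below in Python. It assumes that the is_alert, is_warning, and is_no_data blocks map to the message_critical, message_warning, and message_no_data annotations, and that @-handles belong to routing rather than to the message text. The function name message_annotations is hypothetical, and Datadog template variables such as {{pod.name}} are left untouched and would still need translating by hand.

```python
import re

# Assumed mapping from Datadog conditional blocks to annotation names.
BLOCK_TO_ANNOTATION = {
    "is_alert": "message_critical",
    "is_warning": "message_warning",
    "is_no_data": "message_no_data",
}
# Matches {{#is_x}} ... {{/is_x}} and captures the block name and its body.
BLOCK_RE = re.compile(r"\{\{#(is_\w+)\}\}(.*?)\{\{/\1\}\}", re.DOTALL)
HANDLE_RE = re.compile(r"@[A-Za-z0-9_-]+")


def message_annotations(message: str) -> list:
    """Split a Datadog message into one annotation per severity block."""
    annotations = []
    for block, body in BLOCK_RE.findall(message):
        name = BLOCK_TO_ANNOTATION.get(block)
        if name is None:
            continue  # For example is_alert_recovery, which has no annotation here.
        # Drop @-handles and collapse the remaining whitespace.
        text = " ".join(HANDLE_RE.sub("", body).split())
        if text:
            annotations.append({"name": name, "value": text})
    return annotations


sample = (
    "{{#is_alert}}\nLoad is too high\n@slack-ops-bots \n{{/is_alert}}"
    "{{#is_warning}}\nLoad is reaching the limit.\n@slack-ops-warning-bots\n@pagerduty{{/is_warning}}"
)
print(message_annotations(sample))
```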

To support different routes, use a separate monitor with different labels, and set the routes using notification policies.

Notification policy resources

Link a Monitor resource to a Notification resource by defining a notification policy. Each unique route in the Datadog message field maps to a Notification resource. The Monitor contains a label specifying the notification route it links to, and the default NotificationPolicy defines overrides that point to each Notification resource.

For example:

api_version: v2
kind: Monitor
spec:
  labels:
    # Label values are quoted so that YAML reads them as strings.
    - name: datadog_id
      value: "1234567"
    - name: route_slack_ops_bots_critical
      value: "true"
    - name: route_slack_ops_bots_warning
      value: "true"
    - name: route_pagerduty_critical
      value: "true"

---
api_version: v2
kind: NotificationPolicy
spec:
  routes:
    overrides:
      # Each override matches one route label and lists its notifiers.
      - alert_label_matchers:
          - {name: route_slack_ops_bots_critical, type: EXACT_MATCHER_TYPE, value: "true"}
        notifiers:
          critical:
            - slug: slack-ops-bots
              name: slack-ops-bots
      - alert_label_matchers:
          - {name: route_pagerduty_critical, type: EXACT_MATCHER_TYPE, value: "true"}
        notifiers:
          critical:
            - slug: pagerduty-critical
              name: pagerduty-critical
      - alert_label_matchers:
          - {name: route_slack_ops_bots_warning, type: EXACT_MATCHER_TYPE, value: "true"}
        notifiers:
          warning:
            - slug: slack-ops-warning-bots
              name: slack-ops-warning-bots
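
The route labels in the example above could be derived from the Datadog message field along the following lines. This Python sketch assumes that is_alert blocks map to the critical severity and is_warning blocks to warning, and that each @-handle becomes one label named route_<handle>_<severity>. That naming scheme and the function name route_labels are hypothetical, so the generated names won't necessarily match the hand-picked ones shown above.

```python
import re

# Assumed mapping from Datadog conditional blocks to Chronosphere severities.
BLOCK_TO_SEVERITY = {"is_alert": "critical", "is_warning": "warning"}
# Matches {{#is_x}} ... {{/is_x}} and captures the block name and its body.
BLOCK_RE = re.compile(r"\{\{#(is_\w+)\}\}(.*?)\{\{/\1\}\}", re.DOTALL)
HANDLE_RE = re.compile(r"@([A-Za-z0-9_-]+)")


def route_labels(message: str) -> list:
    """Return one route label per unique (handle, severity) pair in the message."""
    labels = []
    seen = set()
    for block, body in BLOCK_RE.findall(message):
        severity = BLOCK_TO_SEVERITY.get(block)
        if severity is None:
            continue  # Recovery and other blocks are not mapped in this sketch.
        for handle in HANDLE_RE.findall(body):
            name = f"route_{handle.replace('-', '_')}_{severity}"
            if name not in seen:
                seen.add(name)
                labels.append({"name": name, "value": "true"})
    return labels


sample = (
    "{{#is_alert}}\nLoad is too high\n@slack-ops-bots \n{{/is_alert}}"
    "{{#is_warning}}\n@slack-ops-warning-bots\n@pagerduty{{/is_warning}}"
)
for label in route_labels(sample):
    print(label["name"])
```

Each generated label would then need a matching override in the notification policy that points at the corresponding notifier.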

Evaluation frequency

Datadog doesn't support setting a different evaluation frequency per monitor. Instead, it relies on a hard-coded interval that depends on the evaluation window. For windows of less than 24h, the evaluation frequency is 1m. In Chronosphere, set the evaluation frequency to a desired value with the MonitorSpec.interval field, or default to 15s to receive faster alerts.
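
As a small sketch of that choice, the Python below reads the evaluation window from a Datadog query such as min(last_30m):... and suggests an interval value. It only understands windows written in minutes, hours, or days, the function names are hypothetical, and the string format expected by MonitorSpec.interval is an assumption. For windows of 24h or longer, this page doesn't state the Datadog frequency, so the sketch falls back to 15s.

```python
import re

# Matches the time window in queries such as "min(last_30m):...".
WINDOW_RE = re.compile(r"\(last_(\d+)([mhd])\)")
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}
DAY_SECONDS = 24 * 3600


def evaluation_window_seconds(query: str) -> int:
    """Return the evaluation window of a Datadog query in seconds."""
    match = WINDOW_RE.search(query)
    if match is None:
        raise ValueError("no last_<n><unit> window found in query")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def suggested_interval(query: str, match_datadog: bool = False) -> str:
    """Suggest an interval: 1m to mirror Datadog for short windows, else 15s."""
    window = evaluation_window_seconds(query)
    if match_datadog and window < DAY_SECONDS:
        return "1m"
    return "15s"


query = "min(last_30m):max:system.cpu.iowait{function:cassandraevents} by {pod,name} > 20"
print(evaluation_window_seconds(query), suggested_interval(query, match_datadog=True))
```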