Monitor the Collector

If Collectors are unreachable, the self-scraping metrics might not reach Chronosphere. In these cases, you can create a monitor based on kube-state-metrics to gain visibility into the health of a Collector.

This method of monitoring the Collector assumes that you're scraping kube-state-metrics, and that the Collector responsible for scraping kube-state-metrics is healthy.

The following examples assume that Collector pod names include the chronocollector prefix.

DaemonSet Collector is unavailable

You can create a monitor with the following kube-state-metrics queries to determine whether a Collector that's running as a DaemonSet is unavailable.

Pods are unavailable

Monitor the following metrics to help determine whether DaemonSet pods are unavailable.

(kube_daemonset_status_number_unavailable{daemonset="chronocollector"} * on (instance, host_ip, cluster) group_left(pod) kube_pod_info{pod=~"chronocollector.+", pod_ip!="", created_by_kind="DaemonSet"}) > 0
(kube_daemonset_status_desired_number_scheduled{daemonset="chronocollector"} * on (instance, host_ip, cluster) group_left(pod) kube_pod_info{pod=~"chronocollector.+", pod_ip!="", created_by_kind="DaemonSet"} - on (pod) kube_daemonset_status_number_available{daemonset="chronocollector"} * on (instance, host_ip, cluster) group_left(pod) kube_pod_info{pod=~"chronocollector.+", pod_ip!="", created_by_kind="DaemonSet"}) > 0
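
If you don't need the pod-level detail from the kube_pod_info join, a simpler sketch (assuming the same chronocollector DaemonSet name) compares the desired and ready pod counts directly:

kube_daemonset_status_desired_number_scheduled{daemonset="chronocollector"} - kube_daemonset_status_number_ready{daemonset="chronocollector"} > 0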

Pod is restarting

Monitor the following metric to help determine whether a Collector pod is restarting. The query uses the kube_pod_container_status_restarts_total metric from kube-state-metrics and returns a non-zero value if any Collector pod restarted in the past five minutes:

rate(kube_pod_container_status_restarts_total{pod=~"chronocollector.+"}[5m]) > 0

Container is restarting

A container might restart for various reasons. For example, the Collector containers could be stuck in a crash loop due to a failed deployment, or there might be underlying platform issues. You can examine the Collector's container logs to help determine what caused a container to restart, and also review the kube_pod_container_status_terminated_reason metric.
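
For example, a query along these lines (a sketch; OOMKilled, Error, and Completed are among the reason values kube-state-metrics reports) surfaces Collector containers whose terminated state has a reason other than Completed:

kube_pod_container_status_terminated_reason{pod=~"chronocollector.+", reason!="Completed"} > 0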

Create a monitor with the following query to return the increase in the kube_pod_container_status_restarts_total metric over the past 15 minutes. This query is helpful for detecting container restarts:

increase(kube_pod_container_status_restarts_total{container=~"chronocollector"}[15m])
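
If you want the monitor to trigger only when a restart actually occurred, you can append a comparison. The threshold of 0 in this sketch is only an example:

increase(kube_pod_container_status_restarts_total{container=~"chronocollector"}[15m]) > 0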

Drops in the number of scraped targets

A significant drop in the number of scrape targets can indicate issues with the Collector, or with the metric endpoints themselves.

Create a monitor with the following query to detect drops in the number of active targets the Collector is scraping for a given instance. The query divides the change in the number of active targets over the past hour by the maximum number of active targets during a one-hour window 24 hours earlier, and expresses the result as a percentage. A large negative value indicates a significant drop:

sum by (instance) (delta(chronocollector_k8s_gatherer_processor_targets_active{job=~".*chronocollector.*"}[1h] offset 1m))/
sum by (instance) (max_over_time(chronocollector_k8s_gatherer_processor_targets_active{job=~".*chronocollector.*"}[1h] offset 24h)) * 100

Calculate scrape latency

Create a monitor with the following query to calculate scrape latency. The query calculates the P99 scrape latency from the chronocollector_scrape_latency_bucket histogram, grouped by the job and le labels:

histogram_quantile(0.99, sum by(job, le) (rate(chronocollector_scrape_latency_bucket{job=~".*chronocollector.*"}[1m])))

Detect error rates

Create a monitor with the following query to detect Collector error rates. The query calculates the ratio of successful pushes from the Collector over a one-minute window, which is 1 minus the error rate:

1 - (sum by (instance) (rate(chronocollector_gateway_push_errors{job=~".*chronocollector.*"}[1m])))/(sum by (instance) (rate(chronocollector_gateway_push_errors{job=~".*chronocollector.*"}[1m])) + sum by (instance) (rate(chronocollector_gateway_push_success{job=~".*chronocollector.*"}[1m])))

The output is a value between 0 and 1 that represents the proportion of successful pushes for the chronocollector job over the specified period. A value closer to 1 indicates a lower error rate, while a value closer to 0 indicates a higher error rate.
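
If you prefer a value that rises as failures increase, you can drop the 1 - and alert when the error ratio exceeds a threshold. The 5% threshold in this sketch is only an example:

(sum by (instance) (rate(chronocollector_gateway_push_errors{job=~".*chronocollector.*"}[1m])))/(sum by (instance) (rate(chronocollector_gateway_push_errors{job=~".*chronocollector.*"}[1m])) + sum by (instance) (rate(chronocollector_gateway_push_success{job=~".*chronocollector.*"}[1m]))) > 0.05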

Deployment Collector is unavailable

When the Collector is actively scraping metrics, an up metric is continuously generated for each target. Think of this metric as a heartbeat for the Collector. If the up metric stops reporting, the Collector can't scrape kube-state-metrics, and you can't use kube-state-metrics to determine whether your Collector is healthy.

To track the up metric, create a monitor that generates a notification if this metric is unavailable.

up{instance=~"chronocollector-ksm"} > 0
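
Alternatively, you can alert when the metric disappears entirely by using PromQL's absent() function, which returns 1 when no matching series exist. This sketch assumes the same chronocollector-ksm instance label:

absent(up{instance=~"chronocollector-ksm"})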

Troubleshoot metric ingestion

If the Collector can't start or scrape metrics, there are a few ways to troubleshoot the issue. These steps assume that you're running the Collector in a Kubernetes environment.

  1. Review the Collector logs for errors. Run the following command to view the Collector logs:
kubectl logs ds/chronocollector

For example, the following error indicates that the scrape timeout is too low for the job (the default is 10s):

{"level":"info","ts":1578154502.096278,"msg":"","level":"debug","scrape_pool":"collector","target":"http://0.0.0.0:9100/metrics","msg":"Scrape failed","err":"Get http://0.0.0.0:9100/metrics: context deadline exceeded"}
  2. If the logs don't provide any insights, ensure the API token and gateway address exist as a Secret in the Kubernetes cluster:
kubectl get secrets
  3. If there's no output to your terminal, a Secret doesn't exist. Create a Secret using kubectl:

    kubectl create secret generic chronosphere-secret \
      --from-literal=api-token=API_TOKEN \
      --from-literal=address=ADDRESS
    • API_TOKEN: The API token generated from your service account.
    • ADDRESS: The address of your Chronosphere instance, which is your company name prefixed to .chronosphere.io:443. For example, MY_COMPANY.chronosphere.io:443.
  4. Ensure annotations are properly configured in each of the pods you want to scrape, as shown in the example below. Refer to Collector Kubernetes annotations for more details.
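
To confirm that the expected annotations are present, you can list the annotations on the pods in a namespace. The NAMESPACE placeholder in this sketch is an assumption; replace it with the namespace your workloads run in:

kubectl get pods -n NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}'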