Metric types

Chronosphere supports both Prometheus and StatsD metric types.

For more information about the metric types for each client, see:

Chronosphere also supports the OpenTelemetry Collector.

Counters

Counters are a fundamental metric construct that keeps track of the number of times a certain event has occurred. Prometheus and StatsD track counters in different ways.

With running counts, metric clients keep an ever-increasing total of the number of events over the client's lifetime. A metrics backend periodically measures the client, either by receiving pushed values or by scraping them. With this model, each measurement is a snapshot of the counter's total lifetime value.

One advantage of running counts is that if a measurement is lost during transmission to the metrics backend, the next successful measurement recaptures the count. The gap caused by the lost measurement can be interpolated and filled in at query time.

The disadvantage of running counts is that when querying or graphing these metrics, you first need to apply a rate or per-second function to visualize the changes per time window and to aggregate multiple count metrics together. In addition, a running sum resets whenever the client resets, and there is no way to accurately sum series across these resets without first accounting for them.

Prometheus client

The Prometheus client uses running counts scraped at periodic intervals. PromQL, the Prometheus query language, has sophisticated support for rate methods that can interpolate missing data and deal with client resets.
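To make the reset handling concrete, here is a minimal Python sketch of how a per-second rate could be derived from running-count samples. It is a simplification for illustration, not PromQL's actual implementation (which also extrapolates to the window boundaries). The function names and sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, running count)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, tolerating counter resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The count dropped, so assume the client restarted and
            # began counting again from zero.
            increase += curr
    return increase


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second rate over the span covered by the samples."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical data: the scrape at t=30 was lost, and the client
# reset shortly before t=60.
samples = [(0, 100), (15, 130), (45, 190), (60, 10)]
total_increase = counter_increase(samples)
rate = per_second_rate(samples)
```

The lost scrape at t=30 costs nothing, because the sample at t=45 still carries the full running total. The drop from 190 to 10 is treated as a reset rather than as a negative change.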

Best practices

When aggregating counters, use PromQL's rate, irate, or increase functions.

When writing a rate aggregation, Chronosphere recommends choosing a range that's at least four times the scrape interval. For example, use a one-minute range for a 15-second scrape interval, and use a two-minute range for a 30-second scrape interval.
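The four-times rule is simple arithmetic. The following Python sketch derives the minimum range from a scrape interval and formats a PromQL rate query with it. The helper names and the metric name `http_requests_total` are assumptions for illustration only.

```python
def min_rate_range_seconds(scrape_interval_seconds: int, multiplier: int = 4) -> int:
    """Smallest recommended rate() range for a given scrape interval."""
    return scrape_interval_seconds * multiplier


def rate_query(metric: str, scrape_interval_seconds: int) -> str:
    """Build a PromQL query string that sums per-second rates over the minimum range."""
    window = min_rate_range_seconds(scrape_interval_seconds)
    return f"sum(rate({metric}[{window}s]))"


# Hypothetical metric name; a 15-second scrape interval gives a 60-second range.
query_15s = rate_query("http_requests_total", 15)
query_30s = rate_query("http_requests_total", 30)
```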

Gauges

Gauges are metrics that periodically record a measurement taken at a single point in time. Typical examples are memory usage and CPU utilization, which constantly fluctuate up and down.

Most, if not all, metric systems deal with gauges as discrete values and allow operations on them, so there isn't a major difference between how different clients manage them.

Timers - percentiles and histograms

Latencies are one of the most important metrics to keep track of in a microservices architecture. There are a few different ways to monitor latencies that have advantages and disadvantages.

Percentiles

Percentiles are the most common way of monitoring latencies (for example p50, p90, or p99) when you require the exact latency value or sample. Percentiles are popular because they give an exact sample (or one within a guaranteed error rate) for a timing or latency value. This is more accurate than a histogram can provide and makes it easier to calculate metrics such as SLA adherence.

However, this accuracy comes with trade-offs.

It's not recommended to use percentiles to calculate summaries if it's possible to use a histogram instead. There are three reasons: you can't aggregate percentiles, it's often hard to determine the time frame that a summary covers, and the Prometheus and StatsD client libraries implement summaries in different ways.

After percentiles are calculated, you can't merge them with other percentiles and maintain any statistical accuracy. For a more detailed explanation, read this article.

For example, the average of two p95 values doesn't equal the p95 for the combined set of values. This is a common problem in practice as users tend to emit timer metrics with many dimensions (labels), each of which produces a unique metric series that can't combine with any other metric series accurately.
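A small Python sketch shows the effect. It uses a nearest-rank percentile and two made-up sets of latencies from hypothetical service instances. The numbers are chosen only to make the gap obvious.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
instance_a = [10] * 95 + [20] * 5   # 100 requests, mostly fast
instance_b = [100] * 19 + [1000]    # 20 requests, all slow

p95_a = percentile(instance_a, 95)
p95_b = percentile(instance_b, 95)
averaged_p95 = (p95_a + p95_b) / 2
combined_p95 = percentile(instance_a + instance_b, 95)
```

With this data, the average of the two per-instance p95 values is 55 ms, while the p95 of the combined samples is 100 ms. Averaging ignores both the shape of each distribution and the number of requests each instance served.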

To solve this problem, emit a different timer metric for each combination of labels required in a query. However, this results in even more resource consumption, and there are always limits. Timers emitted from different data centers or regions can never be accurately combined. Because of this, global latencies aren't possible with timer percentiles.

Percentiles are the most accurate way to measure latencies and timers but are only recommended when there is no need to aggregate values across multiple instances of a service (or multiple clients).

Histograms

Histograms are by far the best-performing way of monitoring latencies and the most accurate way to aggregate across latencies.

Histograms define buckets with ranges, and the client records how many values fall within particular buckets.

Histograms perform better than percentiles because they require only one count per bucket, and histograms can be accurately aggregated across time series or clients provided they have the same buckets configured.

This means that you can accurately aggregate histograms across multiple clients or regions without having to emit additional time series for aggregate views.
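As a rough illustration, merging histograms amounts to adding up the counts of matching buckets. The sketch below assumes cumulative bucket counts keyed by upper bound in seconds. The type alias, function name, and region data are hypothetical.

```python
import math
from typing import Dict

Histogram = Dict[float, int]  # bucket upper bound (seconds) -> cumulative count


def merge_histograms(*histograms: Histogram) -> Histogram:
    """Sum bucket counts across histograms that share the same boundaries."""
    bounds = set(histograms[0])
    merged = dict.fromkeys(sorted(bounds), 0)
    for histogram in histograms:
        if set(histogram) != bounds:
            raise ValueError("histograms must share the same bucket boundaries")
        for bound, count in histogram.items():
            merged[bound] += count
    return merged


# Hypothetical per-region histograms with identical bucket boundaries.
us_east = {0.1: 50, 0.5: 90, 1.0: 99, math.inf: 100}
eu_west = {0.1: 10, 0.5: 40, 1.0: 45, math.inf: 50}
global_view = merge_histograms(us_east, eu_west)
```

The merged result is exactly the histogram a single client would have produced had it observed all the requests, which is why no additional time series are needed for the aggregate view.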

You can use histograms for accurate calculations, such as SLA adherence. However, you need to ensure you define the SLA as one of the boundary values of one of the histogram buckets.
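The reason the SLA must sit on a bucket boundary is that the histogram only knows exact counts at its boundaries. A minimal Python sketch, assuming cumulative bucket counts keyed by upper bound in seconds (the function name and data are hypothetical):

```python
import math
from typing import Dict


def sla_adherence(cumulative_buckets: Dict[float, int], sla_seconds: float) -> float:
    """Fraction of observations at or below the SLA threshold."""
    if sla_seconds not in cumulative_buckets:
        # Without a matching boundary the count can only be estimated.
        raise KeyError(f"no bucket boundary at {sla_seconds} seconds")
    total = cumulative_buckets[math.inf]
    if total == 0:
        raise ValueError("histogram has no observations")
    return cumulative_buckets[sla_seconds] / total


# Hypothetical cumulative counts; the SLA of 0.5 seconds is a bucket boundary.
buckets = {0.1: 60, 0.5: 130, 1.0: 144, math.inf: 150}
adherence = sla_adherence(buckets, 0.5)
```

An SLA of 0.3 seconds would raise an error here, because the histogram can't tell how many of the observations between 0.1 and 0.5 seconds fell below 0.3.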

The downside of histograms is that they require you to think about the expected timer or latency ranges ahead of time and configure buckets accordingly, because you can't change histogram buckets after configuring them.

You can configure clients to create histograms with pre-configured latency bucket sets.

For example:

  • low latency (1 microsecond to 1 second)
  • medium latency (1 millisecond to 10 seconds)
  • high latency (1 second to 1 hour)

The client can also give the user easier ways to create buckets, such as linear buckets or log-linear buckets.
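The following Python sketch shows one way such helpers might generate bucket boundaries. The function names are hypothetical, and "log-linear" is interpreted here as splitting each power-of-ten range into equal-width steps. A given client library may define it differently.

```python
from typing import List


def linear_buckets(start: float, width: float, count: int) -> List[float]:
    """Equal-width bucket boundaries: start, start + width, and so on."""
    return [start + i * width for i in range(count)]


def log_linear_buckets(min_exp: int, max_exp: int, steps_per_decade: int = 9) -> List[float]:
    """Boundaries from 10**min_exp to 10**max_exp, linear within each decade."""
    bounds: List[float] = []
    for exp in range(min_exp, max_exp):
        decade = 10.0 ** exp
        step = (10 * decade - decade) / steps_per_decade
        bounds.extend(decade + i * step for i in range(steps_per_decade))
    bounds.append(10.0 ** max_exp)
    return bounds


linear = linear_buckets(0.5, 0.5, 4)
log_linear = log_linear_buckets(0, 2)  # 1, 2, ..., 9, 10, 20, ..., 90, 100
```

Log-linear boundaries keep the relative resolution roughly constant across several orders of magnitude, which suits latency data better than a single linear scale.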

In summary, histograms are the best-performing way of monitoring latencies and the most accurate way of aggregating latencies across clients. However, they can't provide the exact percentile sample.

Here is an example of histograms and their conversion to percentiles:

  • The first graph shows the cumulative histogram buckets. A majority of the values (75%) are in the 0.002 second bucket, and most are in the 0.004 second bucket, with everything being less than 9 seconds (the topmost line in the first graph).
  • The second graph is the p75 computed from the histogram, and is 0.002.
  • The third graph shows the p99, which turns out to be about 0.004, as most values are within that.

[Figure: timer examples]
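The conversion from cumulative buckets to a percentile can be sketched in Python as follows. The approach, linear interpolation inside the bucket that contains the target rank, is modeled loosely on how PromQL's histogram_quantile behaves, but this is a simplified illustration, and the bucket counts are invented to match the proportions described above.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, int]  # (upper bound in seconds, cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from sorted cumulative buckets.

    The last bucket is expected to have an upper bound of infinity.
    """
    total = buckets[-1][1]
    target = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= target:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Nothing to interpolate against, so fall back to the lower edge.
                return lower_bound
            fraction = (target - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical counts: 75% of values at or below 0.002 s, 99% at or below 0.004 s,
# and everything below 9 s.
buckets = [(0.001, 0), (0.002, 750), (0.004, 990), (9.0, 1000), (math.inf, 1000)]
p75 = histogram_quantile(0.75, buckets)
p99 = histogram_quantile(0.99, buckets)
```

The result is always an estimate bounded by the bucket edges, which is why a histogram can't return the exact percentile sample.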

Clients

The Prometheus client provides support for both percentiles (called summaries) and histograms.

  • Summaries: Calculate different quantiles or percentiles across a set of measurements before reporting them to the metrics server. They provide per-instance percentile information.
  • Histograms: Put measurements into configurable buckets that can be aggregated across all instances to provide percentile numbers.