Sample your traces

Distributed traces provide an additional layer of context for solving problems across complex systems that include hundreds or thousands of microservices. However, you want to ensure you're only ingesting tracing data that's relevant and valuable. To help control costs and maximize the usefulness of your tracing data, you can use the Trace Control Plane to narrow your focus to only a representative sample of your data and drop everything else.

To access the Trace Control Plane, in the navigation menu select Control > Trace Control Plane.

Head and tail sampling

The best-known strategies for sampling trace data are head sampling and tail sampling:

  • Head sampling is a blunter strategy that makes a sampling decision as early as possible, before the outcome of the request is known. It keeps only a defined percentage of traces, which yields a representative sample of whole traces.

  • Tail sampling is more fine-grained, and evaluates every trace after assembling all spans. Tail sampling rules can consider request outcomes, such as whether a request succeeded or how long it took to complete, which isn't possible with head sampling.
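The difference between the two strategies can be sketched in a few lines of Python. This is an illustrative sketch only, not how Observability Platform implements sampling: the `Span` type, the function names, and the 1000 ms slowness threshold are all hypothetical. The head decision uses nothing but the trace ID, while the tail decision inspects the assembled spans.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record used only for this sketch.
    trace_id: str
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at the start of a trace, using only the trace ID.

    Hashing the ID makes the decision deterministic, so every caller
    that sees the same trace ID reaches the same keep/drop decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def tail_sample(spans: list[Span], slow_ms: float = 1000.0) -> bool:
    """Decide after all spans are assembled, using request outcomes."""
    has_error = any(s.is_error for s in spans)
    is_slow = max((s.duration_ms for s in spans), default=0.0) >= slow_ms
    return has_error or is_slow
```

Note that `head_sample` can't tell an error trace from a healthy one, which is why outcome-based rules require tail sampling.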

Creating and managing head and tail sampling rules that keep the most impactful data and discard the rest can be challenging. To simplify this process and flatten the learning curve of sampling, Chronosphere developed two concepts to group, track, and apply sampling rules: datasets and behaviors.

Datasets

Create datasets to map sets of traces to named groups relevant to your organization so you can track processed and persisted bytes for those groups over time. Datasets don't impact your license consumption, so you can experiment with them to understand your license usage and make changes as needed. With datasets in place, you can then apply behaviors to them.
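To make the idea concrete, the following sketch models a dataset as a named set of match criteria and tallies processed and persisted bytes per dataset. Everything here is an assumption made for illustration: the `Dataset` and `UsageTracker` classes, the tag-equality matching, and the choice to let one trace count toward several datasets are not taken from the product.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Hypothetical model: a dataset matches a trace when every listed
    # tag is present on the trace with the required value.
    name: str
    match: dict

    def matches(self, trace_tags: dict) -> bool:
        return all(trace_tags.get(k) == v for k, v in self.match.items())


@dataclass
class UsageTracker:
    # Byte counters keyed by dataset name.
    processed: dict = field(default_factory=lambda: defaultdict(int))
    persisted: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, datasets, trace_tags, size_bytes, kept):
        """Count a trace against every dataset it matches."""
        for ds in datasets:
            if ds.matches(trace_tags):
                self.processed[ds.name] += size_bytes
                if kept:  # only sampled-in traces are persisted
                    self.persisted[ds.name] += size_bytes


datasets = [Dataset("checkout-team", {"team": "checkout"})]
usage = UsageTracker()
usage.record(datasets, {"team": "checkout"}, size_bytes=2048, kept=True)
usage.record(datasets, {"team": "checkout"}, size_bytes=1024, kept=False)
```

Comparing the two counters for a dataset shows how much of that group's processed data survives sampling.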

Behaviors

After creating datasets for individual business units, you can apply behaviors to your datasets to set sampling rates without needing to write and manage large sets of fine-grained sampling rules.

You set a baseline behavior that implements data-driven best practices with default parameters. You can modify those parameters based on the needs of your organization. For example, modify the defined criteria to drop low-value traces as quickly as possible and keep high-value traces at a specified rate for one or more datasets from a single behavior.

You can also set a behavior to allow (sample at 100%) or deny (sample at 0%) all traces for a specific period. For example, set an allow behavior when you need to increase the amount of high-fidelity data during a deploy, or when debugging issues. Alternatively, set a deny behavior when you want to decrease the amount of noisy or spam traces to keep your budget spend within limits.
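As a rough mental model, a behavior resolves to a sampling rate for each trace. The sketch below is hypothetical: the `Mode` and `Behavior` names, the parameter names, and the default values are invented for illustration and aren't the product's actual parameters. It shows allow and deny overriding everything, and a baseline that keeps error and slow traces at a higher rate than the rest.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    BASELINE = "baseline"
    ALLOW = "allow"  # sample at 100%
    DENY = "deny"    # sample at 0%


@dataclass
class Behavior:
    # All defaults below are illustrative assumptions.
    mode: Mode = Mode.BASELINE
    error_keep_rate: float = 1.0    # keep rate for traces with errors
    slow_keep_rate: float = 1.0     # keep rate for slow traces
    default_keep_rate: float = 0.05  # keep rate for fast, successful traces
    slow_ms: float = 1000.0         # duration at which a trace counts as slow


def keep_rate(behavior: Behavior, is_error: bool, duration_ms: float) -> float:
    """Return the fraction of matching traces to keep."""
    if behavior.mode is Mode.ALLOW:
        return 1.0
    if behavior.mode is Mode.DENY:
        return 0.0
    if is_error:
        return behavior.error_keep_rate
    if duration_ms >= behavior.slow_ms:
        return behavior.slow_keep_rate
    return behavior.default_keep_rate
```

For example, switching a dataset's behavior to `Mode.ALLOW` during a deploy keeps every trace regardless of outcome, and switching back to `Mode.BASELINE` restores outcome-based rates.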

Get started with sampling

Complete the following steps to get started with trace sampling in Observability Platform.

  1. Instrument head sampling, which is a prerequisite for using behaviors.

    Head sampling drives total trace volume for each root service and operation. Use the Trace Control Plane to manage head sampling as part of behaviors.

  2. Optional: Create head and tail sampling rules.

    You create these rules for your traces if you don't want to use behaviors for sampling management, or if you already have head and tail sampling rules you want to migrate to and manage within Observability Platform.

  3. Use the incomplete traces dataset to identify incomplete traces.

    Incomplete traces are those where one or more spans lack a parent span, for example because an upstream service isn't instrumented.

  4. Recommended: Improve local instrumentation to ensure a higher volume of complete traces before creating independent datasets.

  5. Create a dataset per team or per environment to track the exact volume of trace data for that team or environment over time.

  6. Assign a baseline behavior to your datasets to ensure that you capture all of your most meaningful traces (such as slow traces and error traces) for incident response purposes.

    Use the baseline behavior to optionally capture fewer of your lower-priority traces (such as fast traces and successful traces) to obtain a system baseline.
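Step 3 depends on recognizing incomplete traces. One way to picture the check, under the assumption that a span is missing its parent when it references a parent ID that never appears in the trace, is the sketch below. The `Span` type and function names are hypothetical, and the incomplete traces dataset might apply different or additional criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record used only for this sketch.
    span_id: str
    parent_id: Optional[str]  # None marks a root span


def orphan_spans(spans: list[Span]) -> list[Span]:
    """Return spans whose referenced parent is absent from the trace."""
    known_ids = {s.span_id for s in spans}
    return [
        s for s in spans
        if s.parent_id is not None and s.parent_id not in known_ids
    ]


def is_incomplete(spans: list[Span]) -> bool:
    """Treat a trace as incomplete when at least one span is orphaned."""
    return bool(orphan_spans(spans))


complete = [Span("a", None), Span("b", "a")]
broken = [Span("a", None), Span("c", "missing")]
```

Here `is_incomplete(complete)` is false, while `is_incomplete(broken)` is true because span `c` points at a parent that was never reported.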

As you learn more about your trace data, you can edit facets of the baseline behavior to modify your sampling strategy and more clearly define which traces to drop and which to keep. In most cases, you want to aggressively drop less interesting, lower-value traces and keep more interesting, higher-value traces. In the Observability Platform app, you can modify the baseline sampling strategy and apply it across one or more datasets.