Examples for tracing

Use Chronosphere tracing when you want to locate a service operation that's causing latency issues for other services that rely on it. The following examples use data from the OpenTelemetry Astronomy Shop Demo, an open source, microservice-based distributed system that illustrates the implementation of OpenTelemetry in a near real-world environment.
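The scenarios below all reason about where time is spent inside a trace. As a mental model (using hypothetical spans and durations, not the demo's actual data), a span's exclusive time is its total duration minus the time covered by its child spans; the service with the largest exclusive time is usually where the latency originates:

```python
# Hypothetical spans from one trace: (span_id, parent_id, service, duration_ms).
spans = [
    ("a", None, "frontend",     900),
    ("b", "a",  "ordering-svc", 850),
    ("c", "b",  "billing-svc",  800),  # most of the request is spent here
]

children_time = {}  # parent span_id -> summed duration of its children
for span_id, parent_id, _, duration in spans:
    if parent_id is not None:
        children_time[parent_id] = children_time.get(parent_id, 0) + duration

# Exclusive (self) time: duration not covered by a span's children.
self_time = {
    service: duration - children_time.get(span_id, 0)
    for span_id, _, service, duration in spans
}
print(self_time)  # {'frontend': 50, 'ordering-svc': 50, 'billing-svc': 800}
```

Here billing-svc accounts for almost all of the request's duration, even though the slowness surfaces as latency in frontend and ordering-svc.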

On-call triage

The following example highlights annotations, which you can use to link to tracing data from a dashboard. The example assumes that you received a notification from an alert that triggered for a monitor.

  1. You click a link in the notification, which directs you to the Order Service Latency monitor. This monitor tracks requests and errors for the ordering-svc service.

    On the Order Service Latency monitor, in the Query Results chart, you notice a continual spike in queries to the ordering-svc service.

  2. In the Annotations section of the monitor, you click a link to a dashboard.

  3. In the Order Service Overview dashboard, you notice a wave of spikes in requests to the /ordering.Ordering/Checkout operation of the ordering-svc service.

  4. On the Requests chart, click any point and then click Query Traces.

    The link opens Trace Explorer with a predefined search query that includes the service and operation you want to explore.

  5. On the Trace Explorer page, click the Topology View tab to view a mapping of affected upstream and downstream services.

  6. In the Search Services box, enter ordering-svc to scope the view to that service.

  7. Click the ordering-svc node to display details.

    In the Node Details panel, you see 176 incoming errors and 119 outgoing errors for the ordering-svc service. As you zoom in on the topology view, you notice that the edge connecting to the billing-svc service is thicker than the others.

  8. Click the billing-svc node.

    In the Node Details panel for the billing-svc, you notice that outgoing requests to the payment-gateway-svc are high.

  9. In the Node Details panel, click Include to include the billing-svc in your search query. Your search query now includes:

    • operation:/ordering.Ordering/Checkout
    • service:ordering-svc
    • service:billing-svc

    You've determined that the billing-svc service is generating the most errors, which also impacts the payment-gateway-svc service.

  10. On the Trace Explorer page, click Create Metric to create a trace metric for detecting future issues with the billing-svc service.

    Other on-call engineers can use this trace metric to open a predefined query in Trace Explorer and help reduce the time to identify and fix issues with this service.
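Conceptually, a trace metric turns spans that match your search into a time series. A minimal sketch of that idea, with hypothetical span data and field names (Chronosphere's actual implementation differs), counting billing-svc error spans per minute:

```python
# Hypothetical spans: (service, unix_timestamp_seconds, is_error).
spans = [
    ("billing-svc",  60,  True),
    ("billing-svc",  65,  True),
    ("billing-svc",  125, True),
    ("billing-svc",  130, False),
    ("ordering-svc", 62,  True),
]

# Count billing-svc error spans per one-minute bucket, like a trace
# metric you'd alert on to detect future error spikes for this service.
errors_per_minute = {}
for service, ts, is_error in spans:
    if service == "billing-svc" and is_error:
        bucket = ts // 60
        errors_per_minute[bucket] = errors_per_minute.get(bucket, 0) + 1

print(errors_per_minute)  # {1: 2, 2: 1}
```

Once the spans are reduced to a time series like this, standard monitor thresholds and alerts apply to it.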

Service exploration

The Services page provides efficient views into your services to help you discover ways of exploring your data. You can link directly to tracing data related to a specific service from the Services page. Chronosphere automatically generates these links to help you monitor services and connected telemetry.

  1. In the navigation menu, select Services.

  2. In the My services table, you notice that the Deployer Service has a monitor that's currently alerting because it exceeded its defined critical conditions.

  3. Select the Deployer Service to display its individual service page.

  4. In the Dependency Map, you notice that this service has downstream errors.

  5. In the Related Queries section, click View Traces to open Trace Explorer with the context defined in the service page.

  6. On the Trace Explorer page, in the Statistics section, select Leaf Errors from the Metric dropdown menu to highlight services that include error spans with no failing child spans.

    These errors are the deepest errors within a request flow, and are often the reason why an entire trace fails.

  7. In the sparklines chart, you notice spikes in leaf errors for the gatewayauth service.

  8. In the Group by field, enter Operation so that you're grouping results by both Service and Operation.

    In the Statistics table, you see that the auth.Auth/Authenticate operation of the gatewayauth service has the most leaf errors.

  9. Click the gatewayauth service to add it and the auth.Auth/Authenticate operation to your search filter.

  10. At the top of the page, click the link icon to copy a link with the defined filter criteria that you can share with the team that owns the gatewayauth service.

You identified the service and operation with the most leaf errors and can send a contextual link to the team responsible for that service. By focusing on leaf errors, you located the root issue impacting related traces and can provide that context to the owning team.

Start with trace data

The following example begins in Trace Explorer. Maybe you navigated here from Trace Metrics, a dashboard, or a monitor, and now you're exploring trace data to identify where issues are occurring.

  1. In the navigation menu, select Exploring > Trace Explorer.

  2. Set the Time Window to Within last and select 30 minutes.

  3. From the Showing Error States dropdown, click Only errors.

    This search returns too many traces to narrow down the issue. You think the issue relates to the frontend service, but don't know which related operation is the culprit. Modify the search criteria to narrow your search.

  4. In the Search Summary field, enter frontend and then click that service from the search results.

    Your search narrows the results and scope to only spans that include the frontend service. On the Statistics tab, you notice that the loadgenerator service has a high error rate.

  5. On the Statistics tab under the Error Percentage column, click loadgenerator, and then click Include in Span Filter in the resulting dialog to add the loadgenerator service to your search query.

    You know that the loadgenerator service is contributing to your trace latency, but still aren't sure what the main issue is.

  6. Click the Traces tab to view a list of the most relevant traces for your search.

  7. In the Trace column, click loadgenerator > HTTP GET to display the trace details for that service and operation combination.

    You notice errors in operations for two additional services related to the loadgenerator service. The GET operation on both the loadgenerator and frontend services has high latency.

  8. Click the frontend service, which updates the Span Details panel with information specific to that service and operation combination.

    You now have detailed information about the specific services and operations causing latency issues.

    In LINKS, click + Add Link to add a templated link to your external logging service, which provides other users access to the logs related to this span.

    In PROCESS, you identify k8s.pod.name, which is the Kubernetes pod that the GET request originates from. You can begin investigating that specific pod to remediate the issue.

  9. To the right of the value for k8s.pod.name, click the more icon and then click Add to Filter to add the value of that process to your search query.

  10. On the Trace Explorer page, you can click Create Metric to create a trace metric based on your updated search. You can use trace metrics to create dashboards and monitors for key metrics that you want to track and get alerts for.
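Each UI action in this walkthrough (Only errors, Include in Span Filter, Add to Filter) adds one predicate to the search, and a span matches when all predicates hold. A rough sketch of the final search as a span filter, using hypothetical span records and pod names:

```python
# Hypothetical span records with the fields used in this walkthrough.
spans = [
    {"service": "frontend", "operation": "HTTP GET",
     "k8s.pod.name": "frontend-abc123", "error": True},
    {"service": "frontend", "operation": "HTTP GET",
     "k8s.pod.name": "frontend-xyz789", "error": True},
    {"service": "loadgenerator", "operation": "HTTP GET",
     "k8s.pod.name": "loadgen-abc123", "error": False},
]

# One predicate per UI action; a span matches when all of them hold.
filters = [
    lambda s: s["error"],                              # Only errors
    lambda s: s["service"] == "frontend",              # service filter
    lambda s: s["k8s.pod.name"] == "frontend-abc123",  # Add to Filter on the pod
]

matches = [s for s in spans if all(f(s) for f in filters)]
print(len(matches))  # 1
```

Because the filter is just a conjunction of predicates, each step you took in Trace Explorer can only narrow the result set, never widen it.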