Query sources
In Observability Platform, query demand comes from either an automated or a manual source:

- Automated sources, such as monitors and recording rules with regular evaluation intervals, place a relatively predictable demand on query resources at the database and the browser. Queries from an automated source often request a smaller set of data than something like a dashboard query, but can incur high request volume. Although each individual query is small, together they demand a large amount of query resources at regular intervals.
- Manual sources, such as loading a dashboard, running a query in Metrics Explorer, or making a direct API call, place cyclical or spiky demands on query resources at both the database and the browser. Queries from a manual source are often exploratory, or retrieve longer time periods than automated queries. The load these queries place on the system varies significantly with the scale of data returned. Queries requesting more time series and data points require more resources to retrieve information from the database, and can also require more resources to deliver that information to the browser.

Individual queries of either type demand different query resources depending on the amount of data retrieved from the database, the amount of data returned to the browser, or both. For example:

- A query that places high load on the database but low load on the browser might request the sum of a large set of data. The query retrieves all relevant data points from the database (high database query load), but returns only a few summed data points to the browser, such as for a system performance monitoring workflow.
- A query that places high load on both the database and the browser might request the raw, unaggregated data points from all services (many unique time series) to return to the browser, such as for a debugging workflow.
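As a sketch, assuming a hypothetical `http_requests_total` metric, the two load profiles might look like this:

```promql
# High database load, low browser load: the database reads every matching
# series, but only one aggregated series reaches the browser.
sum(rate(http_requests_total[1h]))

# High load on both database and browser: every raw series is read from the
# database and returned, unaggregated, to the browser.
http_requests_total
```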
Automated source query limits
Observability Platform includes the Metrics Query Capacity Overview dashboard to measure your automated query consumption against system capacity. Use this dashboard to understand how much query capacity remains in your budget. The dashboard includes queries from monitors, recording rules, and service accounts in the reporting metrics.

Observability Platform uses selectors per second and data reads per second metrics to track automated source query limits. Exceeding either of these limits results in dropped queries with an HTTP 429 Too Many Requests status code. Observability Platform continues dropping an indiscriminate subset of incoming queries to keep query traffic within defined limits.

To guard against prematurely dropping queries when unforeseen spikes occur, Observability Platform doesn't start dropping queries until either of the query limits is exceeded for 10 consecutive minutes.
Selectors per second
These metrics track the count of selectors that queries issue per second, summed across all automated queries. For example, a query such as `sum(up)` includes a single selector, named `up`.
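To illustrate how selectors are counted, using hypothetical metric names:

```promql
# One selector: up
sum(up)

# Two selectors: errors_total and requests_total
sum(rate(errors_total[5m])) / sum(rate(requests_total[5m]))
```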
The selectors per second query limit can’t be increased.
Reduce selectors per second query load
Use the following strategies to reduce query load for selectors per second:

- Configure longer intervals for monitors and recording rules. For example, consider 1,000 monitor queries that include a single selector, each of which runs every 15 seconds. These figures calculate to roughly 66 selectors per second (1000/15). Increasing the execution interval to 60 seconds reduces the number of selectors per second to roughly 16 (1000/60).
- Structure alert and monitor queries to use PromQL aggregations for optimal efficiency. For example, instead of creating two separate alerts for service A and service B error rates, create a single monitor that checks both services.
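A sketch of this consolidation, assuming a hypothetical `errors_total` metric; the exact metric and label names will differ in your environment:

```promql
# Two separate alert queries, one per service (two selectors per evaluation):
sum(rate(errors_total{service="a"}[5m]))
sum(rate(errors_total{service="b"}[5m]))

# A single monitor covering all services (one selector per evaluation),
# with per-service results preserved for alerting:
sum by (service) (rate(errors_total[5m]))
```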
Data reads per second
These metrics track the amount of raw data that Observability Platform fetched to run the specified queries per second. Observability Platform calculates the data consumed by a query with this calculation, summed across the selected series:

max(DATAPOINTS_RETURNED_FOR_SERIES, 60)

DATAPOINTS_RETURNED_FOR_SERIES is the number of data points retrieved for query execution, and the minimum number of data points counted per series is 60.
Examples
Consider a range query `sum(up{app="webserver"})` with a time range of [now()-1h, now()]. If the query selects ten steady series with a scrape interval of 30 seconds, the total number of retrieved data points is 1,200, as defined by this equation:

10 series * (3,600 seconds / 30 seconds) = 1,200 data points
Now consider an instant query `up{app="database"}` that selects 20 time series with a scrape interval of one minute. Prometheus fetches five minutes of data by default, so the resulting number of fetched data points is 100, as defined by the following equation:

20 series * (5 minutes / 1 minute) = 100 data points

However, this equation doesn't apply the minimum number of data points retrieved per series, which is 60, so this equation isn't correct. Applying the minimum (60) of data points retrieved per series yields:

20 series * 60 data points = 1,200 data points
Reduce data reads per second
Use the following strategies to reduce the consumption of data reads per second:

- Write queries with a shorter time range or shorter range selectors.
- Create rollup rules for unnecessary labels in your most expensive queries to reduce the amount of data points the query needs to fetch.
- Ensure that query selectors are as precise as possible, especially in join queries. For example, in a join query where only one side is scoped to a cluster, including `{cluster="prod"}` in the latter part of the query as well reduces the number of series, and in turn data points, the query needs to fetch for the right side of the join.
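As a sketch, using hypothetical `http_requests_total` and `kube_pod_info` metrics, scoping both sides of the join might look like this:

```promql
# Unscoped right side: kube_pod_info is fetched for every cluster.
sum by (pod) (rate(http_requests_total{cluster="prod"}[5m]))
  * on (pod) group_left
  kube_pod_info

# Scoped right side: including {cluster="prod"} on kube_pod_info reduces the
# series, and in turn the data points, fetched for the right side of the join.
sum by (pod) (rate(http_requests_total{cluster="prod"}[5m]))
  * on (pod) group_left
  kube_pod_info{cluster="prod"}
```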
Optimize queries
Large queries might be good candidates for optimization by decreasing the amount of data the query is trying to retrieve. Use the Metrics Query Capacity Overview dashboard to identify large queries. Optimize queries by:

- Shortening the time window, or using rollup rules to decrease the data scale.
- Improving the query syntax by removing regular expressions to reduce the index lookups performed by the query.
- Adjusting the number of concurrent unique requests in the system.
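For example, when a regular expression matcher only needs to match one literal value, an exact matcher avoids the extra index lookups (hypothetical metric and label names):

```promql
# Regular expression matcher: requires broader index lookups.
sum(rate(http_requests_total{service=~"checkout.*"}[5m]))

# Exact matcher: cheaper index lookup when the full service name is known.
sum(rate(http_requests_total{service="checkout"}[5m]))
```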
Log query rate limits
When querying log data, be mindful of the following limits that can impact your search experience, what happens if your tenant exceeds limits, and how you can avoid hitting these limits.

- Global query limit: Each Observability Platform tenant has a global limit on the total compute resources available for queries. If this limit is reached, queries are queued and run as soon as resources become available.
- User query limit: This limit prevents a single person from consuming all available query resources. The user query limit is calculated based on the compute resources required to run individual queries, in addition to the number of user queries that are run. Each user is limited to a percentage of the total available query resources.
- API query limit: API queries are limited to a percentage of the total available query resources.
Avoid log query rate limits
If you experience slow or lagging queries on your log data, implement the following changes to help increase the speed of your queries:

- Close unneeded tabs in your browser. Open tabs with long-running queries, or tabs that refresh often, such as in Logs Explorer or in dashboards, consume more resources.
- Write efficient queries. Ensure your log queries are efficient so that they complete faster and make resources available for other queries to run.
- Include the primary key in your query. Query on one of the predefined keys, such as `service` or `severity`, and then use the query syntax to incorporate additional attributes.
- Narrow the query scope. Reduce the time range or scope of your query to return results faster.
Query truncation at the browser
Observability Platform can truncate queries that return many unique time series or data points to the browser to reduce the chances of long load times and query timeouts. Truncating queries protects the browser from crashing when querying many metrics. Dashboards and Metrics Explorer both use query truncation.

Observability Platform calculates the browser limit using a combination of the time series, time granularity, and time period (and their resulting data points) requested by the query. The limits are:

- Time series a query can return to the browser: 2,500
- Data points a query can return to the browser: 300,000

Enable or disable query truncation
By default, Observability Platform enables query truncation. To disable or re-enable query truncation:

- In the displayed dialog, select Options.
- Click the Truncate expensive query results toggle.
- Click Done.

Disabling query truncation causes expensive queries to take longer to load, and can result in timeouts (no results returned for a single query) or cause the browser to crash.

Query protections in the database
Observability Platform employs protections against very large queries in the database layer, in addition to the browser. A query that requests many unique time series or data points from the database can experience these situations:

- Individual query scale protections are in place.
- There are many queries sharing scarce query resources (query balancing).
- An individual query reached a timeout.
Query scale protections
Query scale protections ensure a positive user experience for the maximum number of users. Current database query scale protections are:

- Time series a query can retrieve: 300,000
- Standard data points a query can retrieve: 200,000,000

For Chronosphere histograms, the query limit depends on the number of buckets containing data within a histogram.
Query resource balancing
Observability Platform tracks the number of concurrent queries received, the scale of each query, and the unique users for each query. Observability Platform makes an effort to fairly balance the available query resources between large and small queries, and also between unique users. If there are many concurrent queries, each with relatively high scale, some queries might need to complete in stages and wait (be throttled) so that queries from other unique users can use the available resources. The more unique users there are, the more the system balances to enable different queries to complete.

Query timeouts
Even with balancing measures in place, if there are insufficient resources for a query to complete within a specified time period, the query times out. Observability Platform displays an error message indicating that the query timed out. If a query times out, try these methods to resolve the issue:

- Modify the query syntax to retrieve less data per query. For example, try to shorten the time range.
- Use shaping rules, such as aggregation rules, to downsample or roll up unnecessary labels, and then modify the query to retrieve aggregated data points instead of the raw data points.
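For example, assuming an aggregation or recording rule that publishes a pre-aggregated metric (the rule output name `endpoint:http_requests:rate5m` here is hypothetical), the query can read far fewer data points:

```promql
# Raw query: reads every underlying series, including high-cardinality labels.
sum by (endpoint) (rate(http_requests_total[5m]))

# Aggregated query: reads the rule output, which has already dropped
# unnecessary labels and downsampled the data.
sum by (endpoint) (endpoint:http_requests:rate5m)
```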