Random sampling
The random sampling processing rule preserves a percentage of the records that pass through your pipeline and discards the rest. These records are chosen at random from the records that accumulate in your pipeline during each sampling interval.
While the random sampling rule waits for data to accumulate in your pipeline, it stores that data in memory, so increasing the value of the Time window parameter also increases the memory load on your pipeline. For example, if 100,000 records pass through your pipeline during the specified time period, and those records are 1 kB each, the random sampling rule adds approximately 100 MB of memory load.
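That memory estimate is straightforward arithmetic: records buffered per window multiplied by average record size. A minimal sketch (the function name and parameters are hypothetical, not part of the rule's configuration):

```python
def memory_load_mb(records_per_window, avg_record_kb):
    """Approximate extra memory (in MB) held while one time window's
    records accumulate before the sample is taken."""
    return records_per_window * avg_record_kb / 1000

# 100,000 records of ~1 kB each buffered during one time window
print(memory_load_mb(100_000, 1))  # 100.0
```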
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| Time window | Required. How long to wait (in seconds) for data to accumulate in your pipeline before taking a sample. Depending on how quickly data accumulates in your pipeline, increasing or decreasing the time between samples can affect how much data is preserved. | none |
| Sample % | Required. The percentage of data to preserve. Within each batch of accumulated data, the individual records to preserve are chosen at random, and the rest are discarded. This value must be a positive integer between 1 and 100. | none |
| Seed for random number generator | Optional. A seed for the random number generator that determines which records to preserve. This value must be a positive integer. | none |
| Comment | Optional. A custom note or description of the rule's function. This text is displayed next to the rule's name in the Actions list in the processing rules interface. | none |
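The interplay between Sample % and the seed can be sketched in Python. This is an illustrative model only; the rule's actual random number generator is unspecified, and `random_sample`, `pct`, and `seed` are hypothetical names:

```python
import random

def random_sample(batch, pct, seed=None):
    """Keep pct percent of a batch's records, chosen uniformly at random."""
    rng = random.Random(seed)             # same seed -> same records chosen
    keep = round(len(batch) * pct / 100)
    return rng.sample(batch, keep)

batch = list(range(100))
a = random_sample(batch, 10, seed=7)
b = random_sample(batch, 10, seed=7)
print(a == b)  # True: a fixed seed makes the selection reproducible
print(len(a))  # 10
```

Supplying a seed is useful when you need repeatable results, such as comparing pipeline runs over the same test data; without one, each batch is sampled independently at random.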
Example
Using the random sampling rule lets you reduce the size of your telemetry data while still retaining a general snapshot of events that occur during a specified time frame. For example, given this sample website log data:
```json
{"page_id":9,"action":"view"}
{"page_id":1,"action":"purchase"}
{"page_id":20,"action":"view"}
{"page_id":14,"action":"click"}
{"page_id":9,"action":"click"}
{"page_id":5,"action":"click"}
{"page_id":14,"action":"purchase"}
{"page_id":16,"action":"purchase"}
{"page_id":14,"action":"click"}
{"page_id":2,"action":"view"}
{"page_id":14,"action":"click"}
{"page_id":11,"action":"click"}
{"page_id":13,"action":"click"}
{"page_id":8,"action":"click"}
{"page_id":20,"action":"purchase"}
{"page_id":4,"action":"click"}
{"page_id":17,"action":"view"}
{"page_id":2,"action":"view"}
{"page_id":15,"action":"click"}
{"page_id":15,"action":"purchase"}
{"page_id":11,"action":"purchase"}
{"page_id":13,"action":"view"}
{"page_id":1,"action":"click"}
{"page_id":15,"action":"click"}
{"page_id":1,"action":"click"}
{"page_id":3,"action":"purchase"}
{"page_id":18,"action":"purchase"}
{"page_id":11,"action":"purchase"}
{"page_id":11,"action":"view"}
{"page_id":12,"action":"click"}
```
A processing rule with a Time window value of 60 and a Sample % value of 10 returns the following result:
```json
{"action":"click","page_id":15}
{"action":"purchase","page_id":15}
{"action":"click","page_id":12}
```
This rule retained 10% of logs that accumulated within a 60-second time frame. Since all of the sample logs accumulated within this time frame, and the original sample contained 30 logs, three random logs were retained and the other 27 logs were discarded.
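You can check the arithmetic with a short sketch. The records below are stand-ins for the 30 sample logs, and the selection logic is an assumption about the rule's behavior, not its actual implementation:

```python
import random

# Stand-ins for the 30 log records that accumulated during the time window.
logs = [{"page_id": i % 20 + 1, "action": "click"} for i in range(30)]

rng = random.Random()                     # the rule's RNG internals are unspecified
kept = rng.sample(logs, round(len(logs) * 10 / 100))
print(len(kept), len(logs) - len(kept))   # 3 27
```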