Random sampling
The random sampling processing rule preserves a percentage of the records that pass through your pipeline and discards the rest. These records are chosen at random from the records that accumulate in your pipeline during each sampling interval.
While the random sampling rule waits for data to accumulate in your pipeline, it stores that data in memory, so increasing the value of the Time window parameter also increases the memory load on your pipeline. For example, if 100,000 records pass through your pipeline during the specified time period, and those records are 1 kB each, the random sampling rule adds approximately 100 MB of memory load.
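That memory estimate is straightforward arithmetic: records buffered per window multiplied by average record size. A minimal sketch (the function name and parameters are hypothetical, not part of the rule's configuration):

```python
def memory_load_mb(records_per_window, avg_record_kb):
    """Approximate extra memory (in MB) held while one time window's
    records accumulate before the sample is taken."""
    return records_per_window * avg_record_kb / 1000

# 100,000 records of ~1 kB each buffered during one time window
print(memory_load_mb(100_000, 1))  # 100.0
```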
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| Time window | Required. How long to wait (in seconds) for data to accumulate in your pipeline before taking a sample. Depending on how quickly data accumulates in your pipeline, increasing or decreasing the time between samples can affect how much data is preserved. | none |
| Sample % | Required. The percentage of data to preserve. Within each batch of accumulated data, the individual records to preserve are chosen at random, and the rest are discarded. This value must be a positive integer between 1 and 100. | none |
| Seed for random number generator | Optional. A seed for the random number generator that determines which records to preserve. This value must be a positive integer. | none |
| Comment | Optional. A custom note or description of the rule's function. This text is displayed next to the rule's name in the Actions list in the processing rules interface. | none |
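The interplay between Sample % and the seed can be sketched in Python. This is an illustrative model only; the rule's actual random number generator is unspecified, and `random_sample`, `pct`, and `seed` are hypothetical names:

```python
import random

def random_sample(batch, pct, seed=None):
    """Keep pct percent of a batch's records, chosen uniformly at random."""
    rng = random.Random(seed)             # same seed -> same records chosen
    keep = round(len(batch) * pct / 100)
    return rng.sample(batch, keep)

batch = list(range(100))
a = random_sample(batch, 10, seed=7)
b = random_sample(batch, 10, seed=7)
print(a == b)  # True: a fixed seed makes the selection reproducible
print(len(a))  # 10
```

Supplying a seed is useful when you need repeatable results, such as comparing pipeline runs over the same test data; without one, each batch is sampled independently at random.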
Example
Using the random sampling rule lets you reduce the size of your telemetry data while still retaining a general snapshot of events that occur during a specified time frame. For example, given this sample website log data:
```json
{"page_id":9,"action":"view"}
{"page_id":1,"action":"purchase"}
{"page_id":20,"action":"view"}
{"page_id":14,"action":"click"}
{"page_id":9,"action":"click"}
{"page_id":5,"action":"click"}
{"page_id":14,"action":"purchase"}
{"page_id":16,"action":"purchase"}
{"page_id":14,"action":"click"}
{"page_id":2,"action":"view"}
{"page_id":14,"action":"click"}
{"page_id":11,"action":"click"}
{"page_id":13,"action":"click"}
{"page_id":8,"action":"click"}
{"page_id":20,"action":"purchase"}
{"page_id":4,"action":"click"}
{"page_id":17,"action":"view"}
{"page_id":2,"action":"view"}
{"page_id":15,"action":"click"}
{"page_id":15,"action":"purchase"}
{"page_id":11,"action":"purchase"}
{"page_id":13,"action":"view"}
{"page_id":1,"action":"click"}
{"page_id":15,"action":"click"}
{"page_id":1,"action":"click"}
{"page_id":3,"action":"purchase"}
{"page_id":18,"action":"purchase"}
{"page_id":11,"action":"purchase"}
{"page_id":11,"action":"view"}
{"page_id":12,"action":"click"}
```
A processing rule with a Time window value of 60 and a Sample % value of 10 returns the following result:
```json
{"action":"click","page_id":15}
{"action":"purchase","page_id":15}
{"action":"click","page_id":12}
```
This rule retained 10% of logs that accumulated within a 60-second time frame. Since all of the sample logs accumulated within this time frame, and the original sample contained 30 logs, three random logs were retained and the other 27 logs were discarded.
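You can check the arithmetic with a short sketch. The records below are stand-ins for the 30 sample logs, and the selection logic is an assumption about the rule's behavior, not its actual implementation:

```python
import random

# Stand-ins for the 30 log records that accumulated during the time window.
logs = [{"page_id": i % 20 + 1, "action": "click"} for i in range(30)]

rng = random.Random()                     # the rule's RNG internals are unspecified
kept = rng.sample(logs, round(len(logs) * 10 / 100))
print(len(kept), len(logs) - len(kept))   # 3 27
```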