Deduplicate records
The deduplicate records processing rule looks for any records that contain identical key/value data within a specified time frame, then removes all but the earliest of those records.
When the deduplicate records rule waits for data to accumulate in your pipeline, it stores that data in memory. Increasing the value of the Time window parameter therefore also increases the memory load on your pipeline. For example, if 100,000 records pass through your pipeline during the specified time window, and each record is 1 kB, the deduplicate records rule adds approximately 100 MB of memory load.
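To illustrate the behavior, here is a minimal Python sketch, not Chronosphere's implementation: records are grouped into consecutive time windows, and only the earliest record for each distinct key value within a window is kept. The `deduplicate` function name and the `(timestamp, record)` input format are assumptions made for this example:

```python
import itertools

def deduplicate(records, key, window_seconds):
    """Illustrative sketch: keep the earliest record for each distinct
    value of `key` within each consecutive window of `window_seconds`.
    `records` is an iterable of (timestamp, dict) pairs in time order."""
    kept = []
    # Group records into consecutive tumbling windows by timestamp.
    for _, window in itertools.groupby(
        records, lambda r: int(r[0] // window_seconds)
    ):
        seen = set()  # key values already emitted in this window
        for ts, record in window:
            if key not in record:
                # Mirrors the "Ignore records without key" option.
                kept.append((ts, record))
                continue
            if record[key] in seen:
                continue  # later duplicate within the window: dropped
            seen.add(record[key])
            kept.append((ts, record))
    return kept
```

Note that a real pipeline must buffer records for the full window before emitting them, which is why longer time windows increase memory load.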
Configuration parameters
| Parameter | Description | Default |
|---|---|---|
| Time window | Required. How long to wait (in seconds) for data to accumulate in your pipeline before searching for duplicate records. For example, with a Time window value of `5`, two records with an identical key/value pair are considered duplicates if they both occur within the same five-second period, but not if they fall into different five-second periods. | none |
| Select key | Required. The key to use in your comparison. If multiple records have the same value assigned to this key, this rule removes all but the earliest record to contain that key/value pair within the specified time frame. You can also use record accessor syntax to reference keys nested within another object. | none |
| Ignore records without key | Indicates whether to skip any records that don't contain your specified Select key. Chronosphere recommends selecting this checkbox to prevent processing errors. | Selected |
| Comment | A custom note or description of the rule's function. This text is displayed next to the rule's name in the Actions list in the processing rules interface. | none |
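For nested keys, record accessor syntax lets the Select key point at a value inside another object. As an assumed illustration (following the Fluent Bit record accessor convention), a log shaped like the following could be deduplicated on the nested `pod_name` value by setting Select key to `$kubernetes['pod_name']`:

```json
{"kubernetes": {"pod_name": "web-1"}, "message": "All endpoints are functional."}
```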
Example
Using the deduplicate records rule lets you remove redundant information from your pipeline and reduce the amount of data that reaches your backend.
For example, given the following sample log data:
{"message": "All endpoints are functional."}
{"message": "All endpoints are functional."}
{"message": "All endpoints are functional."}
{"message": "All endpoints are functional."}
{"message": "The /purchase endpoint is unavailable."}
{"message": "The /purchase endpoint is unavailable."}
{"message": "The /purchase endpoint is unavailable."}
{"message": "The /purchase endpoint is partly unavailable."}
{"message": "The /purchase endpoint has been reset."}
{"message": "All endpoints are functional."}
{"message": "All endpoints are functional."}
A processing rule with a Time window value of `5` and a Select key value of `message` returns the following result:
{"message":"All endpoints are functional."}
{"message":"The /purchase endpoint is unavailable."}
{"message":"The /purchase endpoint is partly unavailable."}
{"message":"The /purchase endpoint has been reset."}
{"message":"All endpoints are functional."}
This rule removed all but the first instance of any logs with identical `message` values that appeared within the specified time frame. Because more than five seconds elapsed between the value `All endpoints are functional.` on line 1 and the same value on line 10, this rule retained both the log on line 1 and the log on line 10. However, because fewer than five seconds elapsed between the value `All endpoints are functional.` on line 10 and the same value on line 11, this rule removed the log on line 11.
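For reference, running the earlier sketch over the sample data reproduces this result. The timestamps are assumptions chosen so that lines 1 through 9 fall in the first five-second window and lines 10 and 11 fall in the next:

```python
records = [
    (0.5, {"message": "All endpoints are functional."}),
    (1.0, {"message": "All endpoints are functional."}),
    (1.5, {"message": "All endpoints are functional."}),
    (2.0, {"message": "All endpoints are functional."}),
    (2.5, {"message": "The /purchase endpoint is unavailable."}),
    (3.0, {"message": "The /purchase endpoint is unavailable."}),
    (3.5, {"message": "The /purchase endpoint is unavailable."}),
    (4.0, {"message": "The /purchase endpoint is partly unavailable."}),
    (4.5, {"message": "The /purchase endpoint has been reset."}),
    (6.0, {"message": "All endpoints are functional."}),  # line 10: new window
    (6.5, {"message": "All endpoints are functional."}),  # line 11: dropped
]

for _, record in deduplicate(records, "message", 5):
    print(record)  # prints the five retained records shown above
```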