Deduplicate records

The deduplicate records processing rule looks for any records that contain identical key/value data within a specified time frame, then removes all but the earliest of those records.

Configuration parameters

⚠️

When the deduplicate records rule checks for duplicate records, it stores those records in memory. Increasing the value of the Time window parameter also increases the memory load on your pipeline. For example, if 100,000 records pass through your pipeline during the specified time period, and those records are 1 KB each, the deduplicate records rule will add approximately 100 MB of memory load.
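The figure in this warning comes from straightforward arithmetic: records held in memory during the time window multiplied by the average record size. A quick sketch of that estimate (the variable names and the 1 KB average are illustrative assumptions):

```python
# Back-of-the-envelope memory estimate for the warning above:
# records held during the time window x average record size.
records_in_window = 100_000   # records passing through during the window
avg_record_bytes = 1_024      # assumed average record size: 1 KB

memory_bytes = records_in_window * avg_record_bytes
print(f"~{memory_bytes / 1_000_000:.0f} MB")  # prints "~102 MB", roughly 100 MB
```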

  • Time window: Required. The length of the time window, in seconds, within which records are compared for duplicates. For example, with a value of 5, two records with an identical key/value pair are considered duplicates if they occur within five seconds of each other, but not if they occur more than five seconds apart.
  • Select key: Required. The key to use in your comparison. If multiple records have the same value assigned to this key, this rule removes all but the earliest record to contain that key/value pair within the specified time frame. You can also use record accessor syntax to reference keys nested within another object.
  • Ignore records without key: If enabled, the deduplicate records rule skips any records that don't contain your specified Select key. Chronosphere recommends enabling this setting to prevent processing errors.
  • Comment: A custom note or description of the rule's function. This text is displayed next to the rule's name in the Actions list in the processing rules interface.
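The behavior these parameters describe can be sketched as a small time-window filter. This is a minimal illustration, not the pipeline's actual implementation: it assumes records arrive as `(timestamp, dict)` pairs sorted by time, and the function name and record shape are invented for the example.

```python
def deduplicate(records, key, window_seconds, ignore_missing=True):
    """Keep only the earliest record for each distinct value of `key`
    seen within `window_seconds`. `records` is a list of
    (timestamp, dict) pairs, assumed sorted by timestamp."""
    last_kept = {}  # key value -> timestamp of the earliest retained record
    kept = []
    for ts, record in records:
        if key not in record:
            if ignore_missing:
                kept.append((ts, record))  # pass through untouched
                continue
            raise KeyError(f"record missing key: {key}")
        value = record[key]
        earliest = last_kept.get(value)
        if earliest is None or ts - earliest > window_seconds:
            # First occurrence, or the window has elapsed: keep it.
            last_kept[value] = ts
            kept.append((ts, record))
        # Otherwise it's a duplicate inside the window; drop it.
    return kept
```

With a window of 5, a record whose key/value pair last appeared 9 seconds earlier is retained, while one that appeared 1 second earlier is dropped, matching the behavior in the example below.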

Example

Using the deduplicate records rule lets you remove redundant information from your pipeline and reduce the amount of data that reaches your backend.

For example, given the following sample log data:

{"message": "All endpoints are functional."}
{"message": "All endpoints are functional."}
{"message": "All endpoints are functional."}
{"message": "All endpoints are functional."}
{"message": "The /purchase endpoint is unavailable."}
{"message": "The /purchase endpoint is unavailable."}
{"message": "The /purchase endpoint is unavailable."}
{"message": "The /purchase endpoint is partly unavailable."}
{"message": "The /purchase endpoint has been reset."}
{"message": "All endpoints are functional."}
{"message": "All endpoints are functional."}

A processing rule with the Time window value 5 and the Select key value message returns the following result:

{"message":"All endpoints are functional."}
{"message":"The /purchase endpoint is unavailable."}
{"message":"The /purchase endpoint is partly unavailable."}
{"message":"The /purchase endpoint has been reset."}
{"message":"All endpoints are functional."}

This rule removed all but the first instance of any logs with identical message values that appeared within the specified time frame. Because more than five seconds elapsed between the value All endpoints are functional on line 1 and the same value on line 10, the rule retained both logs. However, because fewer than five seconds elapsed between the value All endpoints are functional on line 10 and the same value on line 11, the rule removed the log on line 11.