Filtering Out Duplicate Records
The simplest way to remove duplicate records is to add a Filter stage to your dataflow after a matching stage. The Filter stage removes records from collections of duplicate records based on the settings you specify.
- In Enterprise Designer, create a dataflow that identifies duplicate records through matching.
Matching is the first step in deduplication because you need to identify records that are similar, such as records that have the same account number or name. See the following topics for instructions on creating a dataflow that matches records.
- Matching Records from a Single Source
- Matching Records from One Source to Another Source
- Matching Records Against a Database
Note: You only need to build the dataflow to the point where it reads data and performs matching with an Interflow Match, Intraflow Match, or Transactional Match stage. Once you have created a dataflow to this point, continue with the following steps.
- Once you have defined a dataflow that reads data and matches records, drag a Filter stage to the canvas and connect it to the stage that performs the matching (Interflow Match, Intraflow Match, or Transactional Match).
For example, if your dataflow reads data from a file and performs matching with Intraflow Match, then after you add the Filter stage the dataflow consists of the file source stage connected to Intraflow Match, which is connected to Filter.
- Double-click the Filter stage on the canvas.
- In the Group by field, select CollectionNumber.
- Leave the option Limit number of returned duplicate records selected and the value set to 1. These are the default settings.
- Decide whether to keep the first record in each collection or to define a rule that chooses which record from each collection to keep. To keep the first record in each collection, skip this step. To define a rule, select Rules in the rule tree and define the rule.
- Click OK to close the Filter Options window.
- Drag a sink stage onto the canvas and connect it to the Filter stage.
For example, if you use a Write to File sink stage, the Filter stage connects to Write to File, which becomes the last stage in the dataflow.
- Double-click the sink stage and configure it.
For information on configuring sink stages, see the Dataflow Designer's Guide.
You now have a dataflow that identifies matching records and removes all but one record for each group of duplicates, resulting in an output file that contains deduplicated data.
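To make the effect of the Filter settings concrete, the following Python sketch imitates the logic outside the product: it groups records by CollectionNumber and keeps a limited number from each group. It is only an illustration under stated assumptions, not the Filter stage's actual implementation. The function name filter_duplicates, the rank parameter, and the Name and LastUpdated fields are hypothetical. The sketch also assumes every record carries a CollectionNumber identifying its group, and it ignores any special handling the product may apply to unmatched records.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


def filter_duplicates(
    records: List[Record],
    group_by: str = "CollectionNumber",
    limit: int = 1,
    rank: Optional[Callable[[Record], object]] = None,
) -> List[Record]:
    """Keep at most `limit` records from each group of duplicates.

    Hypothetical helper: with no `rank`, the first records of each group
    (in input order) are kept; with `rank`, higher-ranked records win.
    """
    # Collect records into groups, preserving the order groups first appear.
    groups: Dict[object, List[Record]] = {}
    for record in records:
        groups.setdefault(record[group_by], []).append(record)

    kept: List[Record] = []
    for members in groups.values():
        if rank is not None:
            # sorted() is stable, so ties keep their original input order.
            members = sorted(members, key=rank, reverse=True)
        kept.extend(members[:limit])
    return kept


# Illustrative data only; these records are made up for the example.
records = [
    {"CollectionNumber": 1, "Name": "Ann Lee", "LastUpdated": "2020-01-05"},
    {"CollectionNumber": 1, "Name": "Anne Lee", "LastUpdated": "2021-03-10"},
    {"CollectionNumber": 2, "Name": "Bob Ray", "LastUpdated": "2019-07-01"},
    {"CollectionNumber": 2, "Name": "Robert Ray", "LastUpdated": "2018-02-14"},
]

# Default behavior: keep the first record in each collection.
first_per_collection = filter_duplicates(records)

# Rule-based behavior: keep the most recently updated record instead.
newest_per_collection = filter_duplicates(
    records, rank=lambda r: r["LastUpdated"]
)
```

With the default settings the sketch keeps the first record of each collection, which mirrors leaving the limit at 1 and skipping the rule step. Passing a rank function stands in for defining a rule that chooses which record in each collection survives.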