Selecting Columns

This page displays the columns of your data in a table. Select the columns that must participate in model training and best of breed rule creation, and select the fields whose data you want to merge.

Select the Column Name check box for each column that must participate in best of breed rule creation.
To merge a field, toggle Merge to YES.

Note: Always select the check box of any field that you want to merge.
After selecting the check box of a column, select the desired semantic type for that column from the drop-down list. By default, NONE is displayed.
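
The selection rules above can be expressed as a small check. This is only an illustrative sketch, not the product's implementation: the `ColumnChoice` class, the `validation_errors` function, and the semantic type name `NAME` are hypothetical and are used here just to show that a merged field, or a field with a semantic type, must also have its check box selected.

```python
from dataclasses import dataclass


@dataclass
class ColumnChoice:
    """One row of the column table (hypothetical model of the page)."""
    name: str
    selected: bool = False        # Column Name check box
    merge: bool = False           # Merge toggle (YES = True)
    semantic_type: str = "NONE"   # NONE is the displayed default


def validation_errors(choices):
    """Return a message for every choice that breaks the rules above."""
    errors = []
    for c in choices:
        if c.merge and not c.selected:
            errors.append(f"{c.name}: Merge is YES but the column is not selected")
        if c.semantic_type != "NONE" and not c.selected:
            errors.append(f"{c.name}: semantic type set but the column is not selected")
    return errors


choices = [
    # "NAME" is a made-up semantic type used only for illustration.
    ColumnChoice("Name", selected=True, merge=True, semantic_type="NAME"),
    ColumnChoice("Phone", merge=True),  # merge requested without selection
]
errors = validation_errors(choices)
print(errors)
```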

Based on the selected columns, groups of records are generated automatically and displayed on the next page for tagging. The generated groups are chosen to cover the variations in the data. For example, if your input file contains 5000 groups, the system might display only 50 groups that cover the variations.

Note: By default, the maximum collection size is 10. Groups larger than this limit are excluded from the consolidation process.
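
As a rough illustration of the size limit, the following sketch groups records by collection number and sets aside any collection with more than 10 records. The field name `CollectionNumber` and both function names are assumptions made for this example; only the default limit of 10 comes from the note above.

```python
from collections import defaultdict

MAX_COLLECTION_SIZE = 10  # default limit stated in the note above


def group_by_collection(records, key="CollectionNumber"):
    """Group records by collection number (the key name is an assumption)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


def split_by_size(groups, max_size=MAX_COLLECTION_SIZE):
    """Separate collections within the limit from those larger than it."""
    kept = {k: v for k, v in groups.items() if len(v) <= max_size}
    excluded = {k: v for k, v in groups.items() if len(v) > max_size}
    return kept, excluded


records = (
    [{"CollectionNumber": 1, "Name": "A"} for _ in range(3)]
    + [{"CollectionNumber": 2, "Name": "B"} for _ in range(11)]
)
kept, excluded = split_by_size(group_by_collection(records))
print(sorted(kept), sorted(excluded))
```

In this sketch a collection of exactly 10 records is kept, because only groups larger than the limit are excluded.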

The purpose of generating variations is to identify a small subset of collections for tagging that covers most of the unique variations in the source data. A few representative collections are picked from the complete set so that tagging this subset produces a best of breed rule close to the one you would get by tagging the entire set.
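
The idea of a representative subset can be sketched as follows. The product's actual selection algorithm is not described here, so this example swaps in a deliberately simple technique: each collection is reduced to a "variation signature" and one collection is kept per distinct signature. The `signature` and `pick_representatives` functions, the three signature flags, and the sample field names are all hypothetical.

```python
def signature(collection, columns):
    """Summarize how the values of each column vary inside one collection."""
    sig = []
    for col in columns:
        values = [rec.get(col, "") for rec in collection]
        non_empty = [v for v in values if v]
        has_empty = len(non_empty) < len(values)
        all_same = len(set(non_empty)) <= 1
        lengths_differ = len({len(v) for v in non_empty}) > 1
        sig.append((has_empty, all_same, lengths_differ))
    return tuple(sig)


def pick_representatives(collections, columns):
    """Keep the first collection seen for each distinct signature."""
    chosen = {}
    for collection_id, recs in collections.items():
        chosen.setdefault(signature(recs, columns), collection_id)
    return sorted(chosen.values())


collections = {
    1: [{"Name": "Jon", "City": "NY"}, {"Name": "Jon", "City": "NY"}],
    2: [{"Name": "Amy", "City": "LA"}, {"Name": "Amy", "City": "LA"}],
    3: [{"Name": "Bob", "City": ""}, {"Name": "Robert", "City": "SF"}],
}
representatives = pick_representatives(collections, ["Name", "City"])
print(representatives)
```

Collections 1 and 2 vary in the same way (identical values in every column), so only one of them needs tagging, while collection 3 shows a different pattern and is kept as well.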

The variations are generated based on the operations available in the Best Of Breed (BOB) stage. The following table lists each operator and the feature on which it is based.


BOB Operator	Based on Feature
Most Common	Frequency
Longest/Shortest	Length
Highest/Lowest	Rank
Greater/Less Than	Absolute values
Equals/Not Equals	Category-specific values (values found to be specific to a category are used as the feature)
Empty/Not Empty	Frequency
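
To make the table more concrete, the sketch below computes a few of these features for the values of one column within a single collection: frequency (Most Common), length (Longest/Shortest), and the count of empty values (Empty/Not Empty). The `column_features` function and the sample values are assumptions for illustration and do not represent the product's internal feature extraction.

```python
from collections import Counter


def column_features(values):
    """Compute simple per-column features for one collection of string values."""
    non_empty = [v for v in values if v not in (None, "")]
    counts = Counter(non_empty)
    return {
        # Frequency: the value that occurs most often
        "most_common": counts.most_common(1)[0][0] if counts else None,
        # Length: the longest and shortest non-empty values
        "longest": max(non_empty, key=len) if non_empty else None,
        "shortest": min(non_empty, key=len) if non_empty else None,
        # Emptiness: how many values are missing
        "empty_count": len(values) - len(non_empty),
    }


values = ["Jon Smith", "Jonathan Smith", "", "Jon Smith"]
features = column_features(values)
print(features)
```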

Note: The mandatory Collection number field is selected by default and cannot be cleared. The collection number identifies the duplicate records in a match queue: a candidate that is found to be a duplicate is assigned a collection number.