Input Parameters

Parameter Description
Group-By Option For a MapReduce job, pass these arguments:
GroupBy Column
The name of the column by which the records are to be grouped.
Number of Reducer Tasks
The number of reducer tasks required to group the records.
For a Spark job, pass this argument to create a Group-By option:
GroupBy Column
The name of the column by which the records are to be grouped.
Match Rule Define as many parent and child rules as required to create a MatchRule object.

For more information, see MatchRule.

Candidate File For text files:
File Path
The path of the candidate text file on the Hadoop platform.
Record Separator
The record separator used in the candidate file.
Field Separator
The separator used between any two consecutive fields of a record, in the candidate file.
Text Qualifier
The character used to surround text values in a delimited file.
Header Row Fields
An array of the header fields of the candidate file.
Skip First Row
Flag to indicate whether the first row must be skipped while reading the candidate file records.

This must be true if the first row is a header row.

Attention: Invoke the appropriate constructor of FilePath.
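The way the text file parameters interact can be sketched in plain Java. This is an illustration only, not the SDK's reader: the class DelimitedLineSketch and its methods are hypothetical names, and the sketch assumes that a text qualifier never appears doubled inside a value and that the record separator never occurs inside qualified text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the SDK: shows how the record separator,
// field separator, text qualifier, and skip-first-row flag affect parsing.
class DelimitedLineSketch {

    // Splits one record into fields. A field separator that appears between
    // two text qualifiers is kept as part of the value.
    static List<String> splitFields(String line, char fieldSeparator, char textQualifier) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQualifiedText = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == textQualifier) {
                insideQualifiedText = !insideQualifiedText; // qualifier itself is dropped
            } else if (c == fieldSeparator && !insideQualifiedText) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Splits the content into records and optionally skips the first (header) row.
    static List<List<String>> readRecords(String content, String recordSeparator,
                                          char fieldSeparator, char textQualifier,
                                          boolean skipFirstRow) {
        List<List<String>> records = new ArrayList<>();
        String[] lines = content.split(Pattern.quote(recordSeparator));
        for (int i = skipFirstRow ? 1 : 0; i < lines.length; i++) {
            if (!lines[i].isEmpty()) {
                records.add(splitFields(lines[i], fieldSeparator, textQualifier));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "name,city\n\"Smith, John\",Pune\nAnn,Delhi";
        System.out.println(readRecords(content, "\n", ',', '"', true));
    }
}
```

With Skip First Row set to true, the header line is not returned as a record, and the comma inside the qualified value does not split the field.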
For ORC format files:
ORC File Path
The path of the input ORC format file on the Hadoop platform.
Important: The suspect and candidate files must be of the same format: either both text files or both ORC format files.
Common parameters:
Field Mappings
A map of key-value pairs, with the existing column names as the keys and the desired output column names as the values.
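As a rough illustration of the shape of a field mapping, the following sketch builds such a map and applies it to one record. The class FieldMappingSketch, the method applyMappings, and the column names are hypothetical. Keeping unmapped columns under their existing names is an assumption made for the sketch, not documented SDK behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the SDK: renames the columns of a record
// using a map of existing column name -> desired output column name.
class FieldMappingSketch {

    static Map<String, String> applyMappings(Map<String, String> record,
                                             Map<String, String> fieldMappings) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : record.entrySet()) {
            // Assumption: a column without a mapping keeps its existing name.
            String outputName = fieldMappings.getOrDefault(field.getKey(), field.getKey());
            renamed.put(outputName, field.getValue());
        }
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, String> fieldMappings = new LinkedHashMap<>();
        fieldMappings.put("fname", "FirstName"); // existing name -> output name
        fieldMappings.put("lname", "LastName");

        Map<String, String> record = new LinkedHashMap<>();
        record.put("fname", "Ann");
        record.put("lname", "Rao");
        record.put("city", "Pune");

        System.out.println(applyMappings(record, fieldMappings));
    }
}
```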
Suspect File For text files:
File Path
The path of the suspect text file on the Hadoop platform.
Record Separator
The record separator used in the suspect file.
Field Separator
The separator used between any two consecutive fields of a record, in the suspect file.
Text Qualifier
The character used to surround text values in a delimited file.
Header Row Fields
An array of the header fields of the suspect file.
Skip First Row
Flag to indicate whether the first row must be skipped while reading the suspect file records.

This must be true if the first row is a header row.

Attention: Invoke the appropriate constructor of FilePath.
For ORC format files:
ORC File Path
The path of the input ORC format file on the Hadoop platform.
Common parameters:
Field Mappings
A map of key-value pairs, with the existing column names as the keys and the desired output column names as the values.
Output File For text files:
File Path
The path of the output text file on the Hadoop platform.
Field Separator
The separator used between any two consecutive fields of a record, in the output file.
Attention: Invoke the appropriate constructor of FilePath.
For ORC format files:
ORC File Path
The path of the output ORC format file on the Hadoop platform.
For Parquet format files:
Parquet File Path
The path of the output Parquet format file on the Hadoop platform.
Common Parameters:
Overwrite
Flag to indicate whether the output file must overwrite any existing file of the same name.
Create Output Header
Flag to indicate whether a header file is to be created on the Hadoop server.
Job Configurations The Hadoop configurations for the job.

For a MapReduce job, the instance must be of type MRJobConfig. For a Spark job, the instance must be of type SparkJobConfig.

Match Key Settings A combination of the columns and the algorithms to be applied to them to generate the match key that is required to perform the matching.
Note: Specify only one match key.
Attention: Set the match key settings only if you wish to generate a match key before performing the matching.
Job Name The name of the job.
Express Match Column The name of the column to be used for express matching of records.
Setting Collection Number Zero to Unique Records Set this to true to set the collection number of unique records as 0 (zero).
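The effect of this flag can be pictured with a small sketch. It is an illustration under stated assumptions, not the SDK's logic: the class CollectionNumberSketch is hypothetical, matched records are assumed to arrive already grouped, and the behavior when the flag is false (a unique record receives its own non-zero collection number) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the SDK: assigns a collection number to
// each record id, given groups of records that matched each other.
class CollectionNumberSketch {

    static Map<String, Integer> assign(List<List<String>> groups, boolean zeroForUniqueRecords) {
        Map<String, Integer> collectionNumbers = new LinkedHashMap<>();
        int next = 1;
        for (List<String> group : groups) {
            if (group.size() == 1 && zeroForUniqueRecords) {
                // A unique record gets collection number 0 when the flag is true.
                collectionNumbers.put(group.get(0), 0);
            } else {
                // Assumption: every other group gets the next positive number.
                for (String recordId : group) {
                    collectionNumbers.put(recordId, next);
                }
                next++;
            }
        }
        return collectionNumbers;
    }
}
```

With the flag set to true, records that share a non-zero collection number are duplicates of each other, and every record with collection number 0 is unique.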
Comparison Option Allows you to select one of the two options:
  • Compare the Suspect record to all Candidate records: Specify whether unique records must be returned in the output or not.
  • Compare the Suspect record to the selected Candidate record only: Specify the maximum number of duplicate records to be searched and returned.
Compress Output Flag to indicate if the output must be compressed.

Set this to true to compress the output.