Using an Interflow Match MapReduce Job

  1. Create an instance of AdvanceMatchFactory, using its static method getInstance().
  2. Provide the input and output details for the Interflow Match job by creating an instance of InterMatchDetail specifying the ProcessType. The instance must use the type MRProcessType.
    1. Specify the column by which the records are to be grouped by creating an instance of GroupbyOption.
      Use an instance of GroupbyMROption to specify the group-by column and the number of reducers required.
    2. Generate the matching rules for the job by creating an instance of MatchRule.
    3. Create an instance of InterMatchDetail by passing an instance of type JobConfig, the GroupbyOption instance, and the MatchRule instance created in the previous steps as arguments to its constructor.
      The JobConfig parameter must be an instance of type MRJobConfig.
    4. Set the details of the candidate file using the candidateFilePath field of the InterMatchDetail instance.
      For a text candidate file, create an instance of FilePath with the relevant details of the candidate file by invoking the appropriate constructor. For an ORC candidate file, create an instance of OrcFilePath with the path of the ORC candidate file as the argument.
    5. Set the details of the suspect file using the suspectFilePath field of the InterMatchDetail instance.
      For a text suspect file, create an instance of FilePath with the relevant details of the suspect file by invoking the appropriate constructor. For an ORC suspect file, create an instance of OrcFilePath with the path of the ORC suspect file as the argument. For a Parquet suspect file, create an instance of ParquetFilePath with the path of the Parquet suspect file as the argument.
      Important: The suspect and candidate files must be of the same format: either both text files or both ORC files.
    6. Set the details of the output file using the outputPath field of the InterMatchDetail instance.
      • For a text output file, create an instance of FilePath with the relevant details of the output file by invoking the appropriate constructor.
      • For an ORC output file, create an instance of OrcFilePath with the path of the ORC output file as the argument.
      • For a Parquet output file, create an instance of ParquetFilePath with the path of the Parquet output file as the argument.
    7. Set the name of the job using the jobName field of the InterMatchDetail instance.
    8. Set the Express Match Column using the expressMatchColumn field of the InterMatchDetail instance, if required.
    9. Set the flag collectionNumberZerotoUniqueRecords of the InterMatchDetail instance to true to allocate the collection number 0 (zero) to unique records. The default is true.
      If you do not wish to allocate the collection number zero to unique records, set this flag to false.
    10. Set the comparison option using the comparisonOption field of the InterMatchDetail instance. Use the class InterMatchComparisonOption to select one of two options:
      • Compare the Suspect record to all Candidate records: Specify whether unique records must be returned in the output or not.
      • Compare the Suspect record to the selected Candidate record only: Specify the maximum number of duplicate records to be searched and returned.
    11. Set the compressOutput flag of the InterMatchDetail instance to true to compress the output of the job.
    12. If the input data does not have match keys, specify the match key settings so that the Match Key Generator job runs first and generates the match keys before the Interflow Match job runs.
      To do so, create and configure an instance of MatchKeySettings, and set it using the matchKeySettings field of the InterMatchDetail instance.
      Note: For details on configuring match key settings, see the code samples.
  3. To create a MapReduce job, invoke the createJob() method of the previously created AdvanceMatchFactory instance, passing the InterMatchDetail instance as an argument.
    The createJob() method creates the job and returns a List of instances of ControlledJob.
  4. Run the created job using an instance of JobControl.
  5. To display the reporting counters after a successful MapReduce job run, invoke the getCounters() method of the previously created AdvanceMatchFactory instance, passing the created job as an argument.
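
The configuration part of the procedure (step 2) can be sketched roughly as follows. This is an illustration only, not the SDK's code: every nested class below is a local stand-in that borrows the names used in the steps, and the constructor signatures, the use of public fields, the format tag on FilePath, the checkInputFormats helper, the column name, the job name, and all paths are assumptions made for this example. Steps 3 to 5 (createJob(), JobControl, getCounters()) need the SDK and a Hadoop cluster and are therefore only indicated in a comment.

```java
/** Illustrative sketch only: each nested type is a local stand-in, not the SDK class. */
public class InterflowMatchSketch {

    // Stand-in for the SDK path types (FilePath, OrcFilePath, ParquetFilePath).
    static class FilePath {
        final String path;
        final String format; // hypothetical tag: "text", "orc", or "parquet"
        FilePath(String path, String format) {
            this.path = path;
            this.format = format;
        }
    }

    // Stand-in for GroupbyMROption: group-by column plus number of reducers.
    static class GroupbyMROption {
        final String column;
        final int reducers;
        GroupbyMROption(String column, int reducers) {
            this.column = column;
            this.reducers = reducers;
        }
    }

    // Empty stand-ins; the real MatchRule and MRJobConfig carry the actual settings.
    static class MatchRule { }
    static class MRJobConfig { }

    // Stand-in for InterMatchDetail, using the field names listed in the steps.
    static class InterMatchDetail {
        final MRJobConfig jobConfig;
        final GroupbyMROption groupbyOption;
        final MatchRule matchRule;
        FilePath candidateFilePath;
        FilePath suspectFilePath;
        FilePath outputPath;
        String jobName;
        boolean collectionNumberZerotoUniqueRecords = true; // default stated in step 2.9
        boolean compressOutput; // SDK default is not stated in the steps
        InterMatchDetail(MRJobConfig jobConfig, GroupbyMROption groupbyOption, MatchRule matchRule) {
            this.jobConfig = jobConfig;
            this.groupbyOption = groupbyOption;
            this.matchRule = matchRule;
        }
    }

    // Hypothetical pre-flight check: suspect and candidate files must share one format.
    static void checkInputFormats(InterMatchDetail detail) {
        String suspect = detail.suspectFilePath.format;
        String candidate = detail.candidateFilePath.format;
        if (!suspect.equals(candidate)) {
            throw new IllegalArgumentException(
                "Suspect format '" + suspect + "' differs from candidate format '" + candidate + "'");
        }
    }

    // Mirrors steps 2.1 to 2.11 with made-up values.
    static InterMatchDetail buildDetail() {
        GroupbyMROption groupBy = new GroupbyMROption("MatchKey", 4);
        InterMatchDetail detail = new InterMatchDetail(new MRJobConfig(), groupBy, new MatchRule());
        detail.candidateFilePath = new FilePath("/user/hadoop/interflow/candidate", "text");
        detail.suspectFilePath = new FilePath("/user/hadoop/interflow/suspect", "text");
        detail.outputPath = new FilePath("/user/hadoop/interflow/output", "text");
        detail.jobName = "InterflowMatchSample";
        detail.collectionNumberZerotoUniqueRecords = true;
        detail.compressOutput = true;
        return detail;
    }

    public static void main(String[] args) {
        InterMatchDetail detail = buildDetail();
        checkInputFormats(detail);
        System.out.println("Configured job: " + detail.jobName);
        // Steps 3 to 5 would follow here with the real SDK: pass the detail to
        // createJob(), run the returned jobs with JobControl, then read the counters.
    }
}
```

Running the format check before the job is created surfaces a mismatch between suspect and candidate formats early, rather than after the job has been submitted to the cluster.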