Using an Intraflow Match Spark Job

  1. Create an instance of AdvanceMatchFactory, using its static method getInstance().
  2. Provide the input and output details for the Intraflow Match job by creating an instance of IntraMatchDetail, specifying the ProcessType as SparkProcessType.
    1. Specify the column on which the records are to be grouped by creating an instance of GroupbyOption.
      Use an instance of GroupbySparkOption to specify the group-by column.
    2. Generate the matching rules for the job by creating an instance of MatchRule.
    3. Create an instance of IntraMatchDetail by passing an instance of type JobConfig, the GroupbyOption instance, and the MatchRule instance created above as arguments to its constructor.
      The JobConfig parameter must be an instance of type SparkJobConfig.
    4. Set the details of the input file using the inputPath field of the IntraMatchDetail instance.
      • For a text input file, create an instance of FilePath with the relevant details of the input file by invoking the appropriate constructor.
      • For an ORC input file, create an instance of OrcFilePath with the path of the ORC input file as the argument.
      • For a Parquet input file, create an instance of ParquetFilePath with the path of the Parquet input file as the argument.
    5. Set the details of the output file using the outputPath field of the IntraMatchDetail instance.
      • For a text output file, create an instance of FilePath with the relevant details of the output file by invoking the appropriate constructor.
      • For an ORC output file, create an instance of OrcFilePath with the path of the ORC output file as the argument.
      • For a Parquet output file, create an instance of ParquetFilePath with the path of the Parquet output file as the argument.
    6. Set the name of the job using the jobName field of the IntraMatchDetail instance.
    7. Set the Express Match Column using the expressMatchColumn field of the IntraMatchDetail instance, if required.
    8. Set the flag collectionNumberZerotoUniqueRecords of the IntraMatchDetail instance to true to allocate the collection number 0 (zero) to a unique record. The default is true.
      If you do not wish to allocate the collection number zero to unique records, set this flag to false.
    9. Set the compressOutput flag of the IntraMatchDetail instance to true to compress the output of the job.
    10. If the input data does not have match keys, you must specify the match key settings so that the Match Key Generator job runs first to generate the match keys before the Intraflow Match job runs.
      To do so, create and configure an instance of MatchKeySettings, then set it using the matchKeySettings field of the IntraMatchDetail instance.
      Note: To see how to set match key settings, see the code samples.
  3. To create and run the Spark job, use the previously created instance of AdvanceMatchFactory to invoke its method runSparkJob(), passing the above IntraMatchDetail instance as the argument.
    The runSparkJob() method runs the job and returns a Map of the reporting counters of the job.
  4. Display the counters to view the reporting statistics for the job.
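The steps above can be sketched as a single Java driver. This is a minimal sketch only: the class and field names (AdvanceMatchFactory, IntraMatchDetail, GroupbySparkOption, SparkJobConfig, FilePath) come from the steps above, but the constructor arguments, the group-by column name, and the file paths shown here are illustrative assumptions; consult the SDK's code samples for the exact signatures and for configuring MatchRule and MatchKeySettings.

```java
import java.util.Map;

public class IntraflowMatchJobSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: obtain the factory via its static getInstance() method.
        AdvanceMatchFactory factory = AdvanceMatchFactory.getInstance();

        // Step 2.1: group-by column (the column name is an assumption).
        GroupbyOption groupBy = new GroupbySparkOption("LastName");

        // Step 2.2: matching rules (detailed configuration omitted;
        // see the SDK code samples).
        MatchRule matchRule = new MatchRule();

        // Step 2.3: the JobConfig must be a SparkJobConfig instance.
        SparkJobConfig jobConfig = new SparkJobConfig();
        IntraMatchDetail detail =
            new IntraMatchDetail(jobConfig, groupBy, matchRule);

        // Steps 2.4-2.5: text input/output shown here (paths are
        // illustrative); use OrcFilePath or ParquetFilePath for
        // ORC or Parquet data instead.
        detail.inputPath = new FilePath("/data/match_input.txt");
        detail.outputPath = new FilePath("/data/match_output");

        // Steps 2.6-2.9: job name and optional flags.
        detail.jobName = "IntraflowMatchJob";
        detail.collectionNumberZerotoUniqueRecords = true; // default
        detail.compressOutput = true;

        // Step 3: run the job; a Map of reporting counters is returned.
        Map<String, Long> counters = factory.runSparkJob(detail);

        // Step 4: display the reporting statistics.
        counters.forEach((name, value) ->
            System.out.println(name + " = " + value));
    }
}
```

If the input lacks match keys, additionally create a MatchKeySettings instance and assign it to detail.matchKeySettings before invoking runSparkJob(), as described in step 2.10.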