Using a Table Lookup MapReduce Job

  1. Create an instance of DataNormalizationFactory by calling its static method getInstance().
  2. Provide the input and output details for the Table Lookup job by creating an instance of TableLookupDetail that specifies the ProcessType. For a MapReduce job, the instance must use the type MRProcessType.
    1. Configure the table lookup rules by creating an instance of TableLookupConfiguration.
      Within this instance, add an instance of type AbstractTableLookupRule. This AbstractTableLookupRule instance must be defined using one of these classes: Standardize, Categorize or Identify, corresponding to the desired table lookup rule category.
    2. Set the Reference Data path and location type by creating an instance of ReferenceDataPath. For the location types, see Enum ReferenceDataPathLocation.
    3. Create an instance of TableLookupDetail by passing to its constructor an instance of type JobConfig, together with the TableLookupConfiguration and ReferenceDataPath instances created earlier.
      The JobConfig parameter must be an instance of type MRJobConfig.
    4. Set the details of the input file using the inputPath field of the TableLookupDetail instance.
      • For a text input file, create an instance of FilePath with the relevant details of the input file by invoking the appropriate constructor.
      • For an ORC input file, create an instance of OrcFilePath with the path of the ORC input file as the argument.
      • For a Parquet input file, create an instance of ParquetFilePath with the path of the Parquet input file as the argument.
    5. Set the details of the output file using the outputPath field of the TableLookupDetail instance.
      • For a text output file, create an instance of FilePath with the relevant details of the output file by invoking the appropriate constructor.
      • For an ORC output file, create an instance of OrcFilePath with the path of the ORC output file as the argument.
      • For a Parquet output file, create an instance of ParquetFilePath with the path of the Parquet output file as the argument.
    6. Set the name of the job using the jobName field of the TableLookupDetail instance.
    7. Set the compressOutput flag of the TableLookupDetail instance to true to compress the output of the job.
  3. To create a MapReduce job, call the createJob() method of the previously created DataNormalizationFactory instance, passing the TableLookupDetail instance configured above as the argument.
    The createJob() method returns a List of instances of ControlledJob.
  4. Run the created job using an instance of JobControl.
  5. To display the reporting counters after a successful MapReduce job run, call the getCounters() method of the previously created DataNormalizationFactory instance, passing the created job as an argument.
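
The order of calls in steps 1 to 3 can be sketched as follows. This is only an illustration, not the SDK's actual API: every class in the block is a hypothetical stand-in declared inside the sketch so that it compiles without the SDK or Hadoop on the classpath. The constructor signatures, the addRule() method, the HDFS enum constant, the OrcFilePath inheritance, and all paths and names are assumptions inferred from the steps above. ProcessType and getCounters() are left out because their shapes are not described here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TableLookupJobSketch {

    // ---- Hypothetical stand-ins. Shapes are guessed from the steps, not taken from the SDK. ----
    static abstract class AbstractTableLookupRule {}
    static class Standardize extends AbstractTableLookupRule {}  // real constructor arguments unknown

    static class TableLookupConfiguration {
        final List<AbstractTableLookupRule> rules = new ArrayList<>();
        void addRule(AbstractTableLookupRule rule) { rules.add(rule); }  // method name assumed
    }

    enum ReferenceDataPathLocation { HDFS }  // constant name assumed

    static class ReferenceDataPath {
        final String path;
        final ReferenceDataPathLocation location;
        ReferenceDataPath(String path, ReferenceDataPathLocation location) {
            this.path = path;
            this.location = location;
        }
    }

    static class MRJobConfig {}  // stands in for the MapReduce JobConfig type

    static class FilePath {
        final String path;
        FilePath(String path) { this.path = path; }
    }
    static class OrcFilePath extends FilePath {  // inheritance is an assumption
        OrcFilePath(String path) { super(path); }
    }

    static class TableLookupDetail {
        final MRJobConfig jobConfig;
        final TableLookupConfiguration configuration;
        final ReferenceDataPath referenceDataPath;
        FilePath inputPath;
        FilePath outputPath;
        String jobName;
        boolean compressOutput;
        TableLookupDetail(MRJobConfig jobConfig, TableLookupConfiguration configuration,
                          ReferenceDataPath referenceDataPath) {
            this.jobConfig = jobConfig;
            this.configuration = configuration;
            this.referenceDataPath = referenceDataPath;
        }
    }

    static class ControlledJob {  // stands in for Hadoop's ControlledJob
        final String name;
        ControlledJob(String name) { this.name = name; }
    }

    static class DataNormalizationFactory {
        private static final DataNormalizationFactory INSTANCE = new DataNormalizationFactory();
        static DataNormalizationFactory getInstance() { return INSTANCE; }
        // The stand-in returns one placeholder job; the real method builds MapReduce jobs.
        List<ControlledJob> createJob(TableLookupDetail detail) {
            return Collections.singletonList(new ControlledJob(detail.jobName));
        }
    }

    // Steps 2.1 to 2.7: rules, reference data, detail, input, output, job name, compression.
    static TableLookupDetail buildDetail() {
        TableLookupConfiguration configuration = new TableLookupConfiguration();
        configuration.addRule(new Standardize());
        ReferenceDataPath referenceData =
                new ReferenceDataPath("/placeholder/reference-data", ReferenceDataPathLocation.HDFS);
        TableLookupDetail detail =
                new TableLookupDetail(new MRJobConfig(), configuration, referenceData);
        detail.inputPath = new OrcFilePath("/placeholder/input.orc");
        detail.outputPath = new OrcFilePath("/placeholder/output");
        detail.jobName = "TableLookupSample";
        detail.compressOutput = true;
        return detail;
    }

    public static void main(String[] args) {
        DataNormalizationFactory factory = DataNormalizationFactory.getInstance();  // step 1
        TableLookupDetail detail = buildDetail();                                   // step 2
        List<ControlledJob> jobs = factory.createJob(detail);                       // step 3
        System.out.println("Created " + jobs.size() + " job(s) for " + detail.jobName);
    }
}
```

In a real driver, step 4 would typically hand the returned list to Hadoop's JobControl: add the jobs with addJobCollection(), start the JobControl on its own thread (it implements Runnable), poll allFinished(), and then call stop(). Check the SDK reference for the actual constructors and field types before adapting this sketch.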