Using a Joiner Spark Job

  1. Create an instance of DataIntegrationFactory by using its static method getInstance().
  2. Provide the input and output details for the job in a JoinDetail instance, specifying the ProcessType as SparkProcessType. Use these steps to create and configure the JoinDetail instance:
    1. Create an instance of JoinDetail by specifying the ProcessType as SparkProcessType and using the default configurations.
    2. Create a separate FilePath instance for each input file, and configure these input file details for each: recordSeparator (use the Enum RecordSeparator), fieldSeparator, textQualifier, and fileHeader (specify whether the first row is to be skipped).
      Note:
      • For a text input file, create an instance of FilePath with the relevant details of the input file by invoking the appropriate constructor.
      • For an ORC input file, create an instance of OrcFilePath with the path of the ORC input file as the argument.
      • For a Parquet input file, create an instance of ParquetFilePath with the path of the Parquet input file as the argument.
    3. In the JoinDetail instance created in step 1 above, configure these details:
      • InputPaths: Pass the FilePath instances created and configured above.
      • LeftInput: Specify the left input for the join operation.
      • JobName: Name of the job.
      • JoinType: Use the Enum JoinDetail.JoinType to define the join type.
      • JoinColumns: Specify the input columns to be joined, as comma-separated values.
      • OutputPath: Use the setOutputPath method to set the output path of the job, specifying whether the output file is to be overwritten and whether a header is to be created.
  3. To create the Spark job, use the previously created DataIntegrationFactory instance to invoke its runSparkJob() method, passing the JoinDetail instance as the argument.
    The runSparkJob() method creates the job and returns a map of ControlledJob instances. A consolidated sketch of these steps is shown below.
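
The following sketch consolidates the steps above into a single Java example. It is illustrative rather than authoritative: the SDK package imports are omitted, and the constructor arguments, enum constants (RecordSeparator.UNIX, JoinType.INNER), setter names, and the key type of the returned map are assumptions inferred from the step descriptions, so consult the SDK Javadoc for the exact signatures. Only DataIntegrationFactory.getInstance(), runSparkJob(), and the ControlledJob return type are taken directly from the procedure above.

  import java.util.Arrays;
  import java.util.Map;
  import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
  // SDK classes (DataIntegrationFactory, JoinDetail, SparkProcessType,
  // FilePath, RecordSeparator) are assumed to be imported from the SDK.

  public class JoinerSparkJobSketch {
      public static void main(String[] args) throws Exception {
          // Step 1: obtain the factory through its static getInstance() method.
          DataIntegrationFactory factory = DataIntegrationFactory.getInstance();

          // Step 2.1: create the JoinDetail with the ProcessType set to
          // SparkProcessType, keeping the default configurations.
          JoinDetail joinDetail = new JoinDetail(new SparkProcessType());

          // Step 2.2: one FilePath per text input file. The argument order
          // (path, record separator, field separator, text qualifier,
          // skip-first-row flag) is an assumption.
          FilePath customers = new FilePath("/data/customers.txt",
                  RecordSeparator.UNIX, ",", "\"", true);
          FilePath orders = new FilePath("/data/orders.txt",
                  RecordSeparator.UNIX, ",", "\"", true);
          // For ORC or Parquet inputs, only the path is passed:
          //   new OrcFilePath("/data/orders.orc")
          //   new ParquetFilePath("/data/orders.parquet")

          // Step 2.3: configure the job-level details on the JoinDetail.
          joinDetail.setInputPaths(Arrays.asList(customers, orders));
          joinDetail.setLeftInput("customers");       // left side of the join
          joinDetail.setJobName("JoinerSparkJob");
          joinDetail.setJoinType(JoinDetail.JoinType.INNER);
          joinDetail.setJoinColumns("customerId,customerId"); // comma-separated
          joinDetail.setOutputPath("/data/joined", true, true); // path, overwrite, create header

          // Step 3: create the Spark job; runSparkJob() returns a map of
          // ControlledJob instances (String keys assumed here).
          Map<String, ControlledJob> jobs = factory.runSparkJob(joinDetail);
          for (ControlledJob job : jobs.values()) {
              System.out.println("Created job: " + job.getJobName());
          }
      }
  }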