Using a Validate Address Global Spark Job

  1. Create an instance of GlobalAddressingFactory, using its static method getInstance().
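  A minimal sketch of this step, assuming the SDK classes are on the classpath (the package in the import is a guess; use the import from your SDK's javadocs):

  ```java
  // Hypothetical import; adjust the package to match your SDK version.
  import com.pb.bdq.api.globaladdressing.GlobalAddressingFactory;

  // Obtain the singleton factory used to build and run Global Addressing jobs.
  GlobalAddressingFactory factory = GlobalAddressingFactory.getInstance();
  ```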
  2. Provide the input and output details for the Validate Address Global job by creating an instance of GlobalAddressingDetail that specifies the ProcessType. The instance must use the type SparkProcessType. To create it:
    1. Configure the JVM initialization settings by creating an instance of GlobalAddressingGeneralConfiguration.
    2. Set the details of the reference data path by creating an instance of ReferenceDataPath. See the enum ReferenceDataPathLocation.
    3. Configure the necessary database settings by creating an instance of GlobalAddressingEngineConfiguration, passing the ReferenceDataPath instance created above as an argument. A combined sketch of sub-steps 1 through 3 appears after the list below.
      1. Set the preloading type in this instance using the enum PreloadingType.
      2. Set the database type using the enum DatabaseType.
      3. Set the supported countries using the enum CountryCodes.
      4. If all countries are supported, set the isAllCountries attribute to true. Otherwise, specify a comma-separated list of CountryCodes values in the supportedCountries String field.
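    A combined sketch of sub-steps 1 through 3. The constructor arguments, setter names, and enum values shown here are assumptions modeled on the class and enum names above; verify them against the SDK javadocs:

    ```java
    // SDK imports omitted; all of these classes ship with the SDK.

    // Sub-step 1: JVM initialization settings.
    GlobalAddressingGeneralConfiguration generalConfiguration =
        new GlobalAddressingGeneralConfiguration();

    // Sub-step 2: location of the reference data (the path and the enum
    // value are illustrative).
    ReferenceDataPath referenceDataPath =
        new ReferenceDataPath("/user/gam/referenceData", ReferenceDataPathLocation.HDFS);

    // Sub-step 3: database settings, built around the reference data path.
    GlobalAddressingEngineConfiguration engineConfiguration =
        new GlobalAddressingEngineConfiguration(referenceDataPath);
    engineConfiguration.setPreloadingType(PreloadingType.NONE);          // assumed setter/value
    engineConfiguration.setDatabaseType(DatabaseType.BATCH_INTERACTIVE); // assumed value

    // Either process all supported countries...
    engineConfiguration.setAllCountries(true);
    // ...or restrict the job to a comma-separated list of CountryCodes values:
    // engineConfiguration.setSupportedCountries(CountryCodes.DEU + "," + CountryCodes.FRA);
    ```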
    4. Configure the input settings by creating an instance of GlobalAddressingInputConfiguration.
      To set the values of the various fields of this instance, use the enums CountryCodes, StateProvinceType, CountryType, PreferredScript, PreferredLanguage, Casing, OptimizationLevel, Mode, and MatchingScope, as applicable, as shown in the sketch below.
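    A sketch of the input configuration; the setter names below are assumptions that mirror the listed enums, and the chosen values are illustrative:

    ```java
    GlobalAddressingInputConfiguration inputConfiguration =
        new GlobalAddressingInputConfiguration();
    // Setter names and enum values are assumptions; consult the SDK javadocs.
    inputConfiguration.setDefaultCountry(CountryCodes.USA);
    inputConfiguration.setCasing(Casing.MIXED);
    inputConfiguration.setMode(Mode.BATCH);
    inputConfiguration.setMatchingScope(MatchingScope.ALL);
    inputConfiguration.setOptimizationLevel(OptimizationLevel.STANDARD);
    inputConfiguration.setPreferredLanguage(PreferredLanguage.ENGLISH);
    inputConfiguration.setPreferredScript(PreferredScript.LATIN);
    ```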
    5. Set the unlock key for the data as a String value in a List. A combined sketch of sub-steps 5 and 6 appears after sub-step 6 below.
    6. Create an instance of GlobalAddressingDetail, by passing an instance of type Config, the List of unlock code values, the GlobalAddressingEngineConfiguration instance, and the GlobalAddressingInputConfiguration instance created earlier as the arguments to its constructor.

      The Config parameter must be an instance of type SparkJobConfig.

      The value of GROUPBY_REGION in this parameter is true by default. With this setting, the job processes addresses only for those regions whose reference data you have added. For example, input addresses from Germany are processed only if the reference data for Germany is placed on HDFS.
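      For illustration, a minimal sketch of this parameter; the commented-out override of GROUPBY_REGION is hypothetical, so check the SparkJobConfig reference for the actual mechanism:

      ```java
      // SparkJobConfig is the Config implementation used for Spark jobs.
      Config config = new SparkJobConfig();
      // GROUPBY_REGION defaults to true; the property-style override below is
      // an assumption, not a documented call.
      // config.setProperty("GROUPBY_REGION", "true");
      ```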

      1. Set the JVM initialization configuration by setting the generalConfiguration field of the GlobalAddressingDetail instance to the GlobalAddressingGeneralConfiguration instance created above.
      2. Set the details of the input file using the inputPath field of the GlobalAddressingDetail instance.
        Note:
        • For a text input file, create an instance of FilePath with the relevant details of the input file by invoking the appropriate constructor.
        • For an ORC input file, create an instance of OrcFilePath with the path of the ORC input file as the argument.
        • For a Parquet input file, create an instance of ParquetFilePath with the path of the Parquet input file as the argument.
      3. Set the details of the output file using the outputPath field of the GlobalAddressingDetail instance.
        Note:
        • For a text output file, create an instance of FilePath with the relevant details of the output file by invoking the appropriate constructor.
        • For an ORC output file, create an instance of OrcFilePath with the path of the ORC output file as the argument.
        • For a Parquet output file, create an instance of ParquetFilePath with the path of the Parquet output file as the argument.

      4. Set the name of the job using the jobName field of the GlobalAddressingDetail instance.
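    A combined sketch of sub-steps 5 and 6, reusing the objects built above. The setter names and file paths are assumptions; the constructor argument order follows the description in sub-step 6:

    ```java
    import java.util.Arrays;
    import java.util.List;

    // Sub-step 5: the data unlock key(s), as Strings in a List.
    List<String> unlockCodes = Arrays.asList("<your unlock key>");

    // Sub-step 6: assemble the job detail from the pieces created earlier.
    GlobalAddressingDetail detail = new GlobalAddressingDetail(
        config, unlockCodes, engineConfiguration, inputConfiguration);

    // 6.1: attach the JVM initialization settings (assumed setter name).
    detail.setGeneralConfiguration(generalConfiguration);

    // 6.2: input file -- FilePath for text, OrcFilePath for ORC,
    // ParquetFilePath for Parquet (the paths here are illustrative).
    detail.setInputPath(new OrcFilePath("/user/gam/input/addresses.orc"));

    // 6.3: output file, using the path type that matches the desired format.
    detail.setOutputPath(new OrcFilePath("/user/gam/output"));

    // 6.4: a name for the Spark job.
    detail.setJobName("ValidateAddressGlobalSample");
    ```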
  3. To create and run the Spark job, use the previously created instance of GlobalAddressingFactory to invoke its runSparkJob() method, passing the GlobalAddressingDetail instance created above as the argument (see the sketch after step 4 below).
    The runSparkJob() method runs the job and returns a Map of the job's reporting counters.
  4. Display the counters to view the reporting statistics for the job.
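  A sketch of steps 3 and 4, running the job and printing the returned counters; the Map's value type (Long here) is an assumption:

  ```java
  import java.util.Map;

  // Step 3: build and run the Spark job; the returned Map holds the
  // job's reporting counters.
  Map<String, Long> counters = factory.runSparkJob(detail);

  // Step 4: display the reporting statistics.
  counters.forEach((name, value) -> System.out.println(name + " = " + value));
  ```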