Components of a Hive Function

The key components required to run a Spectrum™ Data & Address Quality for Big Data SDK Hive UDF are:
JAR File
The Spectrum™ Data & Address Quality for Big Data SDK Hive JAR file of the module to which the desired Data Quality Hive UDF belongs. The JAR must be registered in the Hive session before any of its UDFs can be used.
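For example, the module JAR can be registered from the Hive session with the ADD JAR command; the path and file name below are hypothetical placeholders, not the actual JAR name:

  ADD JAR /path/to/spectrum-bigdata-module.jar;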
Job UDF / UDAF
Each Data Quality job is provided as either a User Defined Function (UDF) or a User Defined Aggregation Function (UDAF).
Alias
The alias assigned to a Hive UDF. This is optional.
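For example, a UDF can be registered under an alias with CREATE TEMPORARY FUNCTION; the alias and the fully qualified class name below are hypothetical placeholders for the desired Data Quality UDF class:

  CREATE TEMPORARY FUNCTION matchudf AS 'com.example.dataquality.SampleMatchUDF';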
Configurations
The rules, specified in JSON format, and other configuration details that determine how the job is run.
Reference Data
The reference data can reside on the Hadoop Distributed File System (HDFS) or locally on the cluster machines.
On HDFS, the reference data can be in either of these formats:
  • As files
  • As an archive
For local placement, the reference data must be present on every node of the cluster at the same path. An example of placing reference data on HDFS is shown below.
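As an illustration, reference data can be copied to HDFS from within a Hive session using the dfs command; the local and HDFS paths below are hypothetical:

  dfs -mkdir -p /user/hive/referencedata;
  dfs -put /local/path/referencedata.tar.gz /user/hive/referencedata;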
Header
The header fields of the input table, in comma-separated format.
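For instance, the header for a hypothetical addressing input might be passed as:

  AddressLine1,City,StateProvince,PostalCode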
Input Table
The table that provides the input records for the Hive UDF to be run.
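As a minimal sketch, an input table over comma-separated data on HDFS could be defined as follows; the table name, columns, and location are hypothetical:

  CREATE EXTERNAL TABLE customer_addresses (
    addressline1 STRING,
    city STRING,
    stateprovince STRING,
    postalcode STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hive/input/customer_addresses';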
Candidate Table
The table that provides the candidate records for the Hive UDF to be run, in the case of the Interflow Match UDAF.
Suspect Table
The table that provides the suspect records for the Hive UDF to be run, in the case of the Interflow Match UDAF.
hive.fetch.task.conversion
To convert SELECT queries to a single FETCH task, minimizing latency.

Set the value to none or minimal. The default is minimal.

Note: This configuration is required for all UDFs.
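For example, from the Hive session:

  SET hive.fetch.task.conversion=none;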
hive.map.aggr
To turn the aggregation of data between the Mapper and the Reducer on or off. By default, this Hive configuration property is true and the data is aggregated.

Set this value to false for all Hive jobs in the SDK.

Note: This configuration is required for all UDAFs.
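For example, from the Hive session:

  SET hive.map.aggr=false;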
General Configurations
The memory configurations required to run the job.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
Input Configurations
The settings for the input data.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
Engine Configurations
To set various engine configurations, such as database settings, the COBOL runtime path, and the preloading type.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
LD_LIBRARY_PATH
To set this environment variable to the paths of the various COBOL libraries required while running the Hive jobs.
Note: This configuration is required only for the Validate Address Hive UDF.
Process Type
To specify the desired validation level to be used in a particular Hive job of the SDK. Currently, only address validation is supported.

Set this value to VALIDATE.

Note: This configuration is required only for the Validate Address and Validate Address Loqate Hive UDAFs.
Output
The output of the Hive UDF, which may be displayed on the console or dumped to an output file.
Query
The query to run the required Hive UDF.
For each job, you can do either of the following using the applicable query syntax, as shown in the example after this list:
  • Display the output of the job on the console.
  • Dump the output of the job to a designated output file.
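The sketch below strings these pieces together under the assumptions noted earlier: the JAR path, alias, class name, table, and columns are hypothetical placeholders, and the actual arguments expected by a given UDF depend on that UDF and its configuration:

  -- Register the module JAR and the UDF under an alias (placeholder names).
  ADD JAR /path/to/spectrum-bigdata-module.jar;
  CREATE TEMPORARY FUNCTION matchudf AS 'com.example.dataquality.SampleMatchUDF';

  -- Display the output of the job on the console.
  SELECT matchudf(addressline1, city, stateprovince, postalcode)
  FROM customer_addresses;

  -- Or dump the output of the job to a designated HDFS directory.
  INSERT OVERWRITE DIRECTORY '/user/hive/output/results'
  SELECT matchudf(addressline1, city, stateprovince, postalcode)
  FROM customer_addresses;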