Components of a Hive Function

The key components required to run a Spectrum™ Data & Address Quality for Big Data SDK Hive UDF are:
JAR File
The Spectrum™ Data & Address Quality for Big Data SDK Hive JAR file of the module to which the desired Data Quality Hive UDF belongs. The JAR must be registered in the Hive session before any of its UDFs can be used.
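For example, the module JAR can be registered from the Hive session with the ADD JAR command; the path and file name below are hypothetical placeholders, not the actual JAR name:

  ADD JAR /path/to/spectrum-bigdata-module.jar;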
Job UDF / UDAF
Each Data Quality job is provided as either a User Defined Function (UDF) or a User Defined Aggregation Function (UDAF).
Alias
The alias assigned to a Hive UDF. This is optional.
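For example, a UDF can be registered under an alias with CREATE TEMPORARY FUNCTION; the alias and the fully qualified class name below are hypothetical placeholders for the desired Data Quality UDF class:

  CREATE TEMPORARY FUNCTION matchudf AS 'com.example.dataquality.SampleMatchUDF';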
Configurations
The rules, specified in JSON format, and other configuration details that determine how the job is run.
Reference Data
The reference data can reside on the Hadoop Distributed File System (HDFS) or locally on the cluster machines.
On HDFS, the reference data can be in either of these formats:
  • As files
  • As an archive
For local placement, the reference data must be present on every node of the cluster at the same path. An example of placing reference data on HDFS is shown below.
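As an illustration, reference data can be copied to HDFS from within a Hive session using the dfs command; the local and HDFS paths below are hypothetical:

  dfs -mkdir -p /user/hive/referencedata;
  dfs -put /local/path/referencedata.tar.gz /user/hive/referencedata;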
Header
The header fields of the input table, in comma-separated format.
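For instance, the header for a hypothetical addressing input might be passed as:

  AddressLine1,City,StateProvince,PostalCode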
Input Table
The table that provides the input records for the Hive UDF to be run.
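As a minimal sketch, an input table over comma-separated data on HDFS could be defined as follows; the table name, columns, and location are hypothetical:

  CREATE EXTERNAL TABLE customer_addresses (
    addressline1 STRING,
    city STRING,
    stateprovince STRING,
    postalcode STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hive/input/customer_addresses';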
Candidate Table
The table that provides the candidate records for the Hive UDF to be run, in the case of the Interflow Match UDAF.
Suspect Table
The table that provides the suspect records for the Hive UDF to be run, in the case of the Interflow Match UDAF.
hive.fetch.task.conversion
To convert SELECT queries to a single FETCH task, minimizing latency.

Set the value to none or minimal. The default is minimal.

Note: This configuration is required for all UDFs.
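For example, from the Hive session:

  SET hive.fetch.task.conversion=none;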
hive.map.aggr
To turn the aggregation of data between the Mapper and the Reducer on or off. By default, this Hive configuration property is true and the data is aggregated.

Set this value to false for all Hive jobs in the SDK.

Note: This configuration is required for all UDAFs.
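For example, from the Hive session:

  SET hive.map.aggr=false;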
General Configurations
The memory configurations required to run the job.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
Input Configurations
The settings for the input data.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
Engine Configurations
To set various engine configurations, such as database settings, the COBOL runtime path, and the preloading type.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
LD_LIBRARY_PATH
To set this environment variable to the paths of the various COBOL libraries required while running the Hive jobs.
Note: This configuration is required only for the Validate Address Hive UDF.
Process Type
To specify the desired validation level to be used in a particular Hive job of the SDK. Currently, only address validation is supported.

Set this value to VALIDATE.

Note: This configuration is required only for the Validate Address and Validate Address Loqate Hive UDAFs.
Output
The output of the Hive UDF, which may be displayed on the console or dumped to an output file.
Query
The query to run the required Hive UDF.
For each job, you can do either of the following using the applicable query syntax, as shown in the example after this list:
  • Display the output of the job on the console.
  • Dump the output of the job to a designated output file.
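The sketch below strings these pieces together under the assumptions noted earlier: the JAR path, alias, class name, table, and columns are hypothetical placeholders, and the actual arguments expected by a given UDF depend on that UDF and its configuration:

  -- Register the module JAR and the UDF under an alias (placeholder names).
  ADD JAR /path/to/spectrum-bigdata-module.jar;
  CREATE TEMPORARY FUNCTION matchudf AS 'com.example.dataquality.SampleMatchUDF';

  -- Display the output of the job on the console.
  SELECT matchudf(addressline1, city, stateprovince, postalcode)
  FROM customer_addresses;

  -- Or dump the output of the job to a designated HDFS directory.
  INSERT OVERWRITE DIRECTORY '/user/hive/output/results'
  SELECT matchudf(addressline1, city, stateprovince, postalcode)
  FROM customer_addresses;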