Components of a Hive Function
The key components required to run a Spectrum™ Data & Address Quality for Big Data SDK Hive UDF are:
- JAR File
- The Spectrum™ Data & Address Quality for Big Data SDK Hive JAR file of the module to which the desired Data Quality Hive UDF belongs. This must be registered before using any UDF.
- Job UDF / UDAF
- Each Data Quality job is provided as either a User Defined Function (UDF) or a User Defined Aggregation Function (UDAF).
- Alias
- The alias assigned to a Hive UDF. This is optional.
- Configurations
- The rules specified in JSON format, and other configuration details, based on which the job is to be run.
- Reference Data
- The reference data can reside on the Hadoop Distributed File System (HDFS) or locally on
cluster machines. On HDFS, the reference data can be in either of these two formats:
- As files
- As archive
- Header
- The header fields of the input table, in comma-separated format.
- Input Table
- The table that provides the input records for the Hive UDF to be run.
- Candidate Table
- The table which provides the candidate records for the Hive UDF to be run, in case of the Interflow Match UDAF.
- Suspect Table
- The table which provides the suspect records for the Hive UDF to be run, in case of the Interflow Match UDAF.
- hive.fetch.task.conversion
- Converts select queries to a single FETCH task, minimizing latency.
Set the value to none or minimal. Default is minimal.
Note: This configuration is required for all UDFs.
- hive.map.aggr
- Turns the aggregation of data between Mapper and Reducer on or off. By default, it is true and the data is aggregated. Set this value to false for all Hive jobs in the SDK.
Note: This configuration is required for all UDAFs.
- General Configurations
- The memory configurations required to run the job.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
- Input Configurations
- The settings for the input data.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
- Engine Configurations
- Sets various configurations, such as database settings, COBOL runtime path, and preloading type.
Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
- LD_LIBRARY_PATH
- Set this environment variable to the paths of the various COBOL libraries required while running the Hive jobs.
Note: This configuration is required only for the Validate Address Hive UDF.
- Process Type
- Specifies the desired validation level to be used in a particular Hive job of the SDK. Currently, only address validation is supported.
Set this value to VALIDATE.
Note: This configuration is required only for the Validate Address and Validate Address Loqate Hive UDAFs.
- Output
- The output of the Hive UDF, which may be displayed on the console or dumped to an output file.
- Query
- The query to run the required Hive UDF. For each job, you can do either of the following using the applicable query syntax:
- Display the output of the job on the console.
- Dump the output of the job in a designated output file.
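
The components listed above could fit together in a Hive session roughly as follows. This is a minimal sketch, not the SDK's documented syntax: the JAR path, the class name, the alias `my_dq_udf`, the variable names `rule` and `header`, the table and column names, the output directory, and the UDF's argument order are all hypothetical placeholders. Only the Hive statements themselves (`ADD JAR`, `CREATE TEMPORARY FUNCTION`, `SET`, `INSERT OVERWRITE LOCAL DIRECTORY`) and the two Hive properties named in this section are standard.

```sql
-- JAR File: register the module's Hive JAR before using any UDF.
-- (hypothetical path)
ADD JAR /path/to/module-hive.jar;

-- Alias (optional): bind a short name to the UDF class.
-- (hypothetical alias and class name)
CREATE TEMPORARY FUNCTION my_dq_udf AS 'com.example.hive.MyDataQualityUDF';

-- Hive settings described in this section.
SET hive.fetch.task.conversion=none;  -- required for all UDFs
SET hive.map.aggr=false;              -- required for all UDAFs

-- Configurations and Header, held in session variables.
-- (hypothetical variable names and placeholder values)
SET rule='{"hypotheticalKey":"hypotheticalValue"}';
SET header='firstname,lastname';

-- Query, option 1: display the output of the job on the console.
-- (hypothetical argument order, table, and columns)
SELECT my_dq_udf(${hiveconf:rule}, ${hiveconf:header}, firstname, lastname)
FROM input_table;

-- Query, option 2: dump the output of the job to a designated location.
-- (hypothetical output directory)
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dq_output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT my_dq_udf(${hiveconf:rule}, ${hiveconf:header}, firstname, lastname)
FROM input_table;
```

Holding the rule JSON and header in session variables keeps the query itself short and lets the same query be reused with different configurations. The actual function names, class names, and argument lists depend on the module and job being run.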