Using a Hive UDF of the Universal Addressing Module

To run a Hive UDF job, you can either run the steps below individually in your Hive client within a single session, or compile all the required steps sequentially into an HQL file and run it in one go (a consolidated script is sketched at the end of this section).

  1. In your Hive client, log in to the required Hive database.
  2. Register the JAR file of the Spectrum™ Data & Address Quality for Big Data SDK UAM Module.
    ADD JAR /home/hduser/uam/uam.universaladdress.hive.${project.version}.jar;
  3. Create an alias for the Hive UDF of the Address Quality job you want to run.
    Note: The string in quotes is the fully qualified name of the class required to run this job.
    For example:
    CREATE TEMPORARY FUNCTION addressvalidation as 'com.pb.bdq.uam.process.hive.addressvalidation.AddressValidationUDF';
  4. Enable or disable Hive fetch task conversion. Setting hive.fetch.task.conversion to none disables the conversion, so the query always runs as a full job rather than a local fetch task; the values minimal and more enable it for simple queries.
    For example:
    set hive.fetch.task.conversion=none;
  5. Use hivevar:engineconf to set the engine configuration. It includes details such as the reference data settings, the COBOL runtime path, the process type, the preloading type, and the dpvDbPath, suiteLinkDBPath, ewsDBPath, rdiDBPath, and lacsDBPath database paths.
    For example:
    set hivevar:engineconf='{"referenceData":{"dataDir":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip",
                            "referenceDataPathLocation":"HDFS"},"cobolRuntimePath":"","modulesDir":"",
                            "dpvDbPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip",
                            "suiteLinkDBPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip",
                            "ewsDBPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip",
                            "rdiDBPath":null,
                            "lacsDBPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip"}';
  6. Specify the settings for the input data using the hivevar:inputoption parameter.
    For example:
    set hivevar:inputoption='{"casing":"Mixed","matchMode":"Relaxed","defaultCountry":"GBR",
                            "maximumResults":2,"returnInputAddress":false,
                            "returnParsedAddress":false,"returnPrecisionCode":false,"returnMatchScore":true,
                            "mustMatchAddressNumber":false,"mustMatchStreet":false,"mustMatchCity":false,
                            "mustMatchLocality":false,"mustMatchState":false,"mustMatchStateProvince":false,
                            "mustMatchPostCode":false,"keepMultiMatch":true,"preferPostalOverCity":false,
                            "cityFallback":true,"postalFallback":true,
                            "validationLevel":"ADDRESS"}';
  7. Set the general configurations, such as cacheSize, maxAddressObjectCount, and maxMemoryUsageMB, using hivevar:generalconf.
    For example:
    set hivevar:generalconf='{"cacheSize":"LARGE","maxThreadCount":8,"maxAddressObjectCount":8,
    "rangesToExpand":"NONE","flexibleRangeExpansion":"ON","enableTransactionLogging":false,
    "maxMemoryUsageMB":1024,"verbose":false}';
  8. Use hivevar:processtype to specify the process type for the Hive job. Currently, only address validation is supported, so set this value to VALIDATE.
    For example:
    set hivevar:processtype='VALIDATE';
  9. Specify the header fields of the input table in comma-separated format, and assign them to a variable or configuration property.
    For example:
    set hivevar:header='inputkeyvalue,AddressLine1,AddressLine2,City,postalcode,StateProvince,firmname,Country';
  10. To run the job and display the job output on the console, write the query as indicated in this example (a sketch of the address input table assumed by these queries appears after these steps):
    Note: This query returns a map of key-value pairs containing the output fields for each row.
    SELECT tmp2.record["HouseNumber"],tmp2.record["Confidence"],tmp2.record["AddressLine1"],
    tmp2.record["StreetName"],tmp2.record["PostalCode"],tmp2.record["ElementInputStatus"],
    tmp2.record["MailabilityScore"] FROM ( SELECT addressvalidation(${hivevar:engineconf},
    ${hivevar:generalconf},${hivevar:inputoption},${hivevar:processtype},${hivevar:header},
    inputkeyvalue,addressline1,addressline2,city,postalcode,stateprovince,firmname,country) as mygp
    from address) as addressgroup LATERAL VIEW explode(addressgroup.mygp) tmp2 as record ;
    To run the job and write the job output to a designated directory, write the query as indicated in this example:
    INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/GlobalAddressing/' ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE
    SELECT tmp2.record["HouseNumber"],tmp2.record["Confidence"],tmp2.record["AddressLine1"],
    tmp2.record["StreetName"],tmp2.record["PostalCode"],tmp2.record["ElementInputStatus"],
    tmp2.record["MailabilityScore"] FROM ( SELECT addressvalidation(${hivevar:engineconf},
    ${hivevar:generalconf},${hivevar:inputoption},${hivevar:processtype},${hivevar:header},
    inputkeyvalue,addressline1,addressline2,city,postalcode,stateprovince,firmname,country) as mygp
    from address) as addressgroup LATERAL VIEW explode(addressgroup.mygp) tmp2 as record ;
    Note: Use the alias defined in step 3 for the UDF; these examples use addressvalidation.
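
The queries above assume an input table named address whose columns match the header fields set in step 9. The following is a minimal sketch of such a table; the column types, delimiter, and HDFS location are illustrative assumptions, not part of the product.

    -- Hypothetical input table matching the header fields from step 9.
    -- Types, delimiter, and location are assumptions for illustration.
    CREATE EXTERNAL TABLE IF NOT EXISTS address (
      inputkeyvalue  STRING,   -- unique key identifying each input record
      addressline1   STRING,
      addressline2   STRING,
      city           STRING,
      postalcode     STRING,
      stateprovince  STRING,
      firmname       STRING,
      country        STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/hduser/input/address';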
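As mentioned at the start of this section, the steps can also be compiled into a single HQL file and run in one go. The following is a minimal sketch assuming the JAR path, reference data paths, and address table used above; the file name uam_addressvalidation.hql and the database name addressdb are illustrative assumptions.

    -- uam_addressvalidation.hql: steps 1 through 10 compiled into one file.
    USE addressdb;   -- step 1 (assumed database name)
    ADD JAR /home/hduser/uam/uam.universaladdress.hive.${project.version}.jar;
    CREATE TEMPORARY FUNCTION addressvalidation as 'com.pb.bdq.uam.process.hive.addressvalidation.AddressValidationUDF';
    set hive.fetch.task.conversion=none;
    set hivevar:engineconf='{"referenceData":{"dataDir":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip",
        "referenceDataPathLocation":"HDFS"},"cobolRuntimePath":"","modulesDir":"",
        "dpvDbPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip",
        "suiteLinkDBPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip",
        "ewsDBPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip",
        "rdiDBPath":null,"lacsDBPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip"}';
    set hivevar:inputoption='{"casing":"Mixed","matchMode":"Relaxed","defaultCountry":"GBR",
        "maximumResults":2,"returnInputAddress":false,"returnParsedAddress":false,
        "returnPrecisionCode":false,"returnMatchScore":true,"mustMatchAddressNumber":false,
        "mustMatchStreet":false,"mustMatchCity":false,"mustMatchLocality":false,
        "mustMatchState":false,"mustMatchStateProvince":false,"mustMatchPostCode":false,
        "keepMultiMatch":true,"preferPostalOverCity":false,"cityFallback":true,
        "postalFallback":true,"validationLevel":"ADDRESS"}';
    set hivevar:generalconf='{"cacheSize":"LARGE","maxThreadCount":8,"maxAddressObjectCount":8,
        "rangesToExpand":"NONE","flexibleRangeExpansion":"ON","enableTransactionLogging":false,
        "maxMemoryUsageMB":1024,"verbose":false}';
    set hivevar:processtype='VALIDATE';
    set hivevar:header='inputkeyvalue,AddressLine1,AddressLine2,City,postalcode,StateProvince,firmname,Country';
    SELECT tmp2.record["HouseNumber"],tmp2.record["Confidence"],tmp2.record["AddressLine1"],
    tmp2.record["StreetName"],tmp2.record["PostalCode"],tmp2.record["ElementInputStatus"],
    tmp2.record["MailabilityScore"] FROM ( SELECT addressvalidation(${hivevar:engineconf},
    ${hivevar:generalconf},${hivevar:inputoption},${hivevar:processtype},${hivevar:header},
    inputkeyvalue,addressline1,addressline2,city,postalcode,stateprovince,firmname,country) as mygp
    from address) as addressgroup LATERAL VIEW explode(addressgroup.mygp) tmp2 as record;

You can then run the file in one go, for example with: hive -f uam_addressvalidation.hql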