Using a Hive UDF of Global Addressing Module
To run each Hive UDF job, you can either run these steps individually on your Hive client within a single session, or create an HQL file compiling all the required steps sequentially and run it in one go.
- In your Hive client, log in to the required Hive database.
-
Register the JAR file of Spectrum™ Data & Address Quality for Big Data SDK GAM Module.
ADD JAR /home/hduser/gam/ gam.globaladdressvalidation.hive.${project.version}.jar
-
Create an alias for the Hive UDF of the Address Quality job you want to
run.
Note: String in quotes represents the class names needed for this job to run.For example:
CREATE TEMPORARY FUNCTION addressvalidation as 'com.pb.bdq.gam.process.hive.addressvalidation.AddressValidationUDF';
-
Enable or disable the hive fetch task conversion.
For Example:
set hive.fetch.task.conversion=none;
-
Use
hivevar:engineconf
to set the engine configurations, such as, Database Path, Country Code, and the Process Type.For Example:set hivevar:engineconf='{"productDatabaseInfoList": [{"dbPath":"/home/hduser/ReferenceData/AddressQuality/GAM/GGB062017", "countryCode":"GBR","processType":"VALIDATE"}]}';
-
Specify the settings for the input data using the
hivevar:inputoption
parameter.For example:set hivevar:inputoption='{"casing":"Mixed","matchMode":"Relaxed", "defaultCountry":"GBR","maximumResults":2,"returnInputAddress":false, "returnParsedAddress":false,"returnPrecisionCode":false,"mustMatchAddressNumber":false, "mustMatchStreet":false,"mustMatchCity":false,"mustMatchLocality":false, "mustMatchState":false,"mustMatchStateProvince":false,"mustMatchPostCode":false, "keepMultiMatch":true,"preferPostalOverCity":false,"cityFallback":true, "postalFallback":true,"validationLevel":"ADDRESS"}';
-
Specify the header fields of the input table in comma-separated
format, and assign to a variable or configuration property.
set hivevar:header='inputkeyvalue,AddressLine1,AddressLine2,City,postalcode, StateProvince,firmname,Country';
-
To run the job and display the job output on the
console, write the query as indicated in this example:
Note: This query returns a map of key value pairs containing output fields for each row.
To run the job and dump the job output in a designated file, write the query as indicated in this example:SELECT tmp2.record["HouseNumber"],tmp2.record["Confidence"],tmp2.record["AddressLine1"], tmp2.record["StreetName"],tmp2.record["PostalCode"] FROM (SELECT addressvalidation(${hivevar:engineconf},${hivevar:inputoption}, ${hivevar:header},inputkeyvalue,addressline1,AddressLine2,city,postalcode,stateprovince,firmname, country)as mygp from address_validation) as addressgroup LATERAL VIEW explode(addressgroup.mygp)tmp2 as record;
INSERT OVERWRITE LOCAL DIRECTORY '/home/hduser/GlobalAddressValidation/'row format delimited FIELDS TERMINATED BY ',' lines terminated by '\n' STORED AS TEXTFILE SELECT * FROM (SELECT addressvalidation(${hivevar:engineconf}, ${hivevar:inputoption},${hivevar:header},inputkeyvalue,addressline1,AddressLine2,city,postalcode, stateprovince,firmname,country)as mygp from address_validation) as addressgroup LATERAL VIEW explode(addressgroup.mygp) tmp2 as record;
Note: Use the alias defined earlier for the UDF.