Using a Hive UDF

To run each Hive UDF job, you can either run these steps individually on your Hive client within a single session, or create an HQL file compiling all the required steps sequentially and run it in one go.

  1. In your Hive client, log in to the required Hive database.
  2. Register the JAR file of the particular Spectrum™ Data & Address Quality for Big Data SDK Module to which the desired Data Quality Hive UDF belongs.
  3. In case of the Validate Address UDF, to set the path of the COBOL libraries, set the environment variable LD_LIBRARY_PATH as:
    set mapreduce.admin.user.env = LD_LIBRARY_PATH=/home/hduser/~/runtime/lib:
    /home/hduser/~/runtime/bin:/home/hduser/~/server/modules/universaladdress/lib, ACU_RUNCBL_JNI_ONLOAD_DISABLE=1, G1RTS=/home/hduser/~/ ;
  4. In case of the Validate Address Global UDF, add the file libAddressDoctor5.so file as well.
  5. In case of the Validate Address Loqate UDF, add these required files to the distributed cache.
    • loqate-core.car
    • LoqateVerificationLevel.csv
    • Loqate.csv
    • countryTables.csv
    • countryNameTables.csv
  6. Create an alias for the Hive UDF of the Data Quality job you wish to run.
    For example:
    CREATE TEMPORARY FUNCTION matchkeygenerator as 'com.pb.bdq.amm.process.hive.matchkeygenerator.MatchKeyGeneratorUDF';
  7. Specify the reference data path.
    • If the reference data is on HDFS, add the reference data and set reference directory as shown in this example.

      If the reference data is in file format:

      hdfs://<HOST>:<PORT>/home/hduser/Refdata/;
      set hivevar:refdir='./Refdata';

      If the reference data is in archive format:

      hdfs://<HOST>:<PORT>/home/hduser/ref.zip;
      set hivevar:refdir='./ref.zip';
    • If the reference data is on local path, ensure data is present on each node of the cluster on the same path.

      Set the reference directory as shown:

      set hivevar:refdir='/home/hadoop/reference/';
  8. Specify the configurations, such as the match rule, sort field, express match column, and other details for the job and assign to respective variable or configuration properties.
    Note: The rule must be in JSON format.

    For example

    set rule='{"matchKeys":[{"expressMatchKey":false, "matchKeyField":"MatchKey1",
    "rules":[{"algorithm":"Soundex"," field":"businessname", "startPosition":1, "length":0,"active":true, "sortInput":null, "removeNoiseCharacters":false}]}, 
    {"expressMatchKey":false, "matchKeyField":"MatchKey2", "rules":[{"algorithm":"Koeln", "field":"businessname", "startPosition":1, "length":0, "active":true, "sortInput":null, "removeNoiseCharacters":false}]}]}';
    Note: Use the configuration properties in the respective job configurations. For example, pb.bdq.match.rule, pb.bdq.match.express.column, and pb.bdq.consolidation.sort.field, wherever indicated in the respective sample HQL files.
  9. Specify the header fields of the input table, in comma-separated format, and assign to a variable or configuration property.
    set pb.bdq.match.header='businessname,recordid';
    Note: Use the configuration property where indicated in the HQL file. For example, pb.bdq.match.header and pb.bdq.consolidation.header.
  10. Switch off the aggregatiion of data between Reducer and Mapper, by seting the Hive.Map.Aggr environment variable configuration to false, as indicated in this example:
    set hive.map.aggr = false;
    Note: This configuration is required for all UDFs.
  11. Set the general configurations for running the job as indicated in this example:
    set pb.bdq.uam.universaladdress.general.configuration = {"dFileType":"SPLIT", "dMemoryModel":"MEDIUM", "lacsLinkMemoryModel":"MEDIUM", "suiteLinkMemoryModel":"MEDIUM"};
    Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
  12. Set the input configurations for running the job as indicated in this example:
    set pb.bdq.uam.universaladdress.input.configuration = 
    {"outputStandardAddress":true, "outputPostalData":false, "outputParsedInput":false,
     "outputAddressBlocks":true, "performUSProcessing":true, "performCanadianProcessing":
    false, "performInternationalProcessing":false, "outputFormattedOnFail":false, 
    "outputCasing":"MIXED", "outputPostalCodeSeparator":true, "outputMultinationalCharacters":
    false, "performDPV":false, "performRDI":false, "performESM":false, "performASM":false, 
    "performEWS":false, "performLACSLink":false, "performLOT":false, "failOnCMRAMatch":false,
     "extractFirm":false, "extractUrb":false, "outputReport3553":false, "outputReportSERP":false,
     "outputReportSummary":true, "outputCASSDetail":false, "outputFieldLevelReturnCodes":false,
     "keepMultimatch":false, "maximumResults":10, "standardAddressFormat":
    "STANDARD_ADDRESS_FORMAT_COMBINED_UNIT", "standardAddressPMBLine":
    "STANDARD_ADDRESS_PMB_LINE_NONE", "cityNameFormat":"CITY_FORMAT_STANDARD",
     "vanityCityFormatLong":true, "outputCountryFormat":"ENGLISH", "homeCountry":
    "United States", "streetMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", 
    "firmMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "directionalMatchingStrictness":
    "MATCHING_STRICTNESS_MEDIUM", "dualAddressLogic":"DUAL_NORMAL", 
    "dpvSuccessfulStatusCondition":"A", "reportListFileName":"", "reportlistProcessorName":"",
     "reportlistNumber":1, "reportMailerAddress":"", "reportMailerName":"", 
    "reportMailerCityLine":"", "canReportMailerCPCNumber":"", "canReportMailerAddress":"",
     "canReportMailerName":"", "canReportMailerCityLine":"", "internationalCityStreetSearching"
    :100, "addressLineSearchOnFail":true, "outputStreetAlias":true, "outputVeriMoveBlock":false,
     "dpvDetermineNoStat":false, "dpvDetermineVacancy":false, "outputAbbreviatedAlias":false,
     "outputPreferredAlias":false, "outputPreferredCity":"CITY_OVERRIDE_NAME_ZIP4", 
    "performSuiteLink":false, "suppressZplusPhantomCarrierR777":false, "canStandardAddressFormat"
    :"D", "canEnglishApartmentLabel":"APT", "canFrenchApartmentLabel":"APP", "canFrenchFormat":
    "C", "canOutputCityFormat":"D", "canOutputCityAlias":true, "canDualAddressLogic":"D", 
    "canPreferHouseNum":false, "canSSLVRFLG":false, "canRuralRouteFormat":"A", 
    "canNonCivicFormat":"A", "canDeliveryOfficeFormat":"I", "canEnableSERP":false, 
    "canSwitchManagedPostalCodeConfidence":false, "stats":null, "counts":null, "z3seg":null,
     "serpStats":null, "dpvSeedList":null, "lacsSeedList":null, "zipInputSet":null, "reportName"
    :null, "currentUser":null, "jobName":null, "jobId":null, "jobRequest":false, "properties":
    {"DPVDetermineVacancy":"N", "DualAddressLogic":"N", "ExtractUrb":"N", "CanFrenchFormat"
    :"C", "AddressLineSearchOnFail":"Y", "OutputFieldLevelReturnCodes":"N", 
    "OutputFormattedOnFail":"N", "OutputStreetNameAlias":"Y", "OutputReportSERP":"N",
     "OutputAddressBlocks":"Y", "ExtractFirm":"N", "CanEnglishApartmentLabel":"APT", 
    "OutputPreferredCity":"Z", "FirmMatchingStrictness":"M", "CanFrenchApartmentLabel":"APP",
     "KeepMultimatch":"N", "StandardAddressPMBLine":"N", "PerformSuiteLink":"N", 
    "CanStandardAddressFormat":"D", "DPVSuccessfulStatusCondition":"A", "PerformLACSLink":"N",
     "PerformUSProcessing":"Y", "PerformEWS":"N", "StandardAddressFormat":"C", 
    "SuppressZplusPhantomCarrierR777":"N", "HomeCountry":"United States", 
    "ReportMailerAddress":"", "OutputReport3553":"N", "OutputVeriMoveDataBlock":"N",
     "CanDeliveryOfficeFormat":"I", "OutputAbbreviatedAlias":"N", "PerformCanadianProcessing":
    "N", "PerformDPV":"N", "PerformInternationalProcessing":"N", "CanSSLVRFlg":"N", 
    "StreetMatchingStrictness":"M", "InternationalCityStreetSearching":"100", 
    "canSwitchManagedPostalCodeConfidence":"N", "CanDualAddressLogic":"D", "PerformASM":
    "N", "OutputCasing":"M", "ReportListFileName":"", "CanReportMailerAddress":"", 
    "ReportMailerCityLine":"", "CanReportMailerCPCNumber":"", "ReportListProcessorName":"",
     "CanOutputCityAlias":"Y", "DirectionalMatchingStrictness":"M", "CanRuralRouteFormat":
    "A", "CanOutputCityFormat":"D", "ReportListNumber":"1", "CanReportMailerCityLine":"",
     "OutputMultinationalCharacters":"N", "EnableSERP":"N", "CanNonCivicFormat":"A", 
    "OutputShortCityName":"S", "OutputPostalCodeSeparator":"Y", "FailOnCMRAMatch":"N", 
    "PerformLOT":"N", "OutputCountryFormat":"E", "CanPreferHouseNum":"N", 
    "CanReportMailerName":"", "PerformRDI":"N", "ReportMailerName":"", "PerformESM":"N",
     "OutputReportSummary":"Y", "OutputVanityCityFormatLong":"Y", "OutputPreferredAlias":"N",
     "DPVDetermineNoStat":"N", "MaximumResults":"10"}}};
    Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
  13. Set the engine configurations for running the job as indicated in this example:
    set pb.bdq.uam.universaladdress.engine.configurations = { "referenceData":{ "dataDir":"/home/hduser/resources/uam/universaladdress/UAM_universaladdress4.0_Feb15/", "referenceDataPathLocation":"LocaltoDataNodes"}, "cobolRuntimePath":"/home/hduser/addressquality/", "modulesDir":"/home/hduser/tapan/addressquality/modules", "dpvDbPath":null, "suiteLinkDBPath":null, "ewsDBPath":null, "rdiDBPath":null, "lacsDBPath":null};
    Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
  14. Set the process type to indicate the desired validation level. We currently support address validation only.
    For example, in the Validate Address job, set the process type as below:
    set pb.bdq.uam.universaladdress.process.type=VALIDATE; 
    Note: This configuration is required only for the Validate Address and Validate Address Loqate Hive UDAFs.
  15. To run the job and display the job output on the console, write the query as indicated in this example:
    SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM (
    SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
    To run the job and dump the job output in a designated file, write the query as indicated in the below example:
    INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/MatchKey/' row format delimited FIELDS TERMINATED BY ',' MAP FIELDS TERMINATED BY ':' COLLECTION ITEMS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE 
    SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM (
    SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
    Note: Ensure to use the alias defined earlier for the UDF.
Important: For all UDAF jobs, use the respective configuration properties as variables while defining the input parameters, where indicated in the respective sample HQL files.

For example, pb.bdq.match.rule, pb.bdq.match.express.column and pb.bdq.consolidation.sort.field.