Using a Hive UDF

To run each Hive UDF-based job, you can either run these steps individually on your Hive client within a single session, or create an HQL file compiling all the required steps sequentially and run it in one go.

  1. In your Hive client, log in to the required Hive database.
  2. Register the JAR file of the particular Big Data Quality SDK Module to which the desired Data Quality Hive UDF belongs.
  3. In case of the Validate Address UDAF, to set the path of the COBOL libraries, set the environment variable LD_LIBRARY_PATH as below:
    set mapreduce.admin.user.env = LD_LIBRARY_PATH=/home/hduser/~/runtime/lib:
    /home/hduser/~/runtime/bin:/home/hduser/~/server/modules/universaladdress/lib, ACU_RUNCBL_JNI_ONLOAD_DISABLE=1, G1RTS=/home/hduser/~/ ;
  4. In case of the Validate Address Global UDAF, add the file libAddressDoctor5.so file as well.
  5. In case of the Validate Address Loqate UDAF, add these required files to the distributed cache.
    • loqate-core.car
    • LoqateVerificationLevel.csv
    • Loqate.csv
    • countryTables.csv
    • countryNameTables.csv
  6. Create an alias for the Hive UDF of the Data Quality job you wish to run.
    For example:
    CREATE TEMPORARY FUNCTION matchkeygenerator as 'com.pb.bdq.amm.process.hive.matchkeygenerator.MatchKeyGeneratorUDF';
  7. Specify the configurations like the match rule, sort field, express match column, and other details for the job and assign to respective variable or configuration properties.
    Note: The rule must be in JSON format.

    For example:

    set rule='{"matchKeys":[{"expressMatchKey":false, "matchKeyField":"MatchKey1",
    "rules":[{"algorithm":"Soundex"," field":"businessname", "startPosition":1, "length":0,"active":true, "sortInput":null, "removeNoiseCharacters":false}]}, 
    {"expressMatchKey":false, "matchKeyField":"MatchKey2", "rules":[{"algorithm":"Koeln", "field":"businessname", "startPosition":1, "length":0, "active":true, "sortInput":null, "removeNoiseCharacters":false}]}]}';
    Note: Ensure to use the configuration properties in the respective job configurations. For example, pb.bdq.match.rule, pb.bdq.match.express.column, pb.bdq.consolidation.sort.field, and so on where indicated in the respective sample HQL files.
  8. Specify the header fields of the input table, in comma-separated format, and assign to a variable or configuration property.
    set pb.bdq.match.header='businessname,recordid';
    Note: Ensure to use the configuration property, where indicated. For example, pb.bdq.match.header, pb.bdq.consolidation.header, and so on where indicated in the respective sample HQL files.
  9. Switch off the aggregatiion of data between Reducer and Mapper, by seting the Hive.Map.Aggr environment variable configuration to false, as indicated in the below example:
    set hive.map.aggr = false;
    Note: This configuration is required for all UDAFs.
  10. Set the general configurations for running the job as indicated in the below example:
    set pb.bdq.uam.universaladdress.general.configuration = {"dFileType":"SPLIT", "dMemoryModel":"MEDIUM", "lacsLinkMemoryModel":"MEDIUM", "suiteLinkMemoryModel":"MEDIUM"};
    Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
  11. Set the input configurations for running the job as indicated in the below example:
    set pb.bdq.uam.universaladdress.input.configuration = {"outputStandardAddress":true, "outputPostalData":false, "outputParsedInput":false, "outputAddressBlocks":true, "performUSProcessing":true, "performCanadianProcessing":false, "performInternationalProcessing":false, "outputFormattedOnFail":false, "outputCasing":"MIXED", "outputPostalCodeSeparator":true, "outputMultinationalCharacters":false, "performDPV":false, "performRDI":false, "performESM":false, "performASM":false, "performEWS":false, "performLACSLink":false, "performLOT":false, "failOnCMRAMatch":false, "extractFirm":false, "extractUrb":false, "outputReport3553":false, "outputReportSERP":false, "outputReportSummary":true, "outputCASSDetail":false, "outputFieldLevelReturnCodes":false, "keepMultimatch":false, "maximumResults":10, "standardAddressFormat":"STANDARD_ADDRESS_FORMAT_COMBINED_UNIT", "standardAddressPMBLine":"STANDARD_ADDRESS_PMB_LINE_NONE", "cityNameFormat":"CITY_FORMAT_STANDARD", "vanityCityFormatLong":true, "outputCountryFormat":"ENGLISH", "homeCountry":"United States", "streetMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "firmMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "directionalMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "dualAddressLogic":"DUAL_NORMAL", "dpvSuccessfulStatusCondition":"A", "reportListFileName":"", "reportlistProcessorName":"", "reportlistNumber":1, "reportMailerAddress":"", "reportMailerName":"", "reportMailerCityLine":"", "canReportMailerCPCNumber":"", "canReportMailerAddress":"", "canReportMailerName":"", "canReportMailerCityLine":"", "internationalCityStreetSearching":100, "addressLineSearchOnFail":true, "outputStreetAlias":true, "outputVeriMoveBlock":false, "dpvDetermineNoStat":false, "dpvDetermineVacancy":false, "outputAbbreviatedAlias":false, "outputPreferredAlias":false, "outputPreferredCity":"CITY_OVERRIDE_NAME_ZIP4", "performSuiteLink":false, "suppressZplusPhantomCarrierR777":false, "canStandardAddressFormat":"D", "canEnglishApartmentLabel":"APT", "canFrenchApartmentLabel":"APP", "canFrenchFormat":"C", "canOutputCityFormat":"D", "canOutputCityAlias":true, "canDualAddressLogic":"D", "canPreferHouseNum":false, "canSSLVRFLG":false, "canRuralRouteFormat":"A", "canNonCivicFormat":"A", "canDeliveryOfficeFormat":"I", "canEnableSERP":false, "canSwitchManagedPostalCodeConfidence":false, "stats":null, "counts":null, "z3seg":null, "serpStats":null, "dpvSeedList":null, "lacsSeedList":null, "zipInputSet":null, "reportName":null, "currentUser":null, "jobName":null, "jobId":null, "jobRequest":false, "properties":{"DPVDetermineVacancy":"N", "DualAddressLogic":"N", "ExtractUrb":"N", "CanFrenchFormat":"C", "AddressLineSearchOnFail":"Y", "OutputFieldLevelReturnCodes":"N", "OutputFormattedOnFail":"N", "OutputStreetNameAlias":"Y", "OutputReportSERP":"N", "OutputAddressBlocks":"Y", "ExtractFirm":"N", "CanEnglishApartmentLabel":"APT", "OutputPreferredCity":"Z", "FirmMatchingStrictness":"M", "CanFrenchApartmentLabel":"APP", "KeepMultimatch":"N", "StandardAddressPMBLine":"N", "PerformSuiteLink":"N", "CanStandardAddressFormat":"D", "DPVSuccessfulStatusCondition":"A", "PerformLACSLink":"N", "PerformUSProcessing":"Y", "PerformEWS":"N", "StandardAddressFormat":"C", "SuppressZplusPhantomCarrierR777":"N", "HomeCountry":"United States", "ReportMailerAddress":"", "OutputReport3553":"N", "OutputVeriMoveDataBlock":"N", "CanDeliveryOfficeFormat":"I", "OutputAbbreviatedAlias":"N", "PerformCanadianProcessing":"N", "PerformDPV":"N", "PerformInternationalProcessing":"N", "CanSSLVRFlg":"N", "StreetMatchingStrictness":"M", "InternationalCityStreetSearching":"100", "canSwitchManagedPostalCodeConfidence":"N", "CanDualAddressLogic":"D", "PerformASM":"N", "OutputCasing":"M", "ReportListFileName":"", "CanReportMailerAddress":"", "ReportMailerCityLine":"", "CanReportMailerCPCNumber":"", "ReportListProcessorName":"", "CanOutputCityAlias":"Y", "DirectionalMatchingStrictness":"M", "CanRuralRouteFormat":"A", "CanOutputCityFormat":"D", "ReportListNumber":"1", "CanReportMailerCityLine":"", "OutputMultinationalCharacters":"N", "EnableSERP":"N", "CanNonCivicFormat":"A", "OutputShortCityName":"S", "OutputPostalCodeSeparator":"Y", "FailOnCMRAMatch":"N", "PerformLOT":"N", "OutputCountryFormat":"E", "CanPreferHouseNum":"N", "CanReportMailerName":"", "PerformRDI":"N", "ReportMailerName":"", "PerformESM":"N", "OutputReportSummary":"Y", "OutputVanityCityFormatLong":"Y", "OutputPreferredAlias":"N", "DPVDetermineNoStat":"N", "MaximumResults":"10"}}};
    Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
  12. Set the engine configurations for running the job as indicated in the below example:
    set pb.bdq.uam.universaladdress.engine.configurations = { "referenceData":{ "dataDir":"/home/hduser/resources/uam/universaladdress/UAM_universaladdress4.0_Feb15/", "referenceDataPathLocation":"LocaltoDataNodes"}, "cobolRuntimePath":"/home/hduser/tapan/addressquality/", "modulesDir":"/home/hduser/tapan/addressquality/modules", "dpvDbPath":null, "suiteLinkDBPath":null, "ewsDBPath":null, "rdiDBPath":null, "lacsDBPath":null};
    Note: This configuration is required only for Universal Addressing Module Hive UDAFs.
  13. Set the process type to indicate the desired validation level. We currently support address validation only.
    For example, in the Validate Address job, set the process type as below:
    set pb.bdq.uam.universaladdress.process.type=VALIDATE; 
    Note: This configuration is required only for the Validate Address and Validate Address Loqate Hive UDAFs.
  14. To run the job and display the job output on the console, write the query as indicated in the below example:
    SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM (
    SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
    To run the job and dump the job output in a designated file, write the query as indicated in the below example:
    INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/MatchKey/' row format delimited FIELDS TERMINATED BY ',' MAP FIELDS TERMINATED BY ':' COLLECTION ITEMS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE 
    SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM (
    SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
    Note: Ensure to use the alias defined earlier for the UDF.
Important: For all UDAF jobs, use the respective configuration properties as variables while defining the input parameters, where indicated in the respective sample HQL files.

For example, pb.bdq.match.rule, pb.bdq.match.express.column, pb.bdq.consolidation.sort.field, and so on.