Using a Hive UDF of the Universal Name Module

To run a Hive UDF job, you can either run these steps individually on your Hive client within a single session, or compile all the required steps sequentially into an HQL file and run it in one go.

  1. In your Hive client, log in to the required Hive database.
  2. Register the JAR file of the Spectrum™ Data & Address Quality for Big Data SDK UNM Module.
    ADD JAR <Directory path>/unm.hive.${project.version}.jar;
  3. Create an alias for the Hive UDF of the Data Quality job you wish to run.
    For example:
    CREATE TEMPORARY FUNCTION opennameparser as 'com.pb.bdq.unm.process.hive.opennameparser.OpenNameParserUDF';
  4. Specify the reference data path.
    • Reference data is on HDFS:
      • To have the reference data downloaded to the job's working directory:
        • If the reference data is in unarchived format, set the reference data details as:
          set hivevar:refereceDataDetails='{"referenceDataPathLocation":"HDFS",
          "dataDir":"./referenceData","dataDownloader":{"dataDownloader":"DC"}}';
        • If the reference data is in archived format, set the reference data details as:
          set hivevar:refereceDataDetails='{"referenceDataPathLocation":"HDFS",
          "dataDir":"./referenceData.zip","dataDownloader":
          {"dataDownloader":"DC"}}';
      • To have the reference data downloaded to a local repository on the data nodes, set the reference data details as:
        set hivevar:refereceDataDetails='{"referenceDataPathLocation":"HDFS",
        "dataDir":"/home/data/dm/referenceData","dataDownloader":{"dataDownloader":
        "HDFS","localFSRepository":"/local/download"}}';
    • Reference data is on a local path: Ensure that the data is present at the same path on every node of the cluster, then set the reference data details as:

      set hivevar:refereceDataDetails='{"referenceDataPathLocation":"LocaltoDataNodes",
      "dataDir":"/home/data/referenceData"}';
  5. Specify the configurations and other details for the job, and assign these to respective variables or configuration properties.
    Note: The rule must be in JSON format.

    For example,

    set hivevar:rule='{"name":"name", "culture":"", "splitConjoinedNames":false, "shortcutThreshold":0, "parseNaturalOrderPersonalNames":false, "naturalOrderPersonalNamesPriority":1,
    "parseReverseOrderPersonalNames":false, "reverseOrderPersonalNamesPriority":2, "parseConjoinedNames":false, "naturalOrderConjoinedPersonalNamesPriority":3, "reverseOrderConjoinedPersonalNamesPriority":4, "parseBusinessNames":false, "businessNamesPriority":5}';
    Note: Use the configuration properties in the respective job configurations where indicated in the respective sample HQL files; for example, pb.bdq.match.rule, pb.bdq.match.express.column, and pb.bdq.consolidation.sort.field.
  6. Specify the header fields of the input table in comma-separated format, and assign to a variable or configuration property.
    set hivevar:header='inputrecordid,Name,nametype';
  7. To run the job and display the job output on the console, write the query as shown in this example:
    select adTable.adid["Name"], adTable.adid["NameScore"], adTable.adid["CultureCode"] from (select opennameparser(${hivevar:rule}, ${hivevar:refereceDataDetails},
    ${hivevar:header}, inputrecordid, name, nametype) as tmp1 from nameparser) as tmp LATERAL VIEW explode(tmp1) adTable AS adid;
    To run the job and write the job output to a designated directory, write the query as shown in the below example:
    INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/opennameparser/' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE
    select adTable.adid["Name"], adTable.adid["NameScore"], adTable.adid["CultureCode"] from (select opennameparser(${hivevar:rule}, ${hivevar:refereceDataDetails},
    ${hivevar:header}, inputrecordid, name, nametype) as tmp1 from nameparser) as tmp LATERAL VIEW explode(tmp1) adTable AS adid;
    Note: Use the alias defined earlier for the UDF.
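
Putting the steps above together, a complete HQL file for an OpenNameParser job might look like the following sketch. The database name (namedb), JAR path, JAR version, reference data choice, and output directory are hypothetical placeholders — substitute the values for your environment; the rule, header, and query are taken from the examples in the steps above.

```sql
-- opennameparser_job.hql: the full sequence of steps as one runnable file.
-- Database, JAR path, and output directory below are example placeholders.

-- Step 1: switch to the required Hive database.
USE namedb;

-- Step 2: register the UNM module JAR (use your directory path and version).
ADD JAR /opt/sdk/unm.hive.3.0.jar;

-- Step 3: create an alias for the OpenNameParser UDF.
CREATE TEMPORARY FUNCTION opennameparser as 'com.pb.bdq.unm.process.hive.opennameparser.OpenNameParserUDF';

-- Step 4: reference data on HDFS, downloaded to the job's working directory
-- (one of the options described above).
set hivevar:refereceDataDetails='{"referenceDataPathLocation":"HDFS","dataDir":"./referenceData","dataDownloader":{"dataDownloader":"DC"}}';

-- Step 5: the parsing rule, in JSON format.
set hivevar:rule='{"name":"name", "culture":"", "splitConjoinedNames":false, "shortcutThreshold":0, "parseNaturalOrderPersonalNames":false, "naturalOrderPersonalNamesPriority":1, "parseReverseOrderPersonalNames":false, "reverseOrderPersonalNamesPriority":2, "parseConjoinedNames":false, "naturalOrderConjoinedPersonalNamesPriority":3, "reverseOrderConjoinedPersonalNamesPriority":4, "parseBusinessNames":false, "businessNamesPriority":5}';

-- Step 6: header fields of the input table, comma-separated.
set hivevar:header='inputrecordid,Name,nametype';

-- Step 7: run the job and write the output to a local directory.
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/opennameparser/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE
select adTable.adid["Name"], adTable.adid["NameScore"], adTable.adid["CultureCode"]
from (select opennameparser(${hivevar:rule}, ${hivevar:refereceDataDetails}, ${hivevar:header},
      inputrecordid, name, nametype) as tmp1 from nameparser) as tmp
LATERAL VIEW explode(tmp1) adTable AS adid;
```

Run the file in one go with hive -f opennameparser_job.hql (or beeline -f opennameparser_job.hql), as described by the single-file option at the top of this section.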