Using a Hive UDF of the Advanced Matching Module

To run a Hive UDF job, you can either run these steps individually in your Hive client within a single session, or compile all the required steps sequentially into an HQL file and run it in one go.
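The second option can be sketched as follows, assuming a configured Hive client is on your PATH; the JAR path is a placeholder, and the file contents are abridged from the steps below:

```shell
# Assemble the required steps into a single HQL file.
# The JAR directory path is a placeholder; use your actual SDK location.
cat > /tmp/amm_job.hql <<'EOF'
ADD JAR /path/to/amm.hive.${project.version}.jar;
CREATE TEMPORARY FUNCTION bestofbreed as 'com.pb.bdq.amm.process.hive.consolidation.bestofbreed.BestOfBreedUDAF';
set hive.map.aggr = false;
-- ...remaining set statements and the job query go here...
EOF

# Run the compiled file in one go (requires a configured Hive client):
# hive -f /tmp/amm_job.hql
```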

  1. In your Hive client, log in to the required Hive database.
  2. Register the JAR file of the Spectrum™ Data & Address Quality for Big Data SDK Advanced Matching Module (AMM).
    ADD JAR <Directory path>/amm.hive.${project.version}.jar;
  3. Create an alias for the Hive UDF of the Data Quality job you wish to run.
    Note: The string in quotes is the fully qualified class name required for this job to run.
    For example:
    CREATE TEMPORARY FUNCTION bestofbreed as 'com.pb.bdq.amm.process.hive.consolidation.bestofbreed.BestOfBreedUDAF';
  4. Switch off map-side aggregation between the Mapper and Reducer by setting the hive.map.aggr configuration property to false, as indicated in this example:
    set hive.map.aggr = false;
    Note: This configuration is required for all UDFs.
  5. Specify the configurations and other details for the job, and assign them to the respective variables or configuration properties.
    Note: The rule must be in JSON format.

    For example:

    set hivevar:rule='{"consolidationConditions": [{"consolidationRule":{"conditionClass":"simpleRule",
    "operation":"HIGHEST", "fieldName":"column2", "value":null, "valueFromField":false, "valueNumeric":true},
    "actions":[]}], "removeDuplicates":true}';
    Note: Use the configuration properties where indicated in the respective sample HQL files, for example pb.bdq.match.rule, pb.bdq.match.express.column, and pb.bdq.consolidation.sort.field.
  6. Specify the header fields of the input table in comma-separated format, and assign them to a variable or configuration property.
    set hivevar:header='column1,column2,column3,column4,column5,id';
    Note: Use the configuration property where indicated in the respective sample HQL files, for example pb.bdq.match.header, pb.bdq.consolidation.header, and so on.
  7. Set the sort field to the alias used in the query, using the hivevar:sortfield variable.
    set hivevar:sortfield='id';
  8. To run the job and display the job output on the console, write the query as indicated in this example:
    SELECT tmp2.record["column1"],
    	tmp2.record["column2"],
    	tmp2.record["column3"],
    	tmp2.record["column4"],
    	tmp2.record["column5"]
    FROM (
    	SELECT  filter (${hivevar:rule},
    			${hivevar:sortfield},
    			${hivevar:header},
    			innerRowID.column1,
    			innerRowID.column2,
    			innerRowID.column3,
    			innerRowID.column4,
    			innerRowID.column5,
    			innerRowID.id
    	) AS matchgroup  
    	FROM (
    		SELECT column1, column2, column3, column4, column5, rowid(*) 
    		AS id 
    		FROM data
    		) innerRowID 
    	GROUP BY column3 
    	) AS innerResult 
    LATERAL VIEW explode(innerResult.matchgroup) tmp2 AS record ;
    To run the job and write the job output to a designated directory, write the query as indicated in this example:
    INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/HiveUDF/filter/' 
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    COLLECTION ITEMS TERMINATED BY '||' MAP KEYS TERMINATED BY ':'
    SELECT tmp2.record["column1"],
    	tmp2.record["column2"],
    	tmp2.record["column3"],
    	tmp2.record["column4"],
    	tmp2.record["column5"]
    FROM (
    	SELECT  filter (innerRowID.column1,
    			innerRowID.column2,
    			innerRowID.column3,
    			innerRowID.column4,
    			innerRowID.column5,
    			innerRowID.id
    	) AS matchgroup  
    	FROM (
    		SELECT column1, column2, column3, column4, column5, rowid(*) 
    		AS id 
    		FROM data
    		) innerRowID 
    	GROUP BY column3 
    	) AS innerResult 
    LATERAL VIEW explode(innerResult.matchgroup) tmp2 AS record ;
    Note: In place of filter, use the alias you defined for the UDF in step 3.
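Putting the steps above together, a single HQL file for this consolidation job might look like the following sketch; the JAR path is a placeholder, and the rule, header, table, and column names are taken from the examples above:

```sql
-- Sketch assembled from steps 2 through 8; adjust paths and names to your environment.
ADD JAR /path/to/amm.hive.${project.version}.jar;
CREATE TEMPORARY FUNCTION bestofbreed as 'com.pb.bdq.amm.process.hive.consolidation.bestofbreed.BestOfBreedUDAF';
set hive.map.aggr = false;
set hivevar:rule='{"consolidationConditions": [{"consolidationRule":{"conditionClass":"simpleRule","operation":"HIGHEST","fieldName":"column2","value":null,"valueFromField":false,"valueNumeric":true},"actions":[]}],"removeDuplicates":true}';
set hivevar:header='column1,column2,column3,column4,column5,id';
set hivevar:sortfield='id';

SELECT tmp2.record["column1"],
       tmp2.record["column2"],
       tmp2.record["column3"],
       tmp2.record["column4"],
       tmp2.record["column5"]
FROM (
    SELECT bestofbreed(${hivevar:rule},
                       ${hivevar:sortfield},
                       ${hivevar:header},
                       innerRowID.column1,
                       innerRowID.column2,
                       innerRowID.column3,
                       innerRowID.column4,
                       innerRowID.column5,
                       innerRowID.id) AS matchgroup
    FROM (
        SELECT column1, column2, column3, column4, column5, rowid(*) AS id
        FROM data
    ) innerRowID
    GROUP BY column3
) AS innerResult
LATERAL VIEW explode(innerResult.matchgroup) tmp2 AS record;
```

Here the UDF is invoked by its alias from step 3 (bestofbreed) rather than the filter placeholder used in the sample queries.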