Data and Address Quality for Big Data SDK

Reference data strategy

You can manage reference data placed on the Hadoop Distributed File System (HDFS) in one of these ways while running jobs:
  • Download it to the current working directory: The reference data is downloaded to your working directory as temporary files. When a job completes, these files are deleted from the working directory, so the reference data must be downloaded again for each job.
  • Download it to a local path: The reference data is downloaded to a local path you specify (see the sketch below) and remains available to all jobs until the data is refreshed on HDFS.
For more information, see the section Using Reference Data in the user guide.
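In practice, the local-path option is configured through job properties before the job is built. The following Java sketch only illustrates the idea using a Hadoop Configuration; the property names shown are hypothetical placeholders, not the SDK's actual keys, so check the Using Reference Data section for the exact properties to set.

  import org.apache.hadoop.conf.Configuration;

  public class ReferenceDataConfigSketch {
      public static void main(String[] args) {
          Configuration conf = new Configuration();

          // Hypothetical property names, for illustration only; the real keys are
          // documented in the "Using Reference Data" section of the user guide.

          // HDFS location where the reference data is staged:
          conf.set("reference.data.hdfs.location", "hdfs:///pb/reference-data");

          // Local path on each node; data downloaded here is reused by subsequent
          // jobs until the copy on HDFS is refreshed:
          conf.set("reference.data.local.path", "/mnt/pb/reference-data");

          // Pass this Configuration to the job you build with the SDK's Java API.
      }
  }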

Reference data for UAM and GAM on HDFS

Reference data for jobs of these modules can now also be placed on HDFS and accessed in MapReduce and Spark jobs as well as in user-defined functions (UDFs).
  • Universal Addressing Module (except Validate Address Loqate)
  • Global Addressing Module

Silent extraction of reference data

You can now also extract and install reference data for the Universal Addressing Module through a silent script, silentInstalldb_unc.sh. The script accepts all of its arguments up front and extracts the databases on your machine without an interactive session.
Note: You can still use the interactive script, sh installdb_unc.sh, if needed.

Open Parser job

You can now use the Open Parser job of the Data Normalization Module to define a parsing grammar and apply it to parse your input data strings.

For more information about this job, see the Open Parser section under Data Normalization Module jobs.

Data Integration jobs

Spectrum™ Data and Address Quality for Big Data SDK now supports these two Data Integration Module jobs. You can create these jobs using the Java API with either MapReduce or Spark.
  1. Joiner job: Use this job to perform a SQL-style JOIN operation that combines records from multiple files (see the sketch after this list).
  2. Custom Groovy script job: Use this job to transform input fields based on a Groovy script you define.
    Note: You can also create and run this job with Hive UDFs.
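To illustrate the kind of SQL-style join the Joiner job performs, here is a minimal sketch using the plain Spark Java API rather than the SDK's Joiner classes; the file paths and the CustomerID join column are made up for the example.

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class JoinerSketch {
      public static void main(String[] args) {
          SparkSession spark = SparkSession.builder()
                  .appName("JoinerSketch")
                  .getOrCreate();

          // Read two input files (hypothetical paths and headers).
          Dataset<Row> customers = spark.read().option("header", "true")
                  .csv("hdfs:///data/customers.csv");
          Dataset<Row> addresses = spark.read().option("header", "true")
                  .csv("hdfs:///data/addresses.csv");

          // SQL-style inner join on a shared key column, combining records
          // from both files into a single output.
          Dataset<Row> joined = customers.join(addresses, "CustomerID");

          joined.write().option("header", "true").csv("hdfs:///data/joined-output");

          spark.stop();
      }
  }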

Global Address Validation job

To standardize and validate addresses outside the United States, you can now use the Global Address Validation job of the Global Addressing Module. You can create the job using the Java API with either MapReduce or Spark, or by using Hive UDFs.

Support for the S3 native filesystem

The Amazon S3 native filesystem (s3n) client is now available in Hadoop MapReduce and Spark jobs. You can store and access your input and output files on s3n. Provide the path in the required format as a parameter to the file path subclass and use it in your job.
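The s3n path format is the standard Hadoop URI scheme, s3n://<bucket>/<key>. The sketch below shows how such a path and its credentials are typically supplied through Hadoop's own APIs; the bucket name and credential values are placeholders, and the exact parameter the SDK's file path subclass expects is described in the SDK guide.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class S3nPathSketch {
      public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();

          // Credentials for the s3n connector (placeholder values).
          conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
          conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

          // An s3n URI; the same string form is what you pass as the input
          // or output path of your MapReduce or Spark job.
          Path input = new Path("s3n://my-bucket/input/addresses.csv");

          FileSystem fs = input.getFileSystem(conf);
          System.out.println("Input exists: " + fs.exists(input));
      }
  }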

Acushare service setup

You can now set up the Acushare service silently. To do so, copy the script from the Spectrum™ Data and Address Quality for Big Data SDK installation path to any location on the node, and provide the service installation path in the installer.properties file.

For more information, see the section Running Acushare service in the Data and Address Quality for Big Data SDK Guide.