Sample Configuration Files

The sample configuration XML files provide a simpler way of running the MapReduce and Spark jobs for various address and data quality activities. These files are intended for users who want to run the jobs without having to understand Java code. The files contain properties in the form of key-value pairs, which you can modify as needed.

You can run the required job using the command prompt (for Linux systems) or an SSH client, such as PuTTY (for Windows and Unix systems).
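For example, a job is typically submitted with the hadoop jar or spark-submit command, passing the configuration XML files as arguments. The sketch below is illustrative only: the jar name, driver class, and argument order are placeholders, not the product's documented invocation, so check the SDK documentation for the exact syntax.

  # Sketch only: jar, driver class, and argument order are assumptions.
  hadoop jar <Big Data Quality SDK jar> <job driver class> \
      <path to inputFileConfig.xml> <path to jobConfig.xml> \
      <path to mapReduceConfig.xml> <path to outputFileConfig.xml>

  spark-submit --class <job driver class> --master yarn <Big Data Quality SDK jar> \
      <path to inputFileConfig.xml> <path to jobConfig.xml> <path to outputFileConfig.xml>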

The sample configuration files are shipped as part of the Spectrum™ Technology Platform SDK, and you can access them at these locations after you install the SDK:

  • <Big Data Quality bundle>\samples\configuration\mr: For MR jobs
  • <Big Data Quality bundle>\samples\configuration\spark: For Spark jobs

File Types

Each of these folders contains the following types of configuration XML files, which hold the properties, in the form of parameters and values, needed to run the jobs. You can customize the values according to the requirements of the job you are running. A minimal sketch of one of these files appears after the list.
  • inputFileConfig.xml: Specifies the input file properties, such as the type of the input file, path where it's kept, record delimiters, field delimiters, text qualifiers, and the file header details.
  • <job>Config.xml (for example, addressValidationConfig): Specifies job-related properties, such as job type, job name, input options, and rule configuration or engine configuration.
  • mapReduceConfig.xml: Specifies the MapReduce configuration parameters. Use this file for customizing any of the MapReduce parameters, such as mapreduce.map.memory.mb, mapreduce.reduce.memory.mb and mapreduce.map.speculative, as needed for your job.
  • outputFileConfig.xml: Specifies the type of output file, its location, the field delimiter used in the file, whether a header file needs to be created, and whether report counters are printed to a file or to the console.
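
For illustration, a minimal sketch of a mapReduceConfig.xml follows. The element names assume the standard Hadoop <configuration>/<property> layout; the schema actually shipped with the SDK may differ, so treat this only as an example of the key-value style these files use.

  <!-- Sketch only: assumes Hadoop-style configuration markup; the shipped schema may differ. -->
  <configuration>
      <property>
          <name>mapreduce.map.memory.mb</name>
          <value>2048</value>
      </property>
      <property>
          <name>mapreduce.reduce.memory.mb</name>
          <value>4096</value>
      </property>
      <property>
          <name>mapreduce.map.speculative</name>
          <value>false</value>
      </property>
  </configuration>

The memory values shown here are placeholders; tune them to the capacity of your cluster and the needs of the job.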