Run Hadoop MapReduce Job
For a list of supported job configurations on both Windows and Linux platforms, see the table Supported job configurations.
Field | Description |
---|---|
Hadoop server | The list of configured Hadoop servers. For information about mapping HDFS file servers through Management Console, see the Administration Guide. |
Jar path | The path of the JAR file for the Hadoop MapReduce job to run. Note: The JAR file must be present at the external client location or on the Spectrum server. It must not be placed on the Hadoop cluster. |
Driver class | Select one of: |
Job type | Select one of: |
Spectrum jobs | Select the required job from the list of Spectrum Big Data Quality SDK jobs. On selecting the desired Spectrum job: |
Class name | The fully qualified name of the driver class of the job. |
Arguments | The space-separated list of arguments passed to the driver class at runtime to run the job. To run the Spectrum Big Data Quality SDK MapReduce jobs, pass the configuration files as a list of arguments. Each argument key accepts the path of a single configuration property file, where each file can contain multiple configuration properties. The syntax of the argument list is: [-config <Path to configuration file>] [-debug] [-input <Path to input configuration file>] [-conf <Path to MapReduce configuration file>] [-output <Path of output directory>]. For example, for a MapReduce MatchKeyGenerator job: -config /home/hadoop/matchkey/mkgConfig.xml -input /home/hadoop/matchkey/inputFileConfig.xml -conf /home/hadoop/matchkey/mapReduceConfig.xml -output /home/hadoop/matchkey/outputFileConfig.xml. Note: If the same configuration property key is specified both in the Arguments field and in the Properties grid, but each points to a different configuration file, the file indicated in the Properties grid takes precedence. Sample configuration properties ship with the Data and Address Quality for Big Data SDK at <Big Data Quality bundle>\samples\configuration. |
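The flag/value syntax above can be sketched as a small parser. This is an illustrative reconstruction, not the SDK's actual driver code; the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: collect the documented flag/value pairs, e.g.
// -config <file> -input <file> -conf <file> -output <dir>,
// plus the valueless -debug switch, into a map keyed by flag name.
public class JobArgs {
    public static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-debug")) {
                opts.put("debug", "true");              // switch with no value
            } else if (args[i].startsWith("-") && i + 1 < args.length) {
                opts.put(args[i].substring(1), args[++i]); // flag followed by a path
            }
        }
        return opts;
    }
}
```

A driver class could then look up each configuration file path by its flag name.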
General Tab
Field | Description | Requirement |
---|---|---|
Job name | The name of the Hadoop MapReduce job. | Required |
Input path | The path of the input file for the job. | Required |
Output path | The path of the output file for the job. | Required |
Overwrite output | Indicates whether the specified output path must be overwritten if it already exists. Note: If this check box is left unchecked and the configured output path exists at runtime, Hadoop throws an exception and the process flow is aborted. | Optional |
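The Overwrite output behavior described above (fail when the output path exists and overwrite is off) can be illustrated with a sketch. This uses a hypothetical helper against the local file system, not Hadoop's actual output-path check:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the documented semantics: if the output path already exists
// and "Overwrite output" is unchecked, the run fails up front, mirroring
// the exception Hadoop throws at runtime.
public class OutputPathCheck {
    public static void prepare(Path output, boolean overwrite) throws IOException {
        if (Files.exists(output)) {
            if (!overwrite) {
                throw new IOException("Output path already exists: " + output);
            }
            // Overwrite requested: remove the stale path (sketch handles a
            // single file or empty directory only).
            Files.delete(output);
        }
    }
}
```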
Mapper class | The fully qualified name of the class that handles the Mapper functionality for the job. | Required |
Reducer class | The fully qualified name of the class that handles the Reducer functionality for the job. | Optional |
Combiner class | The fully qualified name of the class that handles the Combiner functionality for the job. | Optional |
Partitioner class | The fully qualified name of the class that handles the Partitioner functionality for the job. | Optional |
Number of reducers | The number of reducers used to run the MapReduce job. | Optional |
Input format | The format of the input data. | Required |
Output format | The format of the output data. | Required |
Output key class | The datatype of the keys in the output key-value pairs. | Required |
Output value class | The datatype of the values in the output key-value pairs. | Required |
Properties Tab
To specify additional properties for running the job, use this tab to define as many property-value pairs as required. You can add properties directly in the grid, one at a time. Each configuration property file uses this XML format:
```xml
<configuration>
  <property>
    <name>key</name>
    <value>some_value</value>
    <description>A brief description of the purpose of the property key.</description>
  </property>
</configuration>
```
You can directly import the Hadoop property file mapred.xml, or create your own files using this XML format.
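As a sketch of consuming this format programmatically, each <property> element yields one name/value pair. This uses the JDK's built-in DOM parser; the class name is hypothetical and this is not the product's own loader:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: parse a Hadoop-style configuration property file into a map.
// The optional <description> element is ignored.
public class PropertyFile {
    public static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }
}
```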
- If the same property is defined here and in Management Console, the values defined here override the ones defined in Management Console.
- If the same property exists both in the grid and also in the imported property file, then the value imported from the file overwrites the value existing in the grid for the same property.
- You can import multiple property files one after the other, if required. The properties included in each imported file are added in the grid.
- Ensure the property file is present on the Spectrum™ Technology Platform server itself.
- The <description> tag is optional for each property key in a configuration property file.
- Reference data must be placed local to the data nodes to run the relevant jobs. The property for this is pb.bdq.reference.data.location, and it is available only for jobs that use reference data, such as Advanced Transformer, Validate Address Global, and Validate Address.
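The precedence rules above (grid values override Management Console; an imported file overrides the grid) amount to a last-writer-wins merge over the property sources. A sketch, with hypothetical class and method names:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the documented property precedence: later sources override
// earlier ones for the same key.
public class PropertyPrecedence {
    @SafeVarargs
    public static Map<String, String> resolve(Map<String, String>... sources) {
        // Pass sources in order: Management Console, then the grid,
        // then each imported property file.
        Map<String, String> effective = new HashMap<>();
        for (Map<String, String> source : sources) {
            effective.putAll(source); // last writer wins
        }
        return effective;
    }
}
```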
Dependencies Tab
In this tab, add the list of input files and Jar files required to run the job. Once the job runs, the reference files and reference Jar files added here are available in the job's distributed cache.
- Reference Files
- To add the various files required as input to run the job, click Add, go to the respective location on your local system or cluster, and select the particular file. To remove any file added to the list, select it and click Remove.
- Reference Jars
- To add the Jar files required to run the job, click Add, go to the respective location on your local system or cluster, and select the particular Jar file. To remove any file added to the list, select it and click Remove.