Submit Spark Job
The Submit Spark Job activity allows you to run any Spark job, either on a Hadoop cluster or on a Spark cluster. Using this activity, you can run a Spark job of the Spectrum™ Big Data Quality SDK or any external Spark job on either of these cluster types:
- YARN
- Spark
Deployment Modes
For a Spark job, you can use the cluster or client deployment mode. The deployment mode determines whether the Spark job driver class runs on the cluster or on the client, that is, the Spectrum™ Technology Platform server. The supported combinations are listed below; a minimal sketch of the equivalent stock Spark configuration follows the list.
- YARN Cluster mode
- YARN Client mode
- Spark Client mode
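For orientation, these options map onto standard Spark settings: the master determines where the job's executors run, and the deploy mode determines where the driver runs. The following minimal Java sketch shows the stock Spark API equivalent (SparkConf, setMaster, and the spark.submit.deployMode property); it is illustrative only and not Spectrum-specific.

```java
import org.apache.spark.SparkConf;

public class DeployModeExample {
    public static void main(String[] args) {
        // Stock Spark equivalent of the Master and Deploy-mode fields:
        // the master ("yarn" or "spark://<host>:<port>") selects the
        // cluster; the deploy mode ("cluster" or "client") selects
        // where the driver class runs.
        SparkConf conf = new SparkConf()
            .setAppName("DeployModeExample")
            .setMaster("yarn")                          // YARN master
            .set("spark.submit.deployMode", "cluster"); // YARN Cluster mode

        System.out.println(conf.get("spark.submit.deployMode"));
    }
}
```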
For a comprehensive list of supported job configurations on both Windows and Linux platforms, see the table Supported job configurations.
Field | Description |
---|---|
Job name | The name of the Spark job. |
Hadoop server | The list of configured Hadoop servers. For information about mapping HDFS file servers through Management Console, see the Administration Guide. |
Jar path | The path of the JAR file for the Spark job to run. Note: The Jar path must point to a directory on the Spectrum server machine. |
Job type | Select one of the available job types. |
Spectrum jobs | Select the required job from the list of Spectrum Big Data Quality SDK jobs. |
Class name | The fully qualified name of the driver class of the job. For a sketch of such a driver class, see the example after this table. |
Arguments | The space-separated list of arguments passed to the driver class at runtime to run the job. To run Spectrum Big Data Quality SDK Spark jobs, pass the configuration files as a list of arguments. Each argument key accepts the path of a single configuration property file, and each file can contain multiple configuration properties. The syntax of the argument list for configuration properties is: [-config <Path to configuration file>] [-debug] [-input <Path to input configuration file>] [-conf <Path to Spark configuration file>] [-output <Path of output directory>]. For example, for a Spark MatchKeyGenerator job: -config /home/hadoop/spark/matchkey/matchKeyGeneratorConfig.xml -input /home/hadoop/spark/matchkey/inputFileConfig.xml -output /home/hadoop/spark/matchkey/outputFileConfig.xml. Note: If the same configuration property key is specified both in the Arguments field and in the Properties grid, but each points to a different configuration file, the file indicated in the Properties grid takes precedence. Sample configuration properties are shipped with the Data and Address Quality for Big Data SDK at <Big Data Quality bundle>\samples\configuration. |
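The Class name and Arguments fields together identify an ordinary Spark entry point and its command line. The following Java sketch is a hypothetical driver, not the Spectrum SDK's actual implementation: the package, class, and processing logic are made up to show how the fully qualified name (here com.example.spark.SampleDriver) would go in Class name, and how the space-separated Arguments arrive in main().

```java
package com.example.spark;  // hypothetical package

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical driver class; its fully qualified name,
// com.example.spark.SampleDriver, is what the Class name field expects.
public class SampleDriver {
    public static void main(String[] args) {
        String config = null, input = null, output = null;

        // The Arguments field is delivered here as a space-separated
        // argument list, e.g.:
        //   -config <file> -input <file> -output <dir>
        for (int i = 0; i < args.length - 1; i++) {
            switch (args[i]) {
                case "-config": config = args[++i]; break;
                case "-input":  input  = args[++i]; break;
                case "-output": output = args[++i]; break;
            }
        }

        SparkConf conf = new SparkConf().setAppName("SampleDriver");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder job: read the input path and write results to
            // the output directory. A real SDK job would instead parse
            // the XML configuration files named by -config and -input.
            sc.textFile(input).saveAsTextFile(output);
        }
    }
}
```

Note that the driver only receives plain strings; it is up to the job itself to open and parse the configuration files those arguments point to.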
General Properties
Field | Description |
---|---|
Master | Select the master using which the Spark job is to run: YARN or Spark. |
Spark URL | The URL to access the Spark cluster, in the format <hostname of Spark cluster>:<port of Spark cluster>, for example, sparkserver:7077. This field becomes visible if you select Spark in the Master field. |
Deploy-mode | Select the deploy mode, Cluster or Client, as described in Deployment Modes above. |
Properties | In the grid, under the Property column enter the names of the properties, and under the Value column enter the values of the corresponding properties. Certain properties are mandatory, depending on the type of Master and Deploy-mode. Note: You can define the mandatory properties either while creating the connection in Management Console, or in this Spark activity. If the same properties are defined both in Management Console and in the Spark Job activity, the values assigned in the Spark activity apply. In addition to the mandatory properties, you can enter or import as many more properties as needed to run the job. For examples of typical Spark property keys, see the sketch after this table. |
Import | To import properties from a file, click Import, go to the location of the respective property file, and select the file in XML format. The properties contained in the imported file are copied into the Properties grid. |
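The Property/Value pairs in the grid correspond to ordinary Spark configuration properties. The Java sketch below shows the programmatic equivalent using Spark's SparkConf API; the keys shown are common, standard Spark settings picked for illustration, not the mandatory set for any particular Master and Deploy-mode combination.

```java
import org.apache.spark.SparkConf;

public class PropertiesExample {
    public static void main(String[] args) {
        // Each grid row (Property, Value) behaves like one .set() call.
        // These keys are standard Spark settings chosen as examples; the
        // mandatory properties depend on the selected Master and
        // Deploy-mode.
        SparkConf conf = new SparkConf()
            .setAppName("PropertiesExample")
            .set("spark.driver.memory",   "2g")   // driver JVM heap
            .set("spark.executor.memory", "4g")   // heap per executor
            .set("spark.executor.cores",  "2");   // cores per executor

        System.out.println(conf.toDebugString()); // dump resolved settings
    }
}
```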
Dependencies
In this tab, add the list of input files and JAR files required to run the job. Once the job runs, the reference files and reference JAR files added here are available in the distributed cache of the job (see the sketch at the end of this section).
- Reference Files
- To add a file required as input to run the job, click Add, go to its location on your local system or cluster, and select the file.
To remove a file from the list, select it and click Remove.
- Reference Jars
- To add a JAR file required to run the job, click Add, go to its location on your local system or cluster, and select the JAR file.
To remove a file from the list, select it and click Remove.
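Because reference files land in the job's distributed cache, job code can resolve them by file name at run time. The sketch below is a hedged illustration using Spark's SparkFiles API, which resolves files shipped with a job (as spark-submit --files does); whether this activity distributes reference files through that exact mechanism is an assumption, and lookup-table.csv is a made-up file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.spark.SparkFiles;

public class ReferenceFileExample {
    // Resolve a reference file from the distributed cache and read it.
    // SparkFiles.get() returns the local path of a file shipped with
    // the job; "lookup-table.csv" is a hypothetical file name.
    public static void printReferenceFile() throws IOException {
        String localPath = SparkFiles.get("lookup-table.csv");
        for (String line : Files.readAllLines(Paths.get(localPath))) {
            System.out.println(line); // process each line as needed
        }
    }
}
```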