Submit Spark Job
The Submit Spark Job activity allows you to run any Spark job, either on a Hadoop cluster or on a Spark cluster. Using this activity, you can run a Spark job of the Spectrum™ Big Data Quality SDK or any external Spark job on either of these cluster types:
- YARN
- Spark
Deployment Modes
For a Spark job, you can use the cluster or client deployment mode. The deployment mode determines whether the Spark job driver class runs on the cluster or on the client, that is, the Spectrum™ Technology Platform server. The supported combinations are listed below; a minimal sketch of the equivalent stock Spark configuration follows the list.
- YARN Cluster mode
- YARN Client mode
- Spark Client mode
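For orientation, these options map onto standard Spark settings: the master determines where the job's executors run, and the deploy mode determines where the driver runs. The following minimal Java sketch shows the stock Spark API equivalent (SparkConf, setMaster, and the spark.submit.deployMode property); it is illustrative only and not Spectrum-specific.

```java
import org.apache.spark.SparkConf;

public class DeployModeExample {
    public static void main(String[] args) {
        // Stock Spark equivalent of the Master and Deploy-mode fields:
        // the master ("yarn" or "spark://<host>:<port>") selects the
        // cluster; the deploy mode ("cluster" or "client") selects
        // where the driver class runs.
        SparkConf conf = new SparkConf()
            .setAppName("DeployModeExample")
            .setMaster("yarn")                          // YARN master
            .set("spark.submit.deployMode", "cluster"); // YARN Cluster mode

        System.out.println(conf.get("spark.submit.deployMode"));
    }
}
```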
For a comprehensive list of supported job configurations on both Windows and Linux platforms, see the table Supported job configurations.
Field | Description |
---|---|
Job name | The name of the Spark job. |
Hadoop server | The list of configured Hadoop servers. For information about mapping HDFS file servers through Management Console, see the Administration Guide. |
Jar path | The path of the JAR file for the Spark job to run. Note: The Jar path must point to a directory on the Spectrum server machine. |
Job type | Select one of the available job types. |
Spectrum jobs | Select the required job from the list of Spectrum Big Data Quality SDK jobs. |
Class name | The fully qualified name of the driver class of the job. For a sketch of such a driver class, see the example after this table. |
Arguments | The space-separated list of arguments passed to the driver class at runtime to run the job. To run Spectrum Big Data Quality SDK Spark jobs, pass the configuration files as a list of arguments. Each argument key accepts the path of a single configuration property file, and each file can contain multiple configuration properties. The syntax of the argument list for configuration properties is: [-config <Path to configuration file>] [-debug] [-input <Path to input configuration file>] [-conf <Path to Spark configuration file>] [-output <Path of output directory>]. For example, for a Spark MatchKeyGenerator job: -config /home/hadoop/spark/matchkey/matchKeyGeneratorConfig.xml -input /home/hadoop/spark/matchkey/inputFileConfig.xml -output /home/hadoop/spark/matchkey/outputFileConfig.xml. Note: If the same configuration property key is specified both in the Arguments field and in the Properties grid, but each points to a different configuration file, the file indicated in the Properties grid takes precedence. Sample configuration properties are shipped with the Data and Address Quality for Big Data SDK at <Big Data Quality bundle>\samples\configuration. |
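The Class name and Arguments fields together identify an ordinary Spark entry point and its command line. The following Java sketch is a hypothetical driver, not the Spectrum SDK's actual implementation: the package, class, and processing logic are made up to show how the fully qualified name (here com.example.spark.SampleDriver) would go in Class name, and how the space-separated Arguments arrive in main().

```java
package com.example.spark;  // hypothetical package

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical driver class; its fully qualified name,
// com.example.spark.SampleDriver, is what the Class name field expects.
public class SampleDriver {
    public static void main(String[] args) {
        String config = null, input = null, output = null;

        // The Arguments field is delivered here as a space-separated
        // argument list, e.g.:
        //   -config <file> -input <file> -output <dir>
        for (int i = 0; i < args.length - 1; i++) {
            switch (args[i]) {
                case "-config": config = args[++i]; break;
                case "-input":  input  = args[++i]; break;
                case "-output": output = args[++i]; break;
            }
        }

        SparkConf conf = new SparkConf().setAppName("SampleDriver");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder job: read the input path and write results to
            // the output directory. A real SDK job would instead parse
            // the XML configuration files named by -config and -input.
            sc.textFile(input).saveAsTextFile(output);
        }
    }
}
```

Note that the driver only receives plain strings; it is up to the job itself to open and parse the configuration files those arguments point to.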
General Properties
Field | Description |
---|---|
Master | Select the master using which the Spark job is to run: YARN or Spark. |
Spark URL | The URL to access the Spark cluster, in the format <hostname of Spark cluster>:<port of Spark cluster>, for example, sparkserver:7077. This field becomes visible if you select Spark in the Master field. |
Deploy-mode | Select the deploy mode, Cluster or Client, as described in Deployment Modes above. |
Properties | In the grid, under the Property column enter the names of the properties, and under the Value column enter the values of the corresponding properties. Certain properties are mandatory, depending on the type of Master and Deploy-mode. Note: You can define the mandatory properties either while creating the connection in Management Console, or in this Spark activity. If the same properties are defined both in Management Console and in the Spark Job activity, the values assigned in the Spark activity apply. In addition to the mandatory properties, you can enter or import as many more properties as needed to run the job. For examples of typical Spark property keys, see the sketch after this table. |
Import | To import properties from a file, click Import, go to the location of the respective property file, and select the file in XML format. The properties contained in the imported file are copied into the Properties grid. |
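The Property/Value pairs in the grid correspond to ordinary Spark configuration properties. The Java sketch below shows the programmatic equivalent using Spark's SparkConf API; the keys shown are common, standard Spark settings picked for illustration, not the mandatory set for any particular Master and Deploy-mode combination.

```java
import org.apache.spark.SparkConf;

public class PropertiesExample {
    public static void main(String[] args) {
        // Each grid row (Property, Value) behaves like one .set() call.
        // These keys are standard Spark settings chosen as examples; the
        // mandatory properties depend on the selected Master and
        // Deploy-mode.
        SparkConf conf = new SparkConf()
            .setAppName("PropertiesExample")
            .set("spark.driver.memory",   "2g")   // driver JVM heap
            .set("spark.executor.memory", "4g")   // heap per executor
            .set("spark.executor.cores",  "2");   // cores per executor

        System.out.println(conf.toDebugString()); // dump resolved settings
    }
}
```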
Dependencies
In this tab, add the list of input files and JAR files required to run the job. Once the job runs, the reference files and reference JAR files added here are available in the distributed cache of the job (see the sketch at the end of this section).
- Reference Files
- To add a file required as input to run the job, click Add, go to its location on your local system or cluster, and select the file.
To remove a file from the list, select it and click Remove.
- Reference Jars
- To add a JAR file required to run the job, click Add, go to its location on your local system or cluster, and select the JAR file.
To remove a file from the list, select it and click Remove.
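Because reference files land in the job's distributed cache, job code can resolve them by file name at run time. The sketch below is a hedged illustration using Spark's SparkFiles API, which resolves files shipped with a job (as spark-submit --files does); whether this activity distributes reference files through that exact mechanism is an assumption, and lookup-table.csv is a made-up file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.spark.SparkFiles;

public class ReferenceFileExample {
    // Resolve a reference file from the distributed cache and read it.
    // SparkFiles.get() returns the local path of a file shipped with
    // the job; "lookup-table.csv" is a hypothetical file name.
    public static void printReferenceFile() throws IOException {
        String localPath = SparkFiles.get("lookup-table.csv");
        for (String line : Files.readAllLines(Paths.get(localPath))) {
            System.out.println(line); // process each line as needed
        }
    }
}
```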