Submit Spark Job

The Submit Spark Job activity allows you to run any Spark job, either on a Hadoop cluster or on a Spark cluster. Using this activity, you can run either a Spectrum™ Big Data Quality SDK Spark job or any external Spark job.

Currently, you can submit a Spark job to either of two cluster types:
  • YARN
  • Spark

Deployment Modes

A Spark job can use either the cluster or the client deployment mode. The deployment mode determines whether the Spark job driver class runs on the cluster or on the client Spectrum™ Technology Platform server.

In summary, you can run a Spark job in any one of these deployment modes:
  1. YARN Cluster mode
  2. YARN Client mode
  3. Spark Client mode
Attention: YARN and Spark client modes are recommended only when the Spectrum™ Technology Platform server is installed and run from within the cluster environment.
Field Description
Job name: The name of the Spark job.
Hadoop server: The list of configured Hadoop servers.

For information about mapping HDFS file servers through Management Console, see the Administration Guide.

Jar path: The path of the relevant JAR file for the Spark job to run.
Note: The Jar path must point to a directory on the Spectrum server machine.
Job type: Select one of:
  • Spectrum: To run any one of the Spectrum™ Big Data Quality SDK jobs. On selecting Spectrum, the Spectrum jobs field is displayed.
  • Generic: To run any external job and specify additional properties for it.
Spectrum jobs: Select the required job from the list of Spectrum™ Big Data Quality SDK jobs. The list includes these jobs:
  • Address Validation
  • Advanced Transformer
  • Best of Breed
  • Duplicate Synchronization
  • Filter
  • Groovy
  • Intraflow Match
  • Interflow Match
  • Joiner
  • Match Key Generator
  • Open Name Parser
  • Open Parser
  • Table Lookup
  • Transactional Match
  • Validate Address
  • Validate Address Global
On selecting the desired Spectrum job:
  1. The fields Job name, Class name, and Arguments are auto-populated.

    All the auto-populated fields can be edited as required, except Class name.

    Important: Do not edit the auto-populated Class name of the selected Spectrum job; otherwise, the job cannot run.
  2. The Properties grid is auto-populated with the required configuration properties of the selected Spectrum job, with their default values.

    You can add or import more properties as well as modify the auto-populated properties, as required.

Class name: The fully qualified name of the driver class of the job.
Arguments: The space-separated list of arguments that are passed to the driver class at runtime to run the job.

For example,

23Dec2016 /home/Hadoop/EYInc.txt
  1. You can pass as arguments any variables that are defined to accept runtime values, either in the source stage or in the current stage of the process flow.

    For example, if the variable SalesStartRange is defined in the output of the previous stage of the process flow, you can include it in this argument list as ${SalesStartRange} along with the other required arguments, as illustrated:

    23Dec2016 /home/Hadoop/EYInc.txt ${SalesStartRange}
  2. If an argument contains a space, enclose it in double quotes. For example, "/home/Hadoop/Sales Records". (See the combined example after this list.)
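
Combining these two conventions, an Arguments value with a quoted path and a runtime variable might look like this (the file path is hypothetical):

23Dec2016 "/home/Hadoop/Sales Records/EYInc.txt" ${SalesStartRange}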

Spectrum Big Data Quality SDK Jobs - Arguments:

To run the Spectrum™ Big Data Quality SDK Spark jobs, pass the required configuration files as a list of arguments. Each argument key accepts the path of a single configuration property file, and each file can contain multiple configuration properties.

The syntax of the argument list for configuration properties is:

[-config <Path to configuration file>] [-debug] [-input <Path to input configuration file>] [-conf <Path to Spark configuration file>] [-output <Path of output directory>]

For example, for a Spark MatchKeyGenerator job:

-config /home/hadoop/spark/matchkey/matchKeyGeneratorConfig.xml -input /home/hadoop/spark/matchkey/inputFileConfig.xml -output /home/hadoop/spark/matchkey/outputFileConfig.xml
Note: If the same configuration property key is specified both in the Arguments field and in the Properties grid, but each points to a different configuration file, the file indicated in the Properties grid takes precedence for this property.

Sample configuration property files are shipped with the Data and Address Quality for Big Data SDK and are placed at the location <Big Data Quality bundle>\samples\configuration.
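
A fuller invocation that also passes a Spark configuration file and enables debug output might look like this (all the file paths are hypothetical):

-config /home/hadoop/spark/matchkey/matchKeyGeneratorConfig.xml -debug -input /home/hadoop/spark/matchkey/inputFileConfig.xml -conf /home/hadoop/spark/matchkey/sparkConfig.xml -output /home/hadoop/spark/matchkey/outputFileConfig.xml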

General Properties

Field Description
Master: Select one of the options with which to launch and manage the Spark job:
  • YARN: To launch and manage the Spark job using YARN.
  • Spark: To launch and manage the Spark job using a standalone Spark cluster.
Spark URL: The URL to access the Spark cluster, in the format <hostname of Spark cluster>:<port of Spark cluster>, for example, sparkmaster.example.com:7077 (7077 is the default port of a standalone Spark master).

This field becomes visible only if you select Spark in the Master field.

Deploy-mode: Select one of the options:
  • Client: To run the Spark job driver on the client Spectrum™ Technology Platform server.
  • Cluster: To run the Spark job driver on the cluster.
Properties: In the grid, enter the names of the properties under the Property column and the values of the corresponding properties under the Value column.

Certain properties are mandatory, depending on the selected Master and Deploy-mode.

YARN Mandatory Properties:
yarn.resourcemanager.hostname: The IP address of the YARN ResourceManager.
yarn.resourcemanager.address: The address, including the IP address and port, of the YARN ResourceManager, in the format <hostname>:<port>.
Client Deploy Mode Properties:
spark.driver.host (required): The IP address of the machine on which the Spark driver is to run.
spark.client.mode.temp.location (optional): The path of the temp folder on the Spectrum™ Technology Platform server to be used for the Universal Addressing jobs:
  • Validate Address
  • Validate Address Global
Note: We strongly recommend using this property for the Universal Addressing jobs to ensure the specified temp folder is used for intermediate results.
Thus:
  1. For YARN cluster mode, the two YARN properties are mandatory.
  2. For YARN client mode, all three properties (the two YARN properties and spark.driver.host) are mandatory.
  3. For Spark client mode, spark.driver.host is mandatory.
Note: You can define the above mandatory properties either while creating the connection in Management Console or in this Spark activity. If the same properties are defined both in Management Console and in the Spark Job activity, the values assigned in the Spark activity take precedence.
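
For example, a property file covering all three mandatory properties for YARN client mode might look like this, using the XML format described under Import below (the IP addresses are hypothetical; 8032 is the default YARN ResourceManager port):

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>10.20.30.40</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>10.20.30.40:8032</value>
    </property>
    <property>
        <name>spark.driver.host</name>
        <value>10.20.30.50</value>
    </property>
</configuration>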
In addition to these mandatory properties, you can enter or import as many more properties as needed to run the job.
Import: To import properties from a file, click Import. Browse to the location of the respective property file and select the file, which must be in XML format. The properties contained in the imported file are copied into the Properties grid.
Note:
  1. If the same property is defined here and in Management Console, the values defined here override the ones defined in Management Console.
  2. The property file must be in XML format and must follow the syntax:
    <configuration>
        <property>
            <name>key</name>
            <value>some_value</value>
            <description>A brief description of the 
              purpose of the property key.</description>
        </property>
    </configuration>
    Create your own property files using the above XML format.
  3. If the same property exists both in the grid and in the imported property file, the value imported from the file overwrites the value existing in the grid.
  4. You can import multiple property files one after the other, if required. The properties included in each imported file are added to the grid.
  5. Ensure the property file is present on the Spectrum™ Technology Platform server itself.
  6. The <description> tag is optional for each property key in a configuration property file.
  7. To run jobs that use reference data, such as Advanced Transformer, Validate Address Global, and Validate Address, the reference data must be placed local to the data nodes. The property pb.bdq.reference.data.location specifies this location and is applicable only to such jobs. (See the example after this note.)
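
For instance, a property file entry that points these jobs to reference data on the data nodes might look like this (the path is hypothetical):

<configuration>
    <property>
        <name>pb.bdq.reference.data.location</name>
        <value>/home/hadoop/referencedata</value>
    </property>
</configuration>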

Dependencies

In this tab, add the list of the input files and Jar files required to run the job.

Once the job is run, the reference files and reference Jar files added here are available in the job's distributed cache.

Reference Files
To add the various files required as input to run the job, click Add, browse to the respective location on your local system or on the cluster, and select the particular file.

To remove any file added in the list, select the particular file and click Remove.

Reference Jars
To add the Jar files required to run the job, click Add, browse to the respective location on your local system or on the cluster, and select the particular Jar file.

To remove any file added in the list, select the particular file and click Remove.

Note: The Jar path must point to a directory on the Spectrum server machine.