Executing the Job
To run the Spark job, use the spark-submit script in Spark's bin directory. Make sure to use the Spark2 jar that matches the Spark and Scala versions of your installed distribution.
Driver Class:
com.precisely.bigdata.addressing.spark.app.AddressingDriver
Scala 2.12:
/precisely/addressing/software/spark2/sdk/lib/spectrum-bigdata-addressing-spark2_2.12-sdk_version-all.jar
Scala 2.11:
/precisely/addressing/software/spark2/sdk/lib/spectrum-bigdata-addressing-spark2_2.11-sdk_version-all.jar
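If you are not sure which Scala version your Spark installation was built with, spark-submit --version prints it as part of its startup banner; the exact version numbers below are illustrative:

spark-submit --version
# The banner includes a line such as:
#   Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_252
# Choose the matching 2.11 or 2.12 jar from the paths above.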
For example:
spark-submit \
  --class com.precisely.bigdata.addressing.spark.app.AddressingDriver \
  --master yarn --deploy-mode cluster \
  /precisely/addressing/software/spark2/sdk/lib/spectrum-bigdata-addressing-spark2_scala_version-sdk_version-all.jar \
  --operation geocode \
  --resources-location hdfs:///precisely/addressing/software/resources/ \
  --data-location hdfs:///precisely/geo_addr/data/ \
  --download-location /precisely/downloads \
  --preferences-filepath hdfs:///precisely/addressing/software/resources/config/preferences.yaml \
  --input /user/sdkuser/customers/addresses.csv \
  --input-format=csv \
  --csv header=false \
  --output /user/sdkuser/customers_addresses \
  --output-format=parquet \
  --parquet compression=gzip \
  --input-fields addressLines[0]=0 addressLines[1]=1 \
  --output-fields address.formattedStreetAddress address.formattedLocationAddress location.feature.geometry.coordinates.x location.feature.geometry.coordinates.y \
  --combine \
  --limit 20
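After the job completes, a quick way to confirm it wrote output is to list the output directory. This sketch assumes the example paths above and an HDFS client on your path:

# List the job output; with --combine, expect a single data file
# (plus the _SUCCESS marker Spark writes on successful completion).
hdfs dfs -ls /user/sdkuser/customers_addresses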
Job Parameters
All parameters are declared with a double dash. Required parameters are shown in bold.
**--input**
The location of the input file.
Example: --input /user/sdkuser/customers/addresses.csv

**--output**
The location of the output directory. The output includes all input columns along with the fields requested in the --output-fields parameter.
Example: --output /user/sdkuser/customers_geocoded

**--output-fields**
The requested fields to be included in the output. Separate multiple output field expressions with a space and surround each individual expression with double quotes. For more information, see Output Fields.
Example: --output-fields "location.feature.geometry.coordinates.x as x" "location.feature.geometry.coordinates.y as y"
--error-field
Add the error field to your output to see any error information.
Example: --error-field error

--json-output-field
Add the JSON output field to your output to see the JSON response.
Example: --json-output-field jsonOutput
**--resources-location**
Location of the resources directory, which contains the configurations and libraries. If using a remote path, e.g. HDFS or S3, then set --download-location.
Example: --resources-location hdfs:///precisely/addressing/software/resources/

**--data-location**
File path(s) to one or more geocoding datasets. A path may be a single dataset (extracted or an unextracted SPD) or a directory of datasets. Separate multiple paths with a space. If using a remote path, e.g. HDFS or S3, then you must set --download-location.
Example: --data-location hdfs:///precisely/geo_addr/data/
**--operation**
The operation to be performed, for example geocode or verify.
Example: --operation verify

--preferences-filepath
File path of the addressing preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder. If using a remote path, e.g. HDFS or S3, then set --download-location.
Example: --preferences-filepath hdfs:///precisely/addressing/software/resources/config/preferences.yaml

**--input-fields**
Input fields as address field mappings, using mixed or camelCase form. For more information, see Input Fields.
Example: --input-fields addressLines[0]=0 addressLines[1]=1
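As an illustration only, the mapping below assumes a headerless CSV whose columns are street address, city, state, and postal code. The addressLines mapping comes from the example job above; the city, admin1, and postalCode field names are assumptions, so verify them against Input Fields for your release.

# Hypothetical mapping for a headerless CSV laid out as:
#   street, city, state, postal code
# Field names city, admin1, and postalCode are assumptions -- see Input Fields.
--input-fields addressLines[0]=0 city=1 admin1=2 postalCode=3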
--download-location
Location of the directory where reference data will be downloaded. This path must exist on every data node.
Note: This parameter is required if the reference data is distributed remotely via HDFS or S3.
Example: --download-location /precisely/downloads

--download-group
This property is only used for POSIX-compliant platforms like Linux. It specifies the operating system group that should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster, and the operating system user executing the Hadoop service should be a member of it. For more information, see Download Permissions.
Note: Use only if reference data is distributed remotely via HDFS or S3.
Example: --download-group dm_users

--extraction-location
File path to where the geocoding datasets will be extracted. If not specified, the default location is the same directory as the SPD.
Example: --extraction-location /precisely/geo_addr/data/extractionDirectory

--country
If your input data does not have country information, you can specify the country as a parameter. Alternatively, you can use a column reference in --input-fields.
Example: --country USA

--overwrite
Including this parameter tells the job to overwrite the output directory; otherwise the job fails if the directory already has content. This parameter does not take a value.
Example: --overwrite

--num-partitions
The minimum number of partitions used to split up the input file.
Example: --num-partitions=15
--combine
Including this parameter tells the job to combine all output files into a single output file; otherwise the job creates multiple output files, and how many depends on the number of partitions specified.
Note: Using this parameter may increase your job's execution time, since the entire output dataset must be collected on a single node. As the size of the data to be combined grows, especially past the space available on a single node, errors become more likely.
Example: --combine

--input-format
The input format. Valid values: csv or parquet. If not specified, the default is csv.
Example: --input-format=parquet

--output-format
The output format. Valid values: csv or parquet. If not specified, the default is the input format.
Example: --output-format=csv

--csv
Specify the options to be used when reading and writing CSV input and output files. Common options include header (default false); see the sketch after this entry.
Example: --csv header=false
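For instance, a pipe-delimited file with a header row might be read as sketched below. The header option appears in the example job above; the delimiter option name mirrors Spark's CSV reader and is an assumption here, so confirm it for your SDK version.

# Hypothetical CSV options: header row present, pipe-delimited.
# header is taken from the example job; delimiter is assumed from Spark's CSV options.
--csv header=true delimiter='|'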
--parquet
Specify the options to be used when reading and writing Parquet input and output files.
Example: --parquet compression=gzip

--limit
The maximum number of records to be processed in the job.
Example: --limit 5000
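Putting the required parameters together, here is a minimal sketch of a verify run. It reuses the jar, resource, and data paths from the geocode example above; the output directory /user/sdkuser/customers_verified is made up for this sketch, and the output fields are borrowed from the earlier example.

spark-submit \
  --class com.precisely.bigdata.addressing.spark.app.AddressingDriver \
  --master yarn --deploy-mode cluster \
  /precisely/addressing/software/spark2/sdk/lib/spectrum-bigdata-addressing-spark2_2.12-sdk_version-all.jar \
  --operation verify \
  --resources-location hdfs:///precisely/addressing/software/resources/ \
  --data-location hdfs:///precisely/geo_addr/data/ \
  --download-location /precisely/downloads \
  --input /user/sdkuser/customers/addresses.csv \
  --input-fields addressLines[0]=0 addressLines[1]=1 \
  --output /user/sdkuser/customers_verified \
  --output-fields "address.formattedStreetAddress" "address.formattedLocationAddress"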