Executing the Job

To run the Spark job, use the spark-submit script in Spark’s bin directory. Make sure to use the Spark2 jar that matches your installed distribution of Spark and Scala.

Driver Class:

com.precisely.bigdata.addressing.spark.app.AddressingDriver

Scala 2.12:

/precisely/addressing/software/spark2/sdk/lib/spectrum-bigdata-addressing-spark2_2.12-sdk_version-all.jar

Scala 2.11:

/precisely/addressing/software/spark2/sdk/lib/spectrum-bigdata-addressing-spark2_2.11-sdk_version-all.jar
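
If you are not sure which Scala build your Spark distribution uses, spark-submit can report it; choose the matching _2.11 or _2.12 jar accordingly.

# Prints the Spark version along with the Scala version it was built against
spark-submit --version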

For example:

spark-submit \
  --class com.precisely.bigdata.addressing.spark.app.AddressingDriver \
  --master yarn --deploy-mode cluster \
  /precisely/addressing/software/spark2/sdk/lib/spectrum-bigdata-addressing-spark2_scala_version-sdk_version-all.jar \
  --operation geocode \
  --resources-location hdfs:///precisely/addressing/software/resources/ \
  --data-location hdfs:///precisely/geo_addr/data/ \
  --download-location /precisely/downloads \
  --preferences-filepath hdfs:///precisely/addressing/software/resources/config/preferences.yaml \
  --input /user/sdkuser/customers/addresses.csv \
  --input-format=csv \
  --csv header=false \
  --output /user/sdkuser/customers_addresses \
  --output-format=parquet \
  --parquet compression=gzip \
  --input-fields addressLines[0]=0 addressLines[1]=1 \
  --output-fields address.formattedStreetAddress address.formattedLocationAddress \
    location.feature.geometry.coordinates.x location.feature.geometry.coordinates.y \
  --combine \
  --limit 20
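
After the job finishes, the output directory normally contains a _SUCCESS marker and one or more part files (a single file when --combine is used). A quick way to check the run, assuming the output path and YARN cluster deployment from the example above (the application ID placeholder is illustrative):

# List the job output
hdfs dfs -ls /user/sdkuser/customers_addresses

# Review the driver log for a cluster-mode run
yarn logs -applicationId <application_id>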

Job Parameters

All parameters are declared with a double dash. Required parameters are shown in bold.

--input

The location of the input file.

Example: --input /user/sdkuser/customers/addresses.csv

--output

The location of the output directory. The output includes all input columns, along with the fields requested in the --output-fields parameter and any error information.

Example: --output /user/sdkuser/customers_geocoded

--output-fields

The fields to include in the output. Separate multiple output field expressions with spaces, and surround each individual expression with double quotes. For more information, see Output Fields.

Example: --output-fields "location.feature.geometry.coordinates.x as x" "location.feature.geometry.coordinates.y as y"

--error-field

Add the error field to your output to see any error information.

Example: --error-field error

--json-output-field

Add the JSON output field to your output to see the JSON response.

Example: --json-output-field jsonOutput

--resources-location

Location of the resources directory, which contains the configurations and libraries.

If using a remote path (for example, HDFS or S3), you must also set --download-location. Local paths must be present on all nodes where tasks will run.

Example: --resources-location hdfs:///precisely/addressing/software/resources/

--data-location

File path(s) to one or more geocoding datasets. A path may be a single dataset (extracted, or an unextracted SPD) or a directory of datasets. Separate multiple paths with spaces.

If using a remote path (for example, HDFS or S3), you must also set --download-location. Local paths must be present on all nodes where tasks will run.

Example: --data-location hdfs:///precisely/geo_addr/data/

--operation

The operation to be performed. One of the following:

  • verify
  • geocode
  • reverseGeocode
  • lookup

Example: --operation verify

--preferences-filepath

File path of the addressing preferences file. This optional file can be edited by advanced users to change the behavior of the geocoder.

If using a remote path (for example, HDFS or S3), you must also set --download-location. Local paths must be present on all nodes where tasks will run.

Example: --preferences-filepath hdfs:///precisely/addressing/software/resources/config/preferences.yaml

--input-fields

Input fields as address field mappings, using mixed or camelCase form.

For more information, see Input Fields.

  • Specifying individual address fields by input column index:

    --input-fields street=0 city=1 admin1=2 postalCode=3

  • Using column names from the input CSV file (requires a header in the CSV file and setting --csv header=true):

    --input-fields street=street city=city admin1=state postalCode=zip

  • Specifying input as a single line, where multiple input fields are concatenated into one address field:

    --input-fields addressLines[0]=0,1,2,3

--download-location

Location of the directory where reference data will be downloaded. This path must exist on every data node.

Note: This parameter is required if the reference data is distributed remotely via HDFS or S3.

Example: --download-location /precisely/downloads

--download-group

This property is only used for POSIX-compliant platforms like Linux. It specifies the operating system group which should be applied to the downloaded data on a local file system, so that each Hadoop service can update the data when required. This group should be present on all nodes in the cluster and the operating system user executing the Hadoop service should be a part of this group.

For more information, see Download Permissions.
Note: Use only if reference data is distributed remotely via HDFS or S3.
Example: --download-group dm_users
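
As a sketch of the setup these two parameters assume, the download directory can be created on every data node and owned by the download group so that the Hadoop service accounts can refresh the data. The group name and the yarn service user below are illustrative; see Download Permissions for the authoritative steps.

# Run as root on every data node (illustrative account names)
groupadd dm_users                     # shared group from --download-group
usermod -aG dm_users yarn             # add the Hadoop service user(s)
mkdir -p /precisely/downloads         # the --download-location path
chgrp dm_users /precisely/downloads
chmod 775 /precisely/downloads        # group members may update downloaded data
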
--extraction-location

File path where the geocoding datasets will be extracted. If not specified, the default location is the same directory as the SPD.

Example: --extraction-location /precisely/geo_addr/data/extractionDirectory

--country

If your input data does not have country information, you can specify the country as a parameter. Alternatively, you can use a column reference in --input-fields (for example, --input-fields country=2).

Example: --country USA

--overwrite

Include this parameter to overwrite the output directory; otherwise, the job fails if the directory already has content. This parameter takes no value.

Example: --overwrite

--num-partitions

The minimum number of partitions used to split up the input file.

Example: --num-partitions=15

--combine

Include this parameter to combine all output files into a single output file. Otherwise, the job creates multiple output files; the number of files depends on the number of partitions specified.

Note: Using this parameter may increase the job's execution time, because the entire output dataset must be collected on a single node. As the data to be combined grows, especially beyond the space available on a single node, errors become more likely.

Example: --combine
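
If collecting the entire result on a single node is a concern, one alternative (outside this job) is to omit --combine and merge the part files afterwards. This is only appropriate for CSV/text output, since Parquet part files cannot simply be concatenated. A sketch, assuming the output path from the example above:

hdfs dfs -getmerge /user/sdkuser/customers_addresses /tmp/customers_addresses.csv
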
--input-format

The input format. Valid values: csv or parquet. If not specified, the default is csv.

Example: --input-format=parquet

--output-format

The output format. Valid values: csv or parquet. If not specified, the default is the input-format value.

Example: --output-format=csv

--csv

Specify the options to be used when reading and writing CSV input and output files.

Common options and their default values:

  • delimiter: ,
  • quote: "
  • escape: \
  • header: false

  • Specify individual options:

    --csv header=true

    --csv delimiter='\t'

  • Specify multiple options:

    --csv header=true delimiter='\t'
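
These options interact with other parameters; in particular, mapping input fields by column name requires header=true. A sketch combining options already shown above:

    --csv header=true delimiter='\t' --input-fields street=street city=city admin1=state postalCode=zip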

--parquet

Specify the options to be used when reading and writing parquet input and output files.

Example: --parquet compression=gzip

--limit

The maximum number of records to be processed in the job.

Example: --limit 5000