Classes

This section describes the classes and APIs in the Location Intelligence SDK for Big Data.

SpatialAPI

class li.SpatialAPI.SpatialAPI

This class contains the methods for all supported spatial operations.

Supported spatial operations:

  • PointInPolygon

  • SearchNearest

  • JoinByDistance

  • GenerateHexagon

static generateHexagon(sparkSession: pyspark.sql.SparkSession, minLongitude: float, minLatitude: float, maxLongitude: float, maxLatitude: float, hexLevel: int = 1, containerLevel: int = 1, numOfPartitions: int = 1, maximumNumOfRowsPerPartition: int = 1)

A HexagonGeneration Operation: This method generates hexagons within a bounding box defined by the minimum and maximum longitude and latitude values. The hexagon output can be used for map display.

Parameters
  • sparkSession (pyspark.sql.SparkSession) – Spark session to be used

  • minLongitude (float) – Minimum longitude value of the bounding box for which hexagons need to be generated

  • minLatitude (float) – Minimum latitude value of the bounding box for which hexagons need to be generated

  • maxLongitude (float) – Maximum longitude value of the bounding box for which hexagons need to be generated

  • maxLatitude (float) – Maximum latitude value of the bounding box for which hexagons need to be generated

  • hexLevel (int) – The level to generate hexagons for. Must be between 1 and 11, defaults to 1

  • containerLevel (int) – A hint that enables parallel hexagon generation. Must be less than the hexLevel parameter, defaults to 1

  • numOfPartitions (int) – Number of partitions, defaults to 1

  • maximumNumOfRowsPerPartition (int) – Maximum number of rows per partition, defaults to 1

Returns

A dataframe representing the hexagons in WKT format

Return type

pyspark.sql.DataFrame
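
A minimal usage sketch based on the signature above (not taken from the SDK itself): it assumes the li package is installed and a Spark session is available; the bounding box and tuning values are illustrative only.

```python
from pyspark.sql import SparkSession
from li.SpatialAPI import SpatialAPI

spark = SparkSession.builder.appName("hexagon-demo").getOrCreate()

# Generate level-5 hexagons covering a hypothetical bounding box.
hex_df = SpatialAPI.generateHexagon(
    sparkSession=spark,
    minLongitude=-0.5, minLatitude=51.3,
    maxLongitude=0.3, maxLatitude=51.7,
    hexLevel=5,                      # must be between 1 and 11
    containerLevel=3,                # must be less than hexLevel
    numOfPartitions=4,
    maximumNumOfRowsPerPartition=10000,
)
hex_df.show(5, truncate=False)       # hexagons as WKT strings
```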

static joinByDistance(df1: pyspark.sql.DataFrame, df2: pyspark.sql.DataFrame, df1Longitude: str, df1Latitude: str, df2Longitude: str, df2Latitude: str, searchRadius: float, distanceUnit: str, geoHashPrecision: int = 7, options: Optional[dict] = None)

A JoinByDistance Operation: This method joins two dataframes using a longitude and latitude column from each dataframe that represent the location of each record. The coordinate values must be in the CoordSysConstants.longLatWGS84 coordinate system. The searchRadius parameter defines the buffer around a point from the first dataframe within which a point from the second dataframe must fall to be joined. The geoHashPrecision parameter controls the geohash precision used within the calculation.

Parameters
  • df1 (pyspark.sql.DataFrame) – The dataframe to join to

  • df2 (pyspark.sql.DataFrame) – The dataframe to be joined

  • df1Longitude (str) – The Longitude value from the first dataframe

  • df1Latitude (str) – The Latitude value from the first dataframe

  • df2Longitude (str) – The Longitude value from the second dataframe

  • df2Latitude (str) – The Latitude value from the second dataframe

  • searchRadius (float) – The buffer length around point 1 to search for point 2

  • distanceUnit (str) – Unit of measurement for searchRadius parameter.

  • geoHashPrecision (int) – The geohash precision value to be used for search, defaults to 7.

  • options (dict) – A key/value map of DistanceJoinOption entries that apply to the join, defaults to None.

Returns

A dataframe that is the result of the join

Return type

pyspark.sql.DataFrame
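
A hedged usage sketch, assuming two existing dataframes with the hypothetical column names shown and that "mi" is an accepted distanceUnit value:

```python
from li.SpatialAPI import SpatialAPI

# stores_df and customers_df are assumed to already exist with
# WGS84 longitude/latitude columns under the names below.
joined = SpatialAPI.joinByDistance(
    df1=stores_df, df2=customers_df,
    df1Longitude="store_lon", df1Latitude="store_lat",
    df2Longitude="cust_lon", df2Latitude="cust_lat",
    searchRadius=5.0,
    distanceUnit="mi",
    geoHashPrecision=7,
)
```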

static pointInPolygon(inputDF: pyspark.sql.DataFrame, tableFileType: str, tableFilePath: str, tableFileName: str, longitude: str, latitude: str, outputFields: list, downloadManager=None, libraries: Optional[str] = None, includeEmptySearchResults: bool = True)

A PointInPolygon Operation: This method filters the input dataframe to the point coordinates that lie within a specified polygon (for example, the polygon of the continental USA) and adds the requested output fields from the polygon table to the input dataset as columns.

Parameters
  • inputDF (pyspark.sql.DataFrame) – dataframe of input dataset

  • tableFileType (str) – Type of target polygon data file (either TAB/shape/geodatabase)

  • tableFilePath (str) – Path to polygon data files

  • tableFileName (str) – Name of the TAB/shape/geodatabase file

  • longitude (str) – Name of column containing longitude values in input point data

  • latitude (str) – Name of column containing latitude values in input point data

  • outputFields (list) – The requested fields to be included in the output

  • downloadManager (DownloadManager) – DownloadManager instance to be used if data is present in S3 or HDFS, defaults to None.

  • libraries (str) – Libraries to use when tableFileType is geodatabase, defaults to None.

  • includeEmptySearchResults (bool) – If true, an input row with an empty search result is kept and the new columns are null; if false, the row does not appear in the output DataFrame. Defaults to true.

Returns

The input DataFrame with the output fields appended as columns for the point coordinates that lie within the specified polygon

Return type

pyspark.sql.DataFrame
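
A hedged usage sketch: points_df is assumed to exist, and the TAB file name, path, and output field names are hypothetical.

```python
from li.SpatialAPI import SpatialAPI

result = SpatialAPI.pointInPolygon(
    inputDF=points_df,
    tableFileType="TAB",
    tableFilePath="/data/boundaries",
    tableFileName="usa.TAB",
    longitude="lon",
    latitude="lat",
    outputFields=["STATE_NAME", "STATE_FIPS"],
    includeEmptySearchResults=False,  # drop points outside the polygon
)
```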

static searchNearest(inputDF: pyspark.sql.DataFrame, tableFileType: str, tableFilePath: str, tableFileName: str, geometryStringType: str, geometryColumnName: str, outputFields: list, distanceValue: float, distanceUnit: str, distanceColumnName: str = 'distance', downloadManager=None, libraries: Optional[str] = None, maxCandidates: int = 1000, includeEmptySearchResults: bool = True)

A SearchNearest Operation: This method takes a geometry string (in GeoJSON, WKT, KML, or WKB format) and searches a table of geometries for those within a specified distance of it. The number of candidate geometries searched can be limited with the maxCandidates parameter. By default, geometries are listed from nearest to farthest.

Parameters
  • inputDF (pyspark.sql.DataFrame) – dataframe of input dataset

  • tableFileType (str) – Type of target polygon data file (either TAB/shape/geodatabase)

  • tableFilePath (str) – Path to polygon data files

  • tableFileName (str) – Name of the TAB/shape/geodatabase file

  • geometryStringType (str) – Type of geometry string provided in input file. Supported values are WKT/GeoJSON/WKB/KML

  • geometryColumnName (str) – Name of column containing string representation of geometry

  • outputFields (list) – The requested fields to be included in the output

  • distanceValue (float) – The absolute value of distance from source geometry within which target geometries will be searched for.

  • distanceUnit (str) – Unit of measurement for distanceValue parameter. This same unit will also be used when appending distance column in output dataframe.

  • distanceColumnName (str) – Name of the distance column in output dataframe which indicates distance between source geometry and target geometry.

  • downloadManager (DownloadManager) – DownloadManager instance to be used if data is present in S3 or HDFS, defaults to None.

  • libraries (str) – Libraries to use when tableFileType is geodatabase, defaults to None.

  • maxCandidates (int) – Limits the count of target geometries to search for, defaults to 1000.

  • includeEmptySearchResults (bool) – If true, an input row with an empty search result is kept and the new columns are null; if false, the row does not appear in the output DataFrame. Defaults to true.

Returns

The input DataFrame with the output fields appended as columns for each target geometry whose distance from the source geometry is within distanceValue. An additional column named distanceColumnName indicates the distance between the source and target geometry, and records are ordered by ascending value of this column.

Return type

pyspark.sql.DataFrame
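
A hedged usage sketch: sites_df is assumed to hold a WKT geometry column, and the file names, field names, and unit value are hypothetical.

```python
from li.SpatialAPI import SpatialAPI

nearest = SpatialAPI.searchNearest(
    inputDF=sites_df,
    tableFileType="TAB",
    tableFilePath="/data/poi",
    tableFileName="restaurants.TAB",
    geometryStringType="WKT",
    geometryColumnName="geometry",
    outputFields=["NAME", "CATEGORY"],
    distanceValue=2.0,
    distanceUnit="km",
    distanceColumnName="distance_km",  # distance reported in km
    maxCandidates=10,
)
```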

SQLRegistrator

class li.SQLRegistrator.SQLRegistrator
static registerAll()

Registers the pre-defined LI SQL UDF and UDT functions to execute the SQL operations.

Param

None

Returns

None

Return type

None
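
A minimal sketch of registration; the specific UDF/UDT names made available are defined by the SDK and are not listed here.

```python
from li.SQLRegistrator import SQLRegistrator

# Register the pre-defined LI SQL UDF and UDT functions once per session.
SQLRegistrator.registerAll()

# After registration, the LI functions can be used inside
# spark.sql(...) queries.
```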

DistanceJoinOption

class li.DistanceJoinOption.DistanceJoinOption

Options for the distance join operations.

DistanceColumnName

Adds a column to the result dataframe that contains the distance calculated.

LimitMatches

Limits the number of joined results for each source dataframe record. The argument should be a number, and the match results will be limited to those that rank at the number or lower based on distance. Default is no limit.

LimitMethod

The method used for ranking matches. The argument should be a LimitMethods value.

LimitMethods

class li.LimitMethods.LimitMethods

Options for providing the value of DistanceJoinOption.LimitMethod.

DenseRank

A DenseRank window function for limiting matches.

Rank

A Rank window function for limiting matches.

RowNumber

A RowNumber window function for limiting matches.
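
These options can be combined into the dict passed to joinByDistance. A hedged sketch, assuming the DistanceJoinOption and LimitMethods attributes are used directly as dict keys and values (dataframe and column names are hypothetical):

```python
from li.SpatialAPI import SpatialAPI
from li.DistanceJoinOption import DistanceJoinOption
from li.LimitMethods import LimitMethods

joined = SpatialAPI.joinByDistance(
    df1=stores_df, df2=customers_df,
    df1Longitude="store_lon", df1Latitude="store_lat",
    df2Longitude="cust_lon", df2Latitude="cust_lat",
    searchRadius=5.0, distanceUnit="mi",
    options={
        DistanceJoinOption.DistanceColumnName: "distance",   # add distance column
        DistanceJoinOption.LimitMatches: 3,                  # keep 3 nearest matches
        DistanceJoinOption.LimitMethod: LimitMethods.RowNumber,
    },
)
```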

DownloadManagerBuilder

class li.DownloadManagerBuilder.DownloadManagerBuilder(downloadLocation=None, permissions=None)

This builder class configures downloading from remote paths.

addDownloader(downloader)

Adds a configured download manager to use when downloading from remote paths. If multiple download managers claim to support a path, then the download manager added first will be used.

Parameters

downloader – The downloader configuration for remote path.

Returns

An object of DownloadManagerBuilder class

Return type

DownloadManagerBuilder

build()

Returns the configured DownloadManager used when downloading from remote paths.

Returns

An object of configured DownloadManager class

Return type

DownloadManager
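
A hedged sketch of building a DownloadManager. It assumes addDownloader accepts the object returned by a downloader's getDownloader() method; the download location is hypothetical.

```python
from li.DownloadManagerBuilder import DownloadManagerBuilder
from li.HadoopConfiguration import HadoopConfiguration
from li.S3Downloader import S3Downloader
from li.LocalFilePassthroughDownloader import LocalFilePassthroughDownloader

hadoop_conf = HadoopConfiguration()

# Downloaders are consulted in the order they were added: if multiple
# downloaders claim a path, the first one added wins.
download_manager = (
    DownloadManagerBuilder(downloadLocation="/tmp/li-data")
    .addDownloader(S3Downloader(hadoop_conf).getDownloader())
    .addDownloader(LocalFilePassthroughDownloader().getDownloader())
    .build()
)
```

The resulting download_manager can then be passed as the downloadManager argument of pointInPolygon or searchNearest.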

S3Downloader

class li.S3Downloader.S3Downloader(hadoopConfiguration)

This Downloader class enables the use of data located in S3.

getDownloader()

Returns the object reference of S3 Downloader responsible for downloading the data from S3.

Returns

An object of S3Downloader

Return type

S3Downloader

GoogleDownloader

class li.GoogleDownloader.GoogleDownloader(hadoopConfiguration)

This Downloader class enables the use of data located in Google Storage.

getDownloader()

Returns the object reference of Google Downloader responsible for downloading the data from Google Storage.

Returns

An object of GoogleDownloader

Return type

GoogleDownloader

HDFSDownloader

class li.HDFSDownloader.HDFSDownloader(hadoopConfiguration)

This Downloader class enables the use of data located in HDFS.

getDownloader()

Returns the object reference of HDFS Downloader responsible for downloading the data from HDFS.

Returns

An object of HDFSDownloader

Return type

HDFSDownloader

LocalFilePassthroughDownloader

class li.LocalFilePassthroughDownloader.LocalFilePassthroughDownloader

This Downloader class enables the use of data located on the Local File System.

getDownloader()

Returns the object reference of Local Downloader responsible for downloading the data from Local File System.

Returns

An object of LocalFilePassthroughDownloader

Return type

LocalFilePassthroughDownloader

HadoopConfiguration

class li.HadoopConfiguration.HadoopConfiguration

This configuration class allows you to create a HadoopConfiguration for use with the Downloader classes.

getHadoopConfiguration()

Returns the wrapped HadoopConfiguration class.

Returns

An object of HadoopConfiguration

Return type

HadoopConfiguration
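
A hedged sketch of tuning the wrapped configuration before handing it to a downloader. The fs.s3a.* property names come from hadoop-aws, not from this SDK, and the set(...) call assumes the wrapped object exposes Hadoop's Configuration.set method.

```python
from li.HadoopConfiguration import HadoopConfiguration

conf = HadoopConfiguration()

# Set S3 credentials on the underlying Hadoop Configuration, then
# pass conf to S3Downloader(conf).
hconf = conf.getHadoopConfiguration()
hconf.set("fs.s3a.access.key", "<access-key>")
hconf.set("fs.s3a.secret.key", "<secret-key>")
```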