Batch vs Transactional Point-In-Polygon

Batch and transactional point-in-polygon operations differ in when the data becomes available: either you already have all of your point and polygon data, or your point or geometry data arrive at the stage one at a time in a real-time scenario. If your application has points originating at runtime (for example, the location of an insurance policy being generated by a customer query) and only the polygon data exists, it is a transactional use case. A batch use case, by contrast, has both the point and the polygon data persisted in tables. The Spatial Module has two stages for solving point-in-polygon use cases: Query Spatial Data and Centrus Point In Polygon (Legacy Point In Polygon). The two stages perform differently depending on whether the point-in-polygon work is batch or transactional. For more information about the Query Spatial Data and Legacy Point In Polygon stages, see:

For transactional point-in-polygon use cases, the Centrus Point In Polygon stage is preferable to the Query Spatial Data stage, because it performs better when real-time data is being read into the stage. The Query Spatial Data stage outperforms the Centrus Point In Polygon stage in batch processing and scalability. Query Spatial Data also offers more flexibility about where the point and polygon data can live and which table you iterate on. If you know the characteristics of your data (including the complexity of the polygons and the number of points and polygons), you can structure your flow to derive optimal performance.
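The difference between the two use cases can be sketched in plain Python. This is only an illustration of the access pattern, not the Spectrum API: the ray-casting test, the function names, and the in-memory territory dictionary are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]   # (x, y), for example (longitude, latitude)
Polygon = Sequence[Point]     # ring of vertices; the first vertex is not repeated


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray-casting test: count polygon edges crossed by a ray going in +x."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def lookup_transactional(point: Point,
                         polygons: Dict[str, Polygon]) -> Optional[str]:
    """Transactional pattern: one point arrives at runtime and is tested
    against polygon data that already exists."""
    for name, polygon in polygons.items():
        if point_in_polygon(point, polygon):
            return name
    return None


def lookup_batch(points: Iterable[Point],
                 polygons: Dict[str, Polygon]) -> List[Optional[str]]:
    """Batch pattern: every point is already persisted, so the whole set
    is processed in one pass."""
    return [lookup_transactional(p, polygons) for p in points]


# Hypothetical sales territories used only for this illustration.
territories = {
    "west": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "east": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
```

In the transactional case a single call such as `lookup_transactional((5, 5), territories)` is made per incoming request, while the batch case hands the full point list to `lookup_batch` at once.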

For examples of how to create both a batch Query Spatial Data and transactional Point in Polygon flow, see:

Which data format can you use? The type and format of your data also determine which operation you use in Spectrum. The Centrus Point in Polygon operation supports only the GSB format. If you are performing transactional point-in-polygon, need performance, and can convert your data to the GSB format, then the Centrus Point in Polygon operation is the faster method. Convert your data to GSB only if it is required for performance reasons, because there are other differences that might affect your needs. For example, TAB files support multiple languages and a greater number of coordinate systems, whereas GSB is English only and has more limited coordinate system support.

If you find that the Centrus Point in Polygon operation will improve your solution, and your data is in a format other than GSB, a utility included with Spectrum can transform your existing data to GSB. Spatial Import is a Windows command-line utility used to translate .SHP, TAB, .MIF/.MID, and .DBF files to a Centrus database (.GSB) file. It can be downloaded from the Spectrum Spatial section of the Welcome Page, under Spatial Import on the Utilities tab. Once you download the zip file, you have the option of using either the 32-bit (x86) or 64-bit (x64) version. See Spatial Module Conversion Utility for more information on using this utility.

Note: If you have reasons not to convert the data into GSB (such as concerns over data duplication), you can use the Query Spatial Data stage for transactional use cases and access your tabular data stored in either native TAB files or spatial database sources.

Why Query Spatial Data for batch processing? As previously mentioned, the Query Spatial Data stage offers both more flexibility and more performance options. Take the following two batch processing scenario flows (assume the polygons and points are the same):

  1. Read Spatial Data reads a list of points out of a TAB file (millions of points)

    Spatial Calculator creates longitude and latitude column values from the points

    Centrus Point In Polygon stage queries a large set of polygons (GSB)

  2. Read Spatial Data reads a large set of polygons out of a TAB file

    Query Spatial Data uses a contains filter against a point table (millions of points)

Given these two scenarios, with a pool size of 1 and 1 runtime, the Centrus Point In Polygon stage (scenario 1) is faster than Query Spatial Data (scenario 2). As the pooled instances and runtimes increase, the performance of scenario 1 remains consistent while the performance of scenario 2 scales effectively. At 4 pooled instances and 4 runtimes, Query Spatial Data is faster than the Centrus Point In Polygon stage. For a batch process where you have all of your points and polygons up front, and you want to process all geometries, the second scenario using the Query Spatial Data stage is much more effective. In cases where the polygon data does not change and points stream in one at a time (for example, a user enters an address, the address is geocoded, and the flow then determines which sales territory the address is in), the first scenario using the Centrus Point In Polygon stage is the faster solution.

Are there other performance considerations? When using Query Spatial Data, you need to consider the number of points and polygons that are being processed. When you have more polygons than points, consider iterating on points (read one point at a time using Read Spatial Data and search against the polygon table using Query Spatial Data). When you have more points than polygons, consider iterating on polygons (read one polygon at a time using Read Spatial Data and search against the point table using Query Spatial Data).
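The rule of thumb above can be written down as a small planning helper. This is a sketch only: the function name, the returned dictionary, and the tie-breaking choice when both counts are equal are assumptions for illustration and are not part of Spectrum.

```python
from typing import Dict, Union


def plan_iteration(point_count: int,
                   polygon_count: int) -> Dict[str, Union[str, int]]:
    """Pick the table to read one record at a time (the smaller one) and
    the table to search with a spatial query (the larger one)."""
    # Assumption: on a tie the guidance above gives no preference, so this
    # sketch arbitrarily iterates on points.
    if point_count <= polygon_count:
        iterate_on, query_against = "points", "polygons"
    else:
        iterate_on, query_against = "polygons", "points"
    return {
        "iterate_on": iterate_on,          # read one record at a time
        "query_against": query_against,    # searched once per record read
        # One query is issued per record read from the iterated table.
        "queries_issued": min(point_count, polygon_count),
    }
```

For example, with a few thousand polygons and millions of points, `plan_iteration(5_000_000, 3_000)` selects iterating on polygons, so the number of queries issued equals the polygon count rather than the point count.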

Where is your data? Knowing where your data is located and what type of data you are going to have for your solution is important. For example, having TAB files on the file system versus data in a DBMS changes the performance of your operation. Spectrum pushes the processing of certain operations (spatial joins) down to the database (for example, Oracle and SQL Server), which increases performance. An operation of this kind looks similar to the following:

SELECT * FROM flood_plane a, customers b WHERE MI_Contains(a.geom, b.geom)

Note: If you want to run the same operation where both tables are native TAB files (no data in a database), you will achieve better performance if you read the records one at a time from one of the tables, and perform the query against the other table using Query Spatial Data.

Is your data static or changing? If you know that all or some of your data is not changing, and you are using the Query Spatial Data operation, there is a configuration option when creating your named resources that can greatly improve performance. In Spectrum this is known as volatility. By default, all resources have volatility set to true. This means Spectrum assumes that the data can change at any time, so it has to check the data each time it is accessed to determine whether it has changed and whether new data needs to be loaded.

So how does volatility affect point-in-polygon operations? If you have volatility set to true, the act of checking the resource decreases performance even if the data source has not changed. If you know your data is not going to change, and performance is a requirement, then turn off volatility. If performance is an issue, and you know some of your data is going to change (say client lists or survey points), but some of your data is not going to change (parcels or land or sales areas), then make sure volatility is turned off for your static data. This could greatly increase your overall performance. For more information about volatility and how to change this setting for resources, see Data Source Volatility.
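The cost described here comes from the per-access check, which can be illustrated with a generic cache. The class below is a hypothetical sketch and does not reflect how Spectrum implements volatility internally; the class name, the `load` and `get_version` callables, and the counters are assumptions for illustration.

```python
from typing import Any, Callable


class CachedResource:
    """Caches loaded data; when volatile, re-checks the source on every access."""

    def __init__(self, load: Callable[[], Any],
                 get_version: Callable[[], Any],
                 volatile: bool = True) -> None:
        self._load = load                # hypothetical: reads the data source
        self._get_version = get_version  # hypothetical: e.g. a modification time
        self.volatile = volatile
        self._data: Any = None
        self._version: Any = None
        self._loaded = False
        self.checks = 0                  # change checks after the first load
        self.loads = 0                   # times the data was actually loaded

    def get(self) -> Any:
        if not self._loaded:
            # First access always loads the data.
            self._version = self._get_version()
            self._data = self._load()
            self._loaded = True
            self.loads += 1
        elif self.volatile:
            # Volatile resources pay for a change check on every access,
            # even when nothing has changed.
            self.checks += 1
            current = self._get_version()
            if current != self._version:
                self._data = self._load()
                self._version = current
                self.loads += 1
        return self._data
```

With `volatile=False`, repeated calls to `get()` perform no further checks after the first load, which is the behavior you would want for static data such as parcels or sales areas.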