Knowing Your Data

When creating a point-in-polygon solution, it is important to understand how your data affects performance, which Spectrum operation you should use, and what limits your data imposes.

Where is your data? Knowing where your data is located and what type of data your solution uses is important. For example, TAB files on the file system perform differently from data in a DBMS. Spectrum pushes the processing of certain operations (spatial joins) down to the database (e.g., Oracle and SQL Server), which improves performance. An example is an operation similar to the following:

Select a.id, b.id from flood_plane a, customers b where MI_Contains(a.geom, b.geom)

What is the geometry format? Performance differs between the MapInfo native geometry format and a file in x/y format (e.g., a CSV file with lat/long values). To improve performance, consider using an x/y format instead of the MapInfo native geometry format.

Is your data static or changing? If you know that all or some of your data is not changing, and you are using the Query Spatial Data operation, there is a configuration option when creating your named resources that can greatly improve performance. In Spectrum this is known as volatility. By default, all resources have volatility set to true. This means Spectrum assumes that the data can change at any time, so each time the data is accessed it has to check whether the data has changed and decide whether it needs to load new data.

So how does volatility affect point-in-polygon operations? If volatility is set to true, the check on the resource decreases performance even when the data source has not changed. If you know your data is not going to change, and performance is a requirement, then turn off volatility. If performance is an issue, and you know some of your data is going to change (say client lists or survey points), but some of your data is not going to change (parcels of land or sales areas), then make sure volatility is turned off on as much of your static data as possible. This will increase performance. For more information on volatility and how to change this setting for resources, see Data Source Volatility.

Are you using TAB files? When using TAB files, you can maintain a pool of open file handles to avoid the expense of opening and reopening a file every time it is read. Spectrum Spatial uses the file handle pool for native TAB files whose volatility setting is false. Native TAB files include native Extended and Seamless TAB files. All tables in the Spectrum Spatial repository are, by default, volatile (true). Volatility for native TAB files means that the schema could change at any time. To take advantage of this performance boost, set the volatility setting to false in Spatial Manager. In general, setting volatility to false is recommended if the data changes only at known time periods or not at all.

The file handle pool is enabled by default. To turn it off, go into \server\modules\spatial\pool-tab.properties and set tab.cache.enabled to false. You must restart the server for the setting to take effect.

Configuration of the file handle pool is done through the tab-file-handle-pool.properties file, also located in the \server\modules\spatial folder. Among the properties are the maximum number of handles that can be allocated to the pool (maxTotal), the maximum number of allocated handles per file (maxTotalPerKey), and the minimum length of time a file handle can sit in the pool unused before being closed (minEvictableIdleTimeMillis).

For seamless tables, there is a general formula for maximizing the performance of the file handle pool. Specifically, you have to calculate the maximum number of handles that can be allocated to the pool (maxTotal). Use the following steps to calculate maxTotal:

  1. Find the seamless table with the largest number of sub-tables, and note that number (#ofsub-tables).
  2. Determine the number of threads you are using (#ofthreads).
  3. The formula is (3 + (3 x #ofsub-tables)) x #ofthreads = maxTotal. If your seamless tables do not have .ind files, the formula is (2 + (2 x #ofsub-tables)) x #ofthreads = maxTotal.

For example, suppose you are using the entire seamless USA table, which has .ind files and 54 sub-tables, and you are using 8 threads. The calculation for maxTotal is (3 + (3 x 54)) x 8 = 1320.
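
The formula above can also be expressed as a small helper, which may be convenient when comparing several table and thread combinations. This is only an illustrative sketch: the function name max_total_handles and its parameters are hypothetical and are not part of Spectrum.

```python
def max_total_handles(sub_tables: int, threads: int, has_ind_files: bool = True) -> int:
    """Return the maxTotal value given by the seamless-table formula.

    sub_tables    -- number of sub-tables in the largest seamless table
    threads       -- number of threads in use
    has_ind_files -- True if the seamless tables have .ind files
    """
    if sub_tables < 0 or threads < 1:
        raise ValueError("sub_tables must be >= 0 and threads must be >= 1")
    # The multiplier is 3 when .ind files are present, 2 when they are not.
    per_table = 3 if has_ind_files else 2
    return (per_table + per_table * sub_tables) * threads


if __name__ == "__main__":
    # Seamless USA table from the example: 54 sub-tables, .ind files, 8 threads.
    print(max_total_handles(54, 8))  # prints 1320
```

With the same 54 sub-tables and 8 threads but no .ind files, the second form of the formula gives (2 + (2 x 54)) x 8 = 880.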

Note: Depending on which OS you use, you may run out of open file handles.
Note: If you are using more than 10 threads, increase the value of maxTotalPerKey to the number of threads you are using.
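
The two notes above can be turned into a quick sanity check before changing the pool settings. The following is a minimal sketch under several assumptions: it runs on a Unix-like OS (it uses Python's standard resource module, which is not available on Windows), it reads the open-file limit of the process running the script rather than that of the Spectrum server process, and the function names and the headroom value are hypothetical and chosen only for illustration.

```python
import resource  # standard library, Unix-like systems only


def recommended_max_total_per_key(threads: int, current: int) -> int:
    """Raise maxTotalPerKey to the thread count only when more than 10 threads are used."""
    if threads > 10:
        return max(current, threads)
    return current


def fits_open_file_limit(max_total: int, headroom: int = 256) -> bool:
    """Check whether max_total handles plus some headroom fit under the soft open-file limit.

    headroom is an assumed allowance for other files and sockets the process opens;
    it is not a Spectrum setting.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return max_total + headroom <= soft


if __name__ == "__main__":
    # maxTotal from the seamless USA example, with 16 threads assumed for maxTotalPerKey.
    print("maxTotal fits under the soft limit:", fits_open_file_limit(1320))
    print("maxTotalPerKey:", recommended_max_total_per_key(16, current=10))
```

If the check fails, either the OS limit on open files or the maxTotal value would need to be revisited; how to raise the limit depends on the OS and on how the server is started.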