Performance Tuning Recommendations

The sections below bring together tips that can help you improve the performance of various data quality stages. The recommendations cover the Advanced Matching, Data Normalization, and Universal Name stages.

The recommendations below are applicable to all stages:
  • Number of records: Analyze and filter the records before sending them for processing. Processing time increases in proportion to the number of records.
  • Cluster: Performance improves when processing runs in clustered mode.

Performance recommendations for Advanced Matching stages

Stage Performance recommendations

Intraflow Match

  • Group size: Reducing the group size or match key to an optimal level improves performance. For the same number of records, a larger group size increases processing time.
  • Express key: Using an express key during matching improves performance, but first verify that the express key is a reliable basis for an express match.
  • Input data: Performance improves when the input data arrives sorted by match key.
  • Match rule: Optimizing the match rule improves performance, because a complex match rule degrades it.
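The effect of group size can be illustrated with a rough count of comparisons. The sketch below assumes, purely for illustration, that every record in a group is compared with every other record in the same group; the actual comparison strategy of Intraflow Match is not described here. The function name and the sample group sizes are hypothetical.

```python
def pairwise_comparisons(group_sizes):
    """Count record pairs if each group is matched all-against-all.

    Assumption for illustration only: a group of n records needs
    n * (n - 1) / 2 comparisons.
    """
    return sum(n * (n - 1) // 2 for n in group_sizes)


# The same 1,000 records grouped two different ways (made-up numbers).
few_large_groups = [500, 500]
many_small_groups = [10] * 100

large = pairwise_comparisons(few_large_groups)
small = pairwise_comparisons(many_small_groups)

print(f"2 groups of 500:  {large} comparisons")
print(f"100 groups of 10: {small} comparisons")
```

Under this assumption the record count is identical in both cases, yet the smaller groups need far fewer comparisons, which is consistent with the group size recommendation above.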

Interflow Match

  • Group size: Reducing the group size or match key to an optimal level improves performance. For the same number of records, a larger group size increases processing time.
  • Match rule and key: Optimize the match rule and match keys carefully, as they play a crucial role in performance.

Transactional Match

This stage is faster than the Intraflow Match stage. However, choose the stage that fits your requirements, because the two serve different purposes. Unlike Intraflow Match, the group size does not affect performance here, because the suspect is matched only once with the candidate.

Best of Breed

Duplicate Synchronization

Filter

Condition: Minimizing the conditions improves the performance, as an increased number of conditions leads to more processing time.

Match Key Generator

Runtime: An increase in runtime instances increases performance.

Candidate Finder

The recommendations below improve performance for Search Index.
  • Runtime: Search performance improves as the number of runtime instances increases. The machine’s configuration determines how many runtime instances can be used.

    For example, we observed a performance improvement when the number of runtime instances was increased.

  • Fields: The performance of create and search operations decreases as the number of fields in the index grows. Update performance, however, remains almost the same regardless of the number of fields in the index.

    For example, we observed search performance degrade when the number of fields was increased.

  • Batch size: Performance varies with the batch size. The optimal batch size depends on the machine’s memory and CPU resources and is found by testing different batch values.
  • Shards: Update performance improves to some extent as the number of shards increases, whereas search performance degrades with more shards.

    For example, we observed that updates became relatively faster as the number of shards increased, whereas search performance degraded.

  • Candidate Finder (CF) conditions: The search takes longer as the number of conditions in the Candidate Finder stage increases.

    For example, we observed performance degradation when the number of conditions in the CF query increased.

  • Analyzer: Searching with a keyword analyzer is much faster than searching with a standard analyzer.

    For example, we observed a performance improvement when the analyzer was changed from standard to keyword.
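One plausible reason for the analyzer difference is the number of tokens each analyzer produces. In Lucene-style search engines, a keyword analyzer typically emits the whole field value as a single token, while a standard analyzer splits the value into lowercased word tokens. The sketch below only approximates that behaviour in plain Python; it is not the analyzer implementation used by the search index, and both function names are hypothetical.

```python
import re


def keyword_tokens(value):
    """Approximate a keyword analyzer: the whole value is one token."""
    return [value]


def standard_like_tokens(value):
    """Roughly approximate a standard analyzer: split into lowercased words."""
    return [token.lower() for token in re.findall(r"\w+", value)]


name = "John A. Smith"
print(keyword_tokens(name))
print(standard_like_tokens(name))
```

With fewer tokens per value there are fewer terms to index and to match at search time. The trade-off is that a keyword-style token only matches the exact full value, so the choice depends on how the field is searched.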

Note: Search index performance depends on various factors, and the points mentioned above are just an indication of how the performance can vary based on the configuration applied. It is essential to understand the end-user scenario, which is the key driver for deciding on the choice of hardware, index settings, cluster setup, and other configuration parameters to achieve optimal performance.
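The batch size recommendation amounts to timing the same workload with several candidate batch values and keeping the fastest. A minimal sketch of that procedure follows. The `process_batch` callable stands in for whatever operation is being tuned (for example, an index write); it and the other names here are hypothetical and not part of any product API.

```python
import time


def time_batches(records, batch_size, process_batch):
    """Process all records in slices of batch_size; return elapsed seconds."""
    start = time.perf_counter()
    for i in range(0, len(records), batch_size):
        process_batch(records[i:i + batch_size])
    return time.perf_counter() - start


def best_batch_size(records, candidates, process_batch):
    """Time each candidate batch size and return the fastest with all timings."""
    timings = {
        size: time_batches(records, size, process_batch)
        for size in candidates
    }
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    sample = list(range(100_000))
    # Placeholder workload; replace with the real operation being tuned.
    best, timings = best_batch_size(sample, [100, 1_000, 10_000], lambda batch: sum(batch))
    for size, seconds in timings.items():
        print(f"batch size {size}: {seconds:.4f} s")
    print("fastest:", best)
```

Because the result depends on the machine’s memory and CPU, the measurement is best repeated on the target hardware with representative data, and more than once to smooth out timing noise.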

Performance recommendation for Data Normalization stages

Stage Performance recommendation

Table Lookup

Advanced Transformer

Open Parser

Runtime: An increase in runtime instances increases performance.

Performance recommendation for Universal Name stage

Stage Performance recommendation

Name Parser

Runtime: An increase in runtime instances increases performance.