Candidate Finder
Candidate Finder obtains the candidate records that form the set of potential matches. Search Index searches operate independently of Transactional Match. Depending on the format of your data, Candidate Finder may also need to parse the name or address of the suspect record, the candidate records, or both.
Candidate Finder also enables full-text index searches and supports defining advanced search criteria against characters and text using various search types (Numeric, Range, Contains All, and Contains None) and conditions (All True and Any True).
Note: An HBase NoSQL database must be available and accessible in the cluster for storing search indexes.
Configuration Files
These tables describe the parameters and the values you need to specify before you run the Candidate Finder job.
Parameter | Description |
---|---|
pb.bdq.input.type | Input file type. The values can be: TEXT, ORC, or PARQUET. |
pb.bdq.inputfile.path | The path where you have placed the input file on HDFS. For example, /user/hduser/sampledata/candidatefinder/input/ CandidateFinder_Input.csv |
textinputformat.record.delimiter | Record delimiter used in a TEXT-type input file. For example, LINUX, MACINTOSH, or WINDOWS. |
pb.bdq.inputformat.field.delimiter | Field or column delimiter used in the input file, such as comma (,) or tab. |
pb.bdq.inputformat.text.qualifier | Text qualifiers, if any, in the columns or fields of the input file. |
pb.bdq.inputformat.file.header | Comma-separated value of the headers used in the input file. For example, IN_MonthNumber,IN_WeekNumber,IN_MonthName,IN_WeekdayName |
pb.bdq.inputformat.skip.firstrow | Specifies whether the first row is skipped during processing. The values can be True or False, where True indicates the row is skipped. |
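The input parameters above are typically supplied together in a job properties file. The fragment below is a sketch only; the sample values and delimiters are illustrative assumptions, not prescribed settings.

```properties
# Hypothetical input configuration for the Candidate Finder job.
# Adjust the path, delimiters, and header to match your data.
pb.bdq.input.type=TEXT
pb.bdq.inputfile.path=/user/hduser/sampledata/candidatefinder/input/CandidateFinder_Input.csv
textinputformat.record.delimiter=LINUX
pb.bdq.inputformat.field.delimiter=,
pb.bdq.inputformat.text.qualifier="
pb.bdq.inputformat.file.header=IN_MonthNumber,IN_WeekNumber,IN_MonthName,IN_WeekdayName
pb.bdq.inputformat.skip.firstrow=True
```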
Parameter | Description |
---|---|
pb.bdq.job.type | This is a constant value that defines the job. The value for this job is: CandidateFinder. |
pb.bdq.job.name | Name of the job. Default is CandidateFinderSample. |
pb.bdq.amm.search.cf.query.json | Defines the JSON string for the Candidate Finder query. |
pb.bdq.amm.search.cf.index.output.fields | Specifies which of the stored fields in the Index are to be included in the output. |
pb.bdq.amm.search.cf.index.name | Defines the name of the index or the table. |
pb.bdq.amm.search.cf.max.results | Specifies the maximum number of responses to be returned by the stage. Default is 10. |
pb.bdq.amm.search.cf.fetch.batchsize | If the maximum number of results is arbitrarily large, specify the batch size in which the results are processed. This optimizes the processing of a large number of records. Default is 10000. |
pb.bdq.amm.search.cf.start.record | The record number from which the search should begin. Default is 1. |
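Taken together, the job parameters might look like the sketch below. The index name and output fields are hypothetical placeholders, and the value for pb.bdq.amm.search.cf.query.json is deliberately omitted because its JSON schema is product-specific; consult the query reference for your version.

```properties
# Hypothetical job configuration; all values other than the job type
# are examples, not defaults mandated by the product.
pb.bdq.job.type=CandidateFinder
pb.bdq.job.name=CandidateFinderSample
pb.bdq.amm.search.cf.index.name=customer_index
pb.bdq.amm.search.cf.index.output.fields=Name,AddressLine1,City
pb.bdq.amm.search.cf.max.results=10
pb.bdq.amm.search.cf.fetch.batchsize=10000
pb.bdq.amm.search.cf.start.record=1
# pb.bdq.amm.search.cf.query.json=<JSON query string; see the query reference>
```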
Specifies the MapReduce configuration parameters |
---|
Customize MapReduce parameters, such as mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, and mapreduce.map.speculative, as needed for your job. Note: Use this file only for MapReduce jobs. |
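For example, a MapReduce tuning file might set memory and speculative-execution values like the following. The numbers are illustrative, not recommendations; size them to your cluster.

```properties
# Example MapReduce tuning values (illustrative only).
mapreduce.map.memory.mb=2048
mapreduce.reduce.memory.mb=4096
mapreduce.map.speculative=false
```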
Note: For this job, you need to specify values for these two additional MapReduce and Spark
configuration parameters:
- hbase.zookeeper.quorum
- hbase.zookeeper.property.clientPort
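These two parameters point the job at the ZooKeeper quorum that fronts HBase. The hostnames below are placeholders; 2181 is the conventional ZooKeeper client port, but use whatever your cluster actually exposes.

```properties
# Placeholder ZooKeeper settings for reaching HBase.
hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com
hbase.zookeeper.property.clientPort=2181
```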
Parameter | Description |
---|---|
pb.bdq.output.type | Specify if the output is in: TEXT, ORC, or PARQUET format. |
pb.bdq.outputfile.path | The path where you want the output file to be generated on HDFS. For example, /user/hduser/sampledata/candidatefinder/output. |
pb.bdq.outputformat.field.delimiter | Field or column delimiter in the output file, such as comma (,) or tab. |
pb.bdq.output.overwrite | If set to true, the output folder is overwritten every time the job is run. |
pb.bdq.outputformat.headerfile.create | Specify true if the output file must include a header row. |
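The output parameters can likewise be sketched as a properties fragment. The path and delimiter below are assumed example values.

```properties
# Hypothetical output configuration for the Candidate Finder job.
pb.bdq.output.type=TEXT
pb.bdq.outputfile.path=/user/hduser/sampledata/candidatefinder/output
pb.bdq.outputformat.field.delimiter=,
pb.bdq.output.overwrite=true
pb.bdq.outputformat.headerfile.create=true
```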