Candidate Finder
Candidate Finder obtains the candidate records that form the set of potential matches. Search Index searches operate independently of Transactional Match. Depending on the format of your data, Candidate Finder may also need to parse the name or address of the suspect record, the candidate records, or both.
Candidate Finder also enables full-text index searches and supports defining advanced search criteria against characters and text using various search types (Numeric, Range, Contains All, and Contains None) and conditions (All True and Any True).
Note: An HBase NoSQL database must be available and accessible in the cluster for storing search indexes.
Configuration Files
These tables describe the parameters and the values you need to specify before you run the Candidate Finder job.
Parameter | Description |
---|---|
pb.bdq.input.type | Input file type. The values can be: TEXT, ORC, or PARQUET. |
pb.bdq.inputfile.path | The path where you have placed the input file on HDFS. For example, /user/hduser/sampledata/candidatefinder/input/ CandidateFinder_Input.csv |
textinputformat.record.delimiter | Record delimiter used in a TEXT-type input file. For example, LINUX, MACINTOSH, or WINDOWS. |
pb.bdq.inputformat.field.delimiter | Field or column delimiter used in the input file, such as comma (,) or tab. |
pb.bdq.inputformat.text.qualifier | Text qualifiers, if any, in the columns or fields of the input file. |
pb.bdq.inputformat.file.header | Comma-separated value of the headers used in the input file. For example, IN_MonthNumber,IN_WeekNumber,IN_MonthName,IN_WeekdayName |
pb.bdq.inputformat.skip.firstrow | Specifies whether the first row is skipped during processing. The values can be True or False, where True indicates the row is skipped. |
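The input parameters above are typically supplied together in a job properties file. The fragment below is a sketch only; the sample values and delimiters are illustrative assumptions, not prescribed settings.

```properties
# Hypothetical input configuration for the Candidate Finder job.
# Adjust the path, delimiters, and header to match your data.
pb.bdq.input.type=TEXT
pb.bdq.inputfile.path=/user/hduser/sampledata/candidatefinder/input/CandidateFinder_Input.csv
textinputformat.record.delimiter=LINUX
pb.bdq.inputformat.field.delimiter=,
pb.bdq.inputformat.text.qualifier="
pb.bdq.inputformat.file.header=IN_MonthNumber,IN_WeekNumber,IN_MonthName,IN_WeekdayName
pb.bdq.inputformat.skip.firstrow=True
```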
Parameter | Description |
---|---|
pb.bdq.job.type | This is a constant value that defines the job. The value for this job is: CandidateFinder. |
pb.bdq.job.name | Name of the job. Default is CandidateFinderSample. |
pb.bdq.amm.search.cf.query.json | Defines the JSON string for the Candidate Finder query. |
pb.bdq.amm.search.cf.index.output.fields | Specifies which of the stored fields in the Index are to be included in the output. |
pb.bdq.amm.search.cf.index.name | Defines the name of the index or the table. |
pb.bdq.amm.search.cf.max.results | Specifies the maximum number of responses to be returned by the stage. Default is 10. |
pb.bdq.amm.search.cf.fetch.batchsize | If the maximum number of results is arbitrarily large, specify the batch size in which the results are processed. This optimizes the processing of a large number of records. Default is 10000. |
pb.bdq.amm.search.cf.start.record | The record number from which the search should begin. Default is 1. |
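Taken together, the job parameters might look like the sketch below. The index name and output fields are hypothetical placeholders, and the value for pb.bdq.amm.search.cf.query.json is deliberately omitted because its JSON schema is product-specific; consult the query reference for your version.

```properties
# Hypothetical job configuration; all values other than the job type
# are examples, not defaults mandated by the product.
pb.bdq.job.type=CandidateFinder
pb.bdq.job.name=CandidateFinderSample
pb.bdq.amm.search.cf.index.name=customer_index
pb.bdq.amm.search.cf.index.output.fields=Name,AddressLine1,City
pb.bdq.amm.search.cf.max.results=10
pb.bdq.amm.search.cf.fetch.batchsize=10000
pb.bdq.amm.search.cf.start.record=1
# pb.bdq.amm.search.cf.query.json=<JSON query string; see the query reference>
```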
Specifies the MapReduce configuration parameters |
---|
Customize MapReduce parameters, such as mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, and mapreduce.map.speculative, as needed for your job. Note: Use this file only for MapReduce jobs. |
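For example, a MapReduce tuning file might set memory and speculative-execution values like the following. The numbers are illustrative, not recommendations; size them to your cluster.

```properties
# Example MapReduce tuning values (illustrative only).
mapreduce.map.memory.mb=2048
mapreduce.reduce.memory.mb=4096
mapreduce.map.speculative=false
```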
Note: For this job, you need to specify values for these two additional MapReduce and Spark
configuration parameters:
- hbase.zookeeper.quorum
- hbase.zookeeper.property.clientPort
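These two parameters point the job at the ZooKeeper quorum that fronts HBase. The hostnames below are placeholders; 2181 is the conventional ZooKeeper client port, but use whatever your cluster actually exposes.

```properties
# Placeholder ZooKeeper settings for reaching HBase.
hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com
hbase.zookeeper.property.clientPort=2181
```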
Parameter | Description |
---|---|
pb.bdq.output.type | Specify if the output is in: TEXT, ORC, or PARQUET format. |
pb.bdq.outputfile.path | The path where you want the output file to be generated on HDFS. For example, /user/hduser/sampledata/candidatefinder/output. |
pb.bdq.outputformat.field.delimiter | Field or column delimiter in the output file, such as comma (,) or tab. |
pb.bdq.output.overwrite | If set to true, the output folder is overwritten every time the job is run. |
pb.bdq.outputformat.headerfile.create | Specify true if the output file must include a header row. |
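The output parameters can likewise be sketched as a properties fragment. The path and delimiter below are assumed example values.

```properties
# Hypothetical output configuration for the Candidate Finder job.
pb.bdq.output.type=TEXT
pb.bdq.outputfile.path=/user/hduser/sampledata/candidatefinder/output
pb.bdq.outputformat.field.delimiter=,
pb.bdq.output.overwrite=true
pb.bdq.outputformat.headerfile.create=true
```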