Configuration Files

These tables describe the parameters and the values you need to specify before you run the Match Key Generator job.

Table 1. inputFileConfig
Parameter Description
pb.bdq.input.type Input file type. The values can be: TEXT, ORC or PARQUET.
pb.bdq.inputfile.path The path where you have placed the input file on HDFS. For example, /user/hduser/sampledata/matchkeygenerator/input/MatchKey_Input.csv.
textinputformat.record.delimiter Record delimiter used in a TEXT input file. For example, LINUX, MACINTOSH, or WINDOWS.
pb.bdq.inputformat.field.delimiter Field or column delimiter used in the input file, such as comma (,) or tab.
pb.bdq.inputformat.text.qualifier Text qualifiers, if any, in the columns or fields of the input file.
pb.bdq.inputformat.file.header Column headers as comma-separated values. For example, businessname, id, and domain.
pb.bdq.inputformat.skip.firstrow Whether to skip the first row during processing. The values can be True or False, where True skips the first row.
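As a rough illustration, the Table 1 parameters can be collected as key/value pairs and sanity-checked before a job is submitted. The Python sketch below is not part of the product: the helper name validate_input_config, the sample values, and the choice of which parameters are treated as mandatory are all assumptions, and the on-disk format of the configuration file is not shown.

```python
# Hypothetical pre-flight check for the inputFileConfig parameters.
# Only the allowed file types come from the table; the rest is assumed.
ALLOWED_INPUT_TYPES = {"TEXT", "ORC", "PARQUET"}


def validate_input_config(config):
    """Return a list of problems found in a parameter mapping."""
    problems = []

    input_type = str(config.get("pb.bdq.input.type", ""))
    if input_type.upper() not in ALLOWED_INPUT_TYPES:
        problems.append("pb.bdq.input.type must be TEXT, ORC or PARQUET")

    # Assumption: the input path is always needed.
    if not config.get("pb.bdq.inputfile.path"):
        problems.append("pb.bdq.inputfile.path is missing")

    skip = str(config.get("pb.bdq.inputformat.skip.firstrow", "False"))
    if skip.lower() not in {"true", "false"}:
        problems.append("pb.bdq.inputformat.skip.firstrow must be True or False")

    return problems


sample_input_config = {
    "pb.bdq.input.type": "TEXT",
    "pb.bdq.inputfile.path": "/user/hduser/sampledata/matchkeygenerator/input/MatchKey_Input.csv",
    "textinputformat.record.delimiter": "LINUX",
    "pb.bdq.inputformat.field.delimiter": ",",
    "pb.bdq.inputformat.file.header": "businessname,id,domain",
    "pb.bdq.inputformat.skip.firstrow": "False",
}

print(validate_input_config(sample_input_config))
```

An empty list means that none of the checks in this sketch found a problem; it does not guarantee that the job itself accepts the configuration.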
Table 2. mapReduceConfig
Specifies the MapReduce configuration parameters.
Customize MapReduce parameters, such as mapreduce.reduce.memory.mb, as needed for your job.
Note: Use this file only for MapReduce jobs.
Table 3. matchKeyGeneratorConfig
Parameter Description
pb.bdq.job.type A constant value that defines the job. The value for this job is MatchKeyGen. The job name defaults to MatchKeySample.
pb.bdq.match.keygenerator.json JSON string that defines the match key generator rules: the algorithm used to generate the match key, the field to which the algorithm is applied, the starting position within that field, the number of characters to include from the starting position, whether characters that are neither numeric nor alphabetic are removed, and whether the input fields are sorted.
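To make the rule options more concrete, here is a minimal Python sketch of a substring-style rule and of serializing such a rule to a JSON string. The key names (field, startPosition, length, removeNoiseCharacters), the function build_match_key, the 1-based starting position, and the order in which the steps are applied are all hypothetical; they do not reflect the product's actual JSON schema or algorithms, so refer to the product samples for the real structure.

```python
import json
import re


def build_match_key(record, rule):
    """Illustrative only: derive a key from one field of a record."""
    value = record[rule["field"]]
    if rule["removeNoiseCharacters"]:
        # Drop every character that is neither a letter nor a digit.
        value = re.sub(r"[^A-Za-z0-9]", "", value)
    start = rule["startPosition"] - 1  # assumed to be 1-based
    return value[start:start + rule["length"]]


# Hypothetical rule; these key names are NOT the product's schema.
rule = {
    "field": "businessname",
    "startPosition": 1,
    "length": 5,
    "removeNoiseCharacters": True,
}
rule_json = json.dumps(rule)  # a JSON string, as the parameter expects

record = {"businessname": "A-1 Auto Parts", "id": "17", "domain": "example.com"}
print(build_match_key(record, json.loads(rule_json)))  # A1Aut
```

In this sketch the hyphen and spaces are stripped first, so the key is taken from "A1AutoParts" rather than from the raw field value.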
Table 4. outputFileConfig
Parameter Description
pb.bdq.output.type Output file type. The values can be: TEXT, ORC or PARQUET.
pb.bdq.outputfile.path The path where you want the output file to be generated on HDFS.
pb.bdq.outputformat.field.delimiter Field or column delimiter in the output file, such as comma (,) or tab.
pb.bdq.output.overwrite If true, the output folder is overwritten every time the job is run.
pb.bdq.outputformat.headerfile.create Specify true if the output file needs a header.
Properties of Parquet file
parquet.compression The compression algorithm used to compress pages. It is one of these: UNCOMPRESSED, SNAPPY, GZIP, or LZO.
parquet.block.size The size of a row group being buffered in memory. Larger values improve the I/O when reading but consume more memory when writing. Default size is 134217728 bytes (= 128 * 1024 * 1024).
parquet.page.size Pages make up a block; a page is the smallest unit that must be read fully to access a single record. Note: A very small page size results in deterioration of compression. Default size is 1048576 bytes (= 1 * 1024 * 1024).
parquet.enable.dictionary Boolean value (True or False) that enables or disables dictionary encoding. Default is True.
parquet.validation Default boolean value is False.
parquet.writer.version Specifies the writer version. The value must be PARQUET_1_0 or PARQUET_2_0. Default is PARQUET_1_0.
parquet.writer.max-padding Defaults to no padding (0% of the row group size).
Default boolean value is True.
Default is 100.
Default is 10000.
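The size defaults and enumerated values listed for the Parquet properties can be cross-checked with a few lines of Python. This is only a sketch: the helper check_parquet_properties and the sample mapping are assumptions made for illustration, not a product API.

```python
# Defaults quoted in the table, written out as arithmetic.
BLOCK_SIZE_DEFAULT = 128 * 1024 * 1024  # parquet.block.size default, in bytes
PAGE_SIZE_DEFAULT = 1 * 1024 * 1024     # page size default, in bytes

ALLOWED_COMPRESSION = {"UNCOMPRESSED", "SNAPPY", "GZIP", "LZO"}
ALLOWED_WRITER_VERSIONS = {"PARQUET_1_0", "PARQUET_2_0"}


def check_parquet_properties(props):
    """Hypothetical helper: return a list of problems in a property mapping."""
    problems = []

    compression = props.get("parquet.compression")
    if compression is not None and compression not in ALLOWED_COMPRESSION:
        problems.append("unsupported parquet.compression: " + str(compression))

    version = props.get("parquet.writer.version", "PARQUET_1_0")
    if version not in ALLOWED_WRITER_VERSIONS:
        problems.append("unsupported parquet.writer.version: " + str(version))

    block_size = int(props.get("parquet.block.size", BLOCK_SIZE_DEFAULT))
    if block_size <= 0:
        problems.append("parquet.block.size must be a positive number of bytes")

    return problems


print(BLOCK_SIZE_DEFAULT)                       # 134217728
print(PAGE_SIZE_DEFAULT)                        # 1048576
print(BLOCK_SIZE_DEFAULT // PAGE_SIZE_DEFAULT)  # 128
print(check_parquet_properties({"parquet.compression": "SNAPPY"}))  # []
```

With the defaults, the block size is 128 times the page size, which gives a sense of scale when tuning either value.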