Placement and Usage of Reference Data

You can place the reference data at one of these locations:

  • All the data nodes of the Hadoop cluster
  • Hadoop Distributed File System (HDFS): If your reference data is on HDFS, you have these two options for managing it when you run the jobs:
    • Download it to the current working directory

      The reference data is downloaded to the working directory as temporary files. These files are deleted from the working directory each time a job completes, so every job must download the reference data again.

    • Download data to a local path

      The reference data is downloaded to a local path that you specify, and it remains available to all jobs until the data is refreshed on HDFS.

Managing reference data on HDFS

To download reference data from HDFS to a specified local path and run jobs that use it, ensure that:
  • The user executing the jobs has write access to the local drive on each data node.
  • There is sufficient disk space on each of the data nodes to download data from HDFS.
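
The two prerequisites above can be verified on a data node before a job is submitted. The following Python sketch is illustrative only and is not part of the product: the function name check_local_repository and the required_bytes threshold are hypothetical, and the check must be run as the user who executes the jobs.

```python
import os
import shutil


def check_local_repository(path, required_bytes):
    """Return a list of problems found for a local download path (empty if none)."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
        return problems
    # The job user must be able to enter the directory and create files in it.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    # Compare free space on the containing file system with the expected data size.
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        problems.append(f"{path} has {free} bytes free, {required_bytes} required")
    return problems
```

An empty list means that both conditions hold for that path on the node where the check ran. The check has to be repeated on every data node, because write access and free space can differ between nodes.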

Advantages of downloading data to a local path

  • You do not need to place the data on each node yourself, as you must when the reference data is copied to or placed on all the nodes (LocaltoDataNodes option).
  • On any given data node, the same version of the data is downloaded only once.
  • There is no limit on the reference data that can be downloaded.
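
The "downloaded only once per version" behavior can be pictured with a small Python sketch. This is an assumption about the general technique, not a description of how the product implements it: the marker file name .version, the function ensure_local_copy, and the download callback are all hypothetical.

```python
import os


def ensure_local_copy(local_dir, remote_version, download):
    """Download into local_dir only if its recorded version differs.

    Returns True if a download happened, False if the local copy was reused.
    """
    marker = os.path.join(local_dir, ".version")  # hypothetical marker file
    if os.path.isfile(marker):
        with open(marker) as f:
            if f.read().strip() == remote_version:
                return False  # same version already present on this node
    os.makedirs(local_dir, exist_ok=True)
    download(local_dir)  # caller-supplied function that copies the data
    # Record the version so later jobs on this node can skip the download.
    with open(marker, "w") as f:
        f.write(remote_version)
    return True
```

Under this scheme, a second job on the same node finds a matching marker and reuses the data, while a refresh on HDFS changes the version and triggers one new download.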

Properties to be specified in jobs

These are the properties you need to specify in the job configuration files to indicate the chosen reference data strategy and path.

  • Reference data is placed on all the data nodes of the Hadoop cluster:
    In the JSON string, specify these details:
    • referenceDataPathLocation: LocaltoDataNodes
    • dataDir: Path where the reference data is located.
    <property>
            <name>pb.bdq.reference.data</name>
            <value>{"referenceDataPathLocation":"LocaltoDataNodes",
                   "dataDir":"/home/data/referenceData"}</value>
            <description>Pass reference data details as JSON format.</description>
    </property>
  • Reference data is placed on Hadoop Distributed File System (HDFS) and you want to use the distributed cache mode:
    In the JSON string, specify these details:
    • referenceDataPathLocation: HDFS
    • dataDir: Path of the reference data on HDFS
    • dataDownloader: DC
    <property>
            <name>pb.bdq.reference.data</name>
            <value>{"referenceDataPathLocation":"HDFS",
            "dataDir":"./referenceData",
            "dataDownloader":{"dataDownloader":"DC"}}</value>
            <description>Pass reference data details as JSON format.
            Pass above format for DATA DOWNLOADER when data is in HDFS</description>
    </property>
  • Reference data is placed on Hadoop Distributed File System (HDFS) and you want to download it to a local path:
    In the JSON string, specify these details:
    • referenceDataPathLocation: HDFS
    • dataDir: Path of the reference data on HDFS
    • dataDownloader: HDFS
    • localFSRepository: Path where the reference data needs to be downloaded locally.
    <property>
            <name>pb.bdq.reference.data</name>
            <value>{"referenceDataPathLocation":"HDFS",
             "dataDir":"/home/data/dm/referenceData",
             "dataDownloader":{"dataDownloader":"HDFS",
             "localFSRepository":"/local/download"}}</value>
            <description>Pass reference data details as JSON format.</description>
    </property>
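
Because the property value is a JSON string with a nested dataDownloader object, it is easy to mistype by hand. The following Python sketch builds the value for each of the three strategies shown above. It is a convenience illustration, not a product utility: the function name reference_data_value is hypothetical, and the validation rules are inferred only from the examples in this section.

```python
import json


def reference_data_value(location, data_dir, downloader=None, local_repo=None):
    """Build the JSON string for the pb.bdq.reference.data property."""
    config = {"referenceDataPathLocation": location, "dataDir": data_dir}
    if location == "HDFS":
        # Assumption: "DC" and "HDFS" are the downloader values, as in the examples.
        if downloader not in ("DC", "HDFS"):
            raise ValueError("dataDownloader must be 'DC' or 'HDFS' when data is on HDFS")
        nested = {"dataDownloader": downloader}
        if downloader == "HDFS":
            if not local_repo:
                raise ValueError("localFSRepository is required for the HDFS downloader")
            nested["localFSRepository"] = local_repo
        config["dataDownloader"] = nested
    # Compact separators match the style of the property values shown above.
    return json.dumps(config, separators=(",", ":"))


print(reference_data_value("LocaltoDataNodes", "/home/data/referenceData"))
print(reference_data_value("HDFS", "./referenceData", downloader="DC"))
print(reference_data_value("HDFS", "/home/data/dm/referenceData",
                           downloader="HDFS", local_repo="/local/download"))
```

Each printed string can be pasted between the <value> tags of the pb.bdq.reference.data property.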