Connecting to Hadoop

In order for Spectrum™ Technology Platform to access data in Hadoop, you must define a connection to Hadoop using Management Console. Once you do this, you can create flows in Enterprise Designer that can read data from, and write data to, Hadoop.

Attention: Spectrum™ Technology Platform does not support Kerberos authentication with Hadoop 2.x on Windows.
  1. Open Management Console.
  2. Go to Resources > Data Sources.
  3. Click the Add button.
  4. In the Name field, enter a name for the connection. The name can be anything you choose.
    Note: Once you save a connection you cannot change the name.
  5. In the Type field, choose HDFS.
  6. In the Host field, enter the hostname or IP address of the NameNode in the HDFS cluster.
  7. In the Port field, enter the network port on which the NameNode listens.
  8. In the User field, select the method for authenticating to HDFS:
    Server user
    Select this option if authentication is enabled in your HDFS cluster. This option authenticates to HDFS with the credentials of the user account under which the Spectrum™ Technology Platform server runs.
    User name
    Select this option if authentication is disabled in your HDFS cluster.
  9. Check Kerberos if you want to enable Kerberos authentication for this HDFS file server connection.
  10. If you enabled Kerberos authentication, enter the path to the keytab file in the Keytab file path field.
    Note: The keytab file must be on the Spectrum™ Technology Platform server.
  11. In the Protocol field, select the method of communication with HDFS:
    WEBHDFS
    Select this option if the HDFS cluster is running HDFS 1.0 or later. This protocol supports both read and write operations. (A client-side sketch that uses this protocol appears at the end of this topic.)
    HFTP
    Select this option if the HDFS cluster is running a version older than HDFS 1.0, or if your organization does not allow the WEBHDFS protocol. This protocol only supports the read operation.
    HAR
    Select this option to access Hadoop archive files. If you choose this option, specify the path to the archive file in the Path field. This protocol only supports the read operation.
  12. If you selected the WEBHDFS protocol, expand Advanced server options, review the settings, and make any changes that are necessary. (A client-side sketch that illustrates these options appears after this procedure.)
    Replication factor
    Specifies how many data nodes to replicate each block to. For example, the default setting of 3 replicates each block to three different nodes in the cluster. The maximum replication factor is 1024.
    Block size
    Specifies the size of each block. HDFS breaks up a file into blocks of the size you specify here. For example, if you specify the default 64 MB, each file is broken up into 64 MB blocks. Each block is then replicated to the number of nodes in the cluster specified in the Replication factor field.
    File permissions
    Specifies the level of access to files written to the HDFS cluster by Spectrum™ Technology Platform. You can specify read and write permissions for each of these options:
    Note: The Execute permission is not applicable to Spectrum™ Technology Platform.
    User
    This is the user specified above, either Server user or the user specified in the User name field.
    Group
    This refers to any group of which the user is a member. For example, if the user is john123, then Group permissions apply to any group of which john123 is a member.
    Other
    This refers to any other users as well as groups of which the specified user is not a member.

    In the grid below the File permissions table, specify the server properties for Hadoop to ensure that the sorting and filtering features work as desired when the connection is used in a stage or activity. (For an illustration of how these properties map to a standard Hadoop client configuration, see the sketches after this procedure.)

    To add a new property, click the add button. Then define the properties, as described in this table, based on the stage or activity that will use the Hadoop connection and on whether Hadoop 1.x or Hadoop 2.x is being used.

    Stage or activity using the HDFS connection, and the server properties it requires:
    • Stage Read from Sequence File
    • Activity Run Hadoop Pig
    Hadoop 1.x Parameters
    fs.default.name
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000

    mapred.job.tracker
    Specifies the hostname or IP address, and port on which the MapReduce job tracker runs. If the host name is entered as local, then jobs are run as a single map and reduce task.

    For example, 152.144.226.224:9001

    dfs.namenode.name.dir
    Specifies where on the local file system a DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

    For example, file:/home/hduser/Data/namenode

    dfs.datanode.data.dir
    Specifies where on the local file system a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all the named directories that are usually on different devices. Directories that do not exist are ignored.

    For example, file:/home/hduser/Data/datanode

    hadoop.tmp.dir
    Specifies the base location for other temporary directories.

    For example, /home/hduser/Data/tmp

    Hadoop 2.x Parameters

    fs.defaultFS
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000.

    Note: We recommend that the parameter name fs.defaultFS be used in Spectrum™ Technology Platform 11 SP1 and later.
    yarn.resourcemanager.resource-tracker.address
    Specifies the hostname or IP address, and port, of the Resource Manager.

    For example, 152.144.226.224:8025

    yarn.resourcemanager.scheduler.address
    Specifies the address of the Scheduler Interface.

    For example, 152.144.226.224:8030

    yarn.resourcemanager.address
    Specifies the address of the Applications Manager interface that is contained in the Resource Manager.

    For example, 152.144.226.224:8041

    mapreduce.jobhistory.address
    Specifies the host name or IP address, and port on which the MapReduce Job History Server is running.

    For example, 152.144.226.224:10020

    mapreduce.application.classpath
    Specifies the CLASSPATH for MapReduce applications, that is, the locations where classes related to MapReduce applications are found. Separate entries with a comma.
    For example:

    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,
    $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*

    mapreduce.app-submission.cross-platform
    Handles the platform issues that arise when your Spectrum™ Technology Platform server runs on a Windows machine and connects to Cloudera. If your Spectrum™ Technology Platform server and Cloudera are running on different operating systems, set this parameter to true. Otherwise, set it to false.
    Note: Cloudera does not support Windows clients. Configuring this parameter is a workaround for, not a solution to, platform-related issues.

    If you checked the Kerberos checkbox, also add these Kerberos configuration properties (illustrated in the Kerberos login sketch after this procedure):

    hadoop.security.authentication
    Specifies the type of authentication security being used. Enter the value kerberos.
    yarn.resourcemanager.principal
    Specifies the Kerberos principal used by the Resource Manager of your Hadoop YARN (Yet Another Resource Negotiator).

    For example, yarn/_HOST@HADOOP.COM

    dfs.namenode.kerberos.principal
    Specifies the Kerberos principal being used for the namenode of your Hadoop Distributed File System (HDFS).

    For example, hdfs/_HOST@HADOOP.COM

    dfs.datanode.kerberos.principal
    Specifies the Kerberos principal being used for the datanode of your Hadoop Distributed File System (HDFS).

    For example, hdfs/_HOST@HADOOP.COM

    • Stage Read from File
    • Stage Write to File
    • Stage Read from Hive ORC File
    • Stage Write to Hive ORC File
    Hadoop 1.x Parameters
    fs.default.name
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000

    Hadoop 2.x Parameters

    fs.defaultFS
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000.

    Note: We recommend that the parameter name fs.defaultFS be used in Spectrum™ Technology Platform 11 SP1 and later.
  13. To test the connection, click Test.
  14. Click Save.
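
The Replication factor, Block size, and File permissions options in step 12 correspond to values a Hadoop client supplies when it writes a file to HDFS. The following sketch is an illustration only, using the Hadoop Java FileSystem API; the output path and buffer size are hypothetical, and Spectrum™ Technology Platform applies these settings for you based on the connection definition.

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class AdvancedOptionsSketch {
        public static void writeExample(FileSystem fs) throws Exception {
            // Read and write for the user and group, read-only for others.
            // Execute is not used, matching the note in step 12.
            FsPermission permission =
                    new FsPermission(FsAction.READ_WRITE, FsAction.READ_WRITE, FsAction.READ);

            short replicationFactor = 3;        // default Replication factor in step 12
            long blockSize = 64L * 1024 * 1024; // default 64 MB Block size in step 12

            // Hypothetical output path; the 4096-byte buffer is an arbitrary client-side choice.
            try (FSDataOutputStream out = fs.create(
                    new Path("/tmp/spectrum-example.txt"),
                    permission, true, 4096, replicationFactor, blockSize, null)) {
                out.writeBytes("example");
            }
        }
    }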
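
The grid entries in step 12 are standard Hadoop client properties. For reference only, this sketch shows how the Hadoop 2.x parameters from the table might be expressed with the Hadoop Java Configuration API; the addresses, ports, and classpath are the placeholder values from the examples above, not values Spectrum™ Technology Platform requires.

    import org.apache.hadoop.conf.Configuration;

    public class Hadoop2ConnectionProperties {
        public static Configuration build() {
            Configuration conf = new Configuration();

            // Node and port on which Hadoop runs (placeholder address).
            conf.set("fs.defaultFS", "hdfs://152.144.226.224:9000");

            // YARN Resource Manager interfaces (placeholder host and ports).
            conf.set("yarn.resourcemanager.resource-tracker.address", "152.144.226.224:8025");
            conf.set("yarn.resourcemanager.scheduler.address", "152.144.226.224:8030");
            conf.set("yarn.resourcemanager.address", "152.144.226.224:8041");

            // MapReduce Job History Server and application classpath.
            conf.set("mapreduce.jobhistory.address", "152.144.226.224:10020");
            conf.set("mapreduce.application.classpath",
                    "$HADOOP_CONF_DIR,"
                  + "$HADOOP_COMMON_HOME/share/hadoop/common/*,"
                  + "$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,"
                  + "$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,"
                  + "$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,"
                  + "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,"
                  + "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,"
                  + "$HADOOP_YARN_HOME/share/hadoop/yarn/*,"
                  + "$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*");

            // True only when the Spectrum server and the Hadoop cluster run on
            // different operating systems.
            conf.setBoolean("mapreduce.app-submission.cross-platform", true);

            return conf;
        }
    }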
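
The Kerberos properties in the grid, together with the keytab file from step 10, are what a Hadoop client needs in order to authenticate. The sketch below shows a typical keytab login with the Hadoop UserGroupInformation API, for illustration only; the principal spectrum@HADOOP.COM and the keytab path are placeholders, and Spectrum™ Technology Platform performs the login itself using the Keytab file path you entered.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginSketch {
        public static void login() throws IOException {
            Configuration conf = new Configuration();

            // Kerberos configuration properties from the grid in step 12.
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("yarn.resourcemanager.principal", "yarn/_HOST@HADOOP.COM");
            conf.set("dfs.namenode.kerberos.principal", "hdfs/_HOST@HADOOP.COM");
            conf.set("dfs.datanode.kerberos.principal", "hdfs/_HOST@HADOOP.COM");

            UserGroupInformation.setConfiguration(conf);

            // Placeholder principal and keytab path; the keytab file must reside
            // on the machine performing the login (here, the Spectrum server).
            UserGroupInformation.loginUserFromKeytab(
                    "spectrum@HADOOP.COM", "/path/to/spectrum.keytab");
        }
    }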

After you have defined a connection to an HDFS cluster, it becomes available in source and sink stages in Enterprise Designer, such as Read from File and Write to File. You can select the HDFS cluster by clicking Remote Machine when you define a file in a source or sink stage.
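
The hostname, port, and protocol from steps 6, 7, and 11 combine into a single file system URI. As a point of reference, this sketch uses the Hadoop Java FileSystem API to list the root directory of a cluster over WEBHDFS; the address is a placeholder, the port shown is the NameNode HTTP port that WebHDFS typically uses (not the hdfs:// RPC port in the examples above), and Spectrum™ Technology Platform handles this communication internally once the connection is saved.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WebHdfsListingSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder NameNode host (step 6) and WebHDFS HTTP port (step 7).
            URI uri = URI.create("webhdfs://152.144.226.224:50070");

            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(uri, conf)) {
                // List the cluster's root directory to confirm the connection works.
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }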