Connecting to Hadoop

In order for Spectrum™ Technology Platform to access data in Hadoop, you must define a connection to Hadoop using Management Console. Once you do this, you can create flows in Enterprise Designer that can read data from, and write data to, Hadoop.

Attention: Spectrum™ Technology Platform does not support Kerberos authentication with Hadoop 2.x on Windows.
  1. Open Management Console.
  2. Go to Resources > Data Sources.
  3. Click the Add button.
  4. In the Name field, enter a name for the connection. The name can be anything you choose.
    Note: Once you save a connection you cannot change the name.
  5. In the Type field, choose HDFS.
  6. In the Host field, enter the hostname or IP address of the NameNode in the HDFS cluster.
  7. In the Port field, enter the network port on which the NameNode listens.
  8. In the User field, select the method for authenticating to HDFS:
    Server user
    Select this option if authentication is enabled in your HDFS cluster. This option authenticates to HDFS with the credentials of the user account under which the Spectrum™ Technology Platform server runs.
    User name
    Select this option if authentication is disabled in your HDFS cluster.
  9. Check Kerberos if you want to enable Kerberos authentication for this HDFS file server connection.
  10. If you enabled Kerberos authentication, enter the path to the keytab file in the Keytab file path field.
    Note: The keytab file must be on the Spectrum™ Technology Platform server.
  11. In the Protocol field, select the method of communication with HDFS:
    WEBHDFS
    Select this option if the HDFS cluster is running HDFS 1.0 or later. This protocol supports both read and write operations. (A client-side sketch that uses this protocol appears at the end of this topic.)
    HFTP
    Select this option if the HDFS cluster is running a version older than HDFS 1.0, or if your organization does not allow the WEBHDFS protocol. This protocol only supports the read operation.
    HAR
    Select this option to access Hadoop archive files. If you choose this option, specify the path to the archive file in the Path field. This protocol only supports the read operation.
  12. If you selected the WEBHDFS protocol, expand Advanced server options, review the settings, and make any changes that are necessary. (A client-side sketch that illustrates these options appears after this procedure.)
    Replication factor
    Specifies how many data nodes to replicate each block to. For example, the default setting of 3 replicates each block to three different nodes in the cluster. The maximum replication factor is 1024.
    Block size
    Specifies the size of each block. HDFS breaks up a file into blocks of the size you specify here. For example, if you specify the default 64 MB, each file is broken up into 64 MB blocks. Each block is then replicated to the number of nodes in the cluster specified in the Replication factor field.
    File permissions
    Specifies the level of access to files written to the HDFS cluster by Spectrum™ Technology Platform. You can specify read and write permissions for each of these options:
    Note: The Execute permission is not applicable to Spectrum™ Technology Platform.
    User
    This is the user specified above, either Server user or the user specified in the User name field.
    Group
    This refers to any group of which the user is a member. For example, if the user is john123, then Group permissions apply to any group of which john123 is a member.
    Other
    This refers to any other users as well as groups of which the specified user is not a member.

    In the grid below the File permissions table, specify the server properties for Hadoop to ensure that the sorting and filtering features work as desired when the connection is used in a stage or activity. (For an illustration of how these properties map to a standard Hadoop client configuration, see the sketches after this procedure.)

    To add a new property, click the add button. Then define the properties, as described in this table, based on the stage or activity that will use the Hadoop connection and on whether Hadoop 1.x or Hadoop 2.x is being used.

    Stage or activity using the HDFS connection, and the server properties it requires:
    • Stage Read from Sequence File
    • Activity Run Hadoop Pig
    Hadoop 1.x Parameters
    fs.default.name
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000

    mapred.job.tracker
    Specifies the hostname or IP address, and port on which the MapReduce job tracker runs. If the host name is entered as local, then jobs are run as a single map and reduce task.

    For example, 152.144.226.224:9001

    dfs.namenode.name.dir
    Specifies where on the local file system a DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

    For example, file:/home/hduser/Data/namenode

    dfs.datanode.data.dir
    Specifies where on the local file system a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all the named directories that are usually on different devices. Directories that do not exist are ignored.

    For example, file:/home/hduser/Data/datanode

    hadoop.tmp.dir
    Specifies the base location for other temporary directories.

    For example, /home/hduser/Data/tmp

    Hadoop 2.x Parameters

    fs.defaultFS
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000.

    Note: We recommend that the parameter name fs.defaultFS be used in Spectrum™ Technology Platform 11 SP1 and later.
    yarn.resourcemanager.resource-tracker.address
    Specifies the hostname or IP address, and port, of the Resource Manager.

    For example, 152.144.226.224:8025

    yarn.resourcemanager.scheduler.address
    Specifies the address of the Scheduler Interface.

    For example, 152.144.226.224:8030

    yarn.resourcemanager.address
    Specifies the address of the Applications Manager interface that is contained in the Resource Manager.

    For example, 152.144.226.224:8041

    mapreduce.jobhistory.address
    Specifies the host name or IP address, and port on which the MapReduce Job History Server is running.

    For example, 152.144.226.224:10020

    mapreduce.application.classpath
    Specifies the CLASSPATH for MapReduce applications, that is, the locations where classes related to MapReduce applications are found. Separate entries with a comma.
    For example:

    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,
    $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*

    mapreduce.app-submission.cross-platform
    Handles the platform issues that arise when your Spectrum™ Technology Platform server runs on a Windows machine and connects to Cloudera. If your Spectrum™ Technology Platform server and Cloudera are running on different operating systems, set this parameter to true. Otherwise, set it to false.
    Note: Cloudera does not support Windows clients. Configuring this parameter is a workaround for, not a solution to, platform-related issues.

    If you checked the Kerberos checkbox, also add these Kerberos configuration properties (illustrated in the Kerberos login sketch after this procedure):

    hadoop.security.authentication
    Specifies the type of authentication security being used. Enter the value kerberos.
    yarn.resourcemanager.principal
    Specifies the Kerberos principal used by the Resource Manager of your Hadoop YARN (Yet Another Resource Negotiator).

    For example, yarn/_HOST@HADOOP.COM

    dfs.namenode.kerberos.principal
    Specifies the Kerberos principal being used for the namenode of your Hadoop Distributed File System (HDFS).

    For example, hdfs/_HOST@HADOOP.COM

    dfs.datanode.kerberos.principal
    Specifies the Kerberos principal being used for the datanode of your Hadoop Distributed File System (HDFS).

    For example, hdfs/_HOST@HADOOP.COM

    • Stage Read from File
    • Stage Write to File
    • Stage Read from Hive ORC File
    • Stage Write to Hive ORC File
    Hadoop 1.x Parameters
    fs.default.name
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000

    Hadoop 2.x Parameters

    fs.defaultFS
    Specifies the node and port on which Hadoop runs.

    For example, hdfs://152.144.226.224:9000.

    Note: We recommend that the parameter name fs.defaultFS be used in Spectrum™ Technology Platform 11 SP1 and later.
  13. To test the connection, click Test.
  14. Click Save.
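
The Replication factor, Block size, and File permissions options in step 12 correspond to values a Hadoop client supplies when it writes a file to HDFS. The following sketch is an illustration only, using the Hadoop Java FileSystem API; the output path and buffer size are hypothetical, and Spectrum™ Technology Platform applies these settings for you based on the connection definition.

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class AdvancedOptionsSketch {
        public static void writeExample(FileSystem fs) throws Exception {
            // Read and write for the user and group, read-only for others.
            // Execute is not used, matching the note in step 12.
            FsPermission permission =
                    new FsPermission(FsAction.READ_WRITE, FsAction.READ_WRITE, FsAction.READ);

            short replicationFactor = 3;        // default Replication factor in step 12
            long blockSize = 64L * 1024 * 1024; // default 64 MB Block size in step 12

            // Hypothetical output path; the 4096-byte buffer is an arbitrary client-side choice.
            try (FSDataOutputStream out = fs.create(
                    new Path("/tmp/spectrum-example.txt"),
                    permission, true, 4096, replicationFactor, blockSize, null)) {
                out.writeBytes("example");
            }
        }
    }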
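
The grid entries in step 12 are standard Hadoop client properties. For reference only, this sketch shows how the Hadoop 2.x parameters from the table might be expressed with the Hadoop Java Configuration API; the addresses, ports, and classpath are the placeholder values from the examples above, not values Spectrum™ Technology Platform requires.

    import org.apache.hadoop.conf.Configuration;

    public class Hadoop2ConnectionProperties {
        public static Configuration build() {
            Configuration conf = new Configuration();

            // Node and port on which Hadoop runs (placeholder address).
            conf.set("fs.defaultFS", "hdfs://152.144.226.224:9000");

            // YARN Resource Manager interfaces (placeholder host and ports).
            conf.set("yarn.resourcemanager.resource-tracker.address", "152.144.226.224:8025");
            conf.set("yarn.resourcemanager.scheduler.address", "152.144.226.224:8030");
            conf.set("yarn.resourcemanager.address", "152.144.226.224:8041");

            // MapReduce Job History Server and application classpath.
            conf.set("mapreduce.jobhistory.address", "152.144.226.224:10020");
            conf.set("mapreduce.application.classpath",
                    "$HADOOP_CONF_DIR,"
                  + "$HADOOP_COMMON_HOME/share/hadoop/common/*,"
                  + "$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,"
                  + "$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,"
                  + "$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,"
                  + "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,"
                  + "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,"
                  + "$HADOOP_YARN_HOME/share/hadoop/yarn/*,"
                  + "$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*");

            // True only when the Spectrum server and the Hadoop cluster run on
            // different operating systems.
            conf.setBoolean("mapreduce.app-submission.cross-platform", true);

            return conf;
        }
    }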
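
The Kerberos properties in the grid, together with the keytab file from step 10, are what a Hadoop client needs in order to authenticate. The sketch below shows a typical keytab login with the Hadoop UserGroupInformation API, for illustration only; the principal spectrum@HADOOP.COM and the keytab path are placeholders, and Spectrum™ Technology Platform performs the login itself using the Keytab file path you entered.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginSketch {
        public static void login() throws IOException {
            Configuration conf = new Configuration();

            // Kerberos configuration properties from the grid in step 12.
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("yarn.resourcemanager.principal", "yarn/_HOST@HADOOP.COM");
            conf.set("dfs.namenode.kerberos.principal", "hdfs/_HOST@HADOOP.COM");
            conf.set("dfs.datanode.kerberos.principal", "hdfs/_HOST@HADOOP.COM");

            UserGroupInformation.setConfiguration(conf);

            // Placeholder principal and keytab path; the keytab file must reside
            // on the machine performing the login (here, the Spectrum server).
            UserGroupInformation.loginUserFromKeytab(
                    "spectrum@HADOOP.COM", "/path/to/spectrum.keytab");
        }
    }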

After you have defined a connection to an HDFS cluster, it becomes available in source and sink stages in Enterprise Designer, such as Read from File and Write to File. You can select the HDFS cluster by clicking Remote Machine when you define a file in a source or sink stage.
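
The hostname, port, and protocol from steps 6, 7, and 11 combine into a single file system URI. As a point of reference, this sketch uses the Hadoop Java FileSystem API to list the root directory of a cluster over WEBHDFS; the address is a placeholder, the port shown is the NameNode HTTP port that WebHDFS typically uses (not the hdfs:// RPC port in the examples above), and Spectrum™ Technology Platform handles this communication internally once the connection is saved.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WebHdfsListingSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder NameNode host (step 6) and WebHDFS HTTP port (step 7).
            URI uri = URI.create("webhdfs://152.144.226.224:50070");

            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(uri, conf)) {
                // List the cluster's root directory to confirm the connection works.
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
            }
        }
    }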