Connecting to Hadoop

Connect to the Hadoop system to use stages such as Read from Hadoop Sequence File, Write to Hadoop Sequence File, Read From File, Write to File, Read From XML, Write to XML, Read From Hive File, Write to Hive File, and Read from HL7 File in Spectrum Enterprise Designer.

Attention: Spectrum Technology Platform does not support Hadoop 2.x for Kerberos on Windows platforms.

Follow these steps to connect to the Hadoop system:

Access the Connections page using one of these:

Spectrum Management Console:

Access Spectrum Management Console using the URL: http://server:port/managementconsole, where server is the server name or IP address of your Spectrum Technology Platform server and port is the HTTP port used by Spectrum Technology Platform.
Note: By default, the HTTP port is 8080.

Click Resources > Connections.

Spectrum Discovery:

Access Spectrum Discovery using the URL: http://server:port/discovery, where server is the server name or IP address of your Spectrum Technology Platform server and port is the HTTP port used by Spectrum Technology Platform.
Note: By default, the HTTP port is 8080.

Click Connect.
Click the Add connection button.
In the Connection Name box, enter a name for the connection. The name can be anything you choose.

Note: Once you save a connection you cannot change the name.
In the Connection Type field, choose HDFS.
In the Host field, enter the host name or IP address of the NameNode in the HDFS cluster.
In the Port field, enter the network port number.
In User, select one of these options:

Server user

Choose this option if authentication is enabled in your HDFS cluster. This option will use the user credentials that the Spectrum Technology Platform server runs under to authenticate to HDFS.

User name

Choose this option if authentication is disabled in your HDFS cluster.
Select the Kerberos check box to enable Kerberos authentication for this HDFS file server connection.
If you enabled Kerberos authentication, enter the path to the keytab file in the Keytab file path field.

Note: Ensure the keytab file is placed on the Spectrum Technology Platform server.
In the Protocol field, select one of these options:

WEBHDFS

Select this option if the HDFS cluster is running HDFS 1.0 or later. This protocol supports both read and write operations.

HFTP

Select this option if the HDFS cluster is running a version older than HDFS 1.0, or if your organization does not allow the WEBHDFS protocol. This protocol only supports the read operation.

HAR

Select this option to access Hadoop archive files. If you choose this option, specify the path to the archive file in the Path field. This protocol only supports the read operation.
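
The read and write capabilities of the three protocols described above can be summarized in a small lookup. This is a minimal sketch for illustration only: the PROTOCOL_OPERATIONS table and the supports helper are hypothetical names and are not part of Spectrum Technology Platform.

```python
# Hypothetical lookup that restates the protocol descriptions above.
# It is an illustration only, not a Spectrum Technology Platform API.
PROTOCOL_OPERATIONS = {
    "WEBHDFS": {"read", "write"},  # HDFS 1.0 or later
    "HFTP": {"read"},              # older than HDFS 1.0, or WEBHDFS not allowed
    "HAR": {"read"},               # Hadoop archive files; also needs a Path value
}


def supports(protocol: str, operation: str) -> bool:
    """Return True if the named protocol supports the given operation."""
    return operation in PROTOCOL_OPERATIONS[protocol.upper()]
```

A lookup like this makes it easy to see that only WEBHDFS is suitable for a connection that will be used by a sink (write) stage.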
Expand the Advanced options.
If you selected the WEBHDFS protocol, you can specify these advanced options as required:

Replication factor

Specifies how many data nodes to replicate each block to. For example, the default setting of 3 replicates each block to three different nodes in the cluster. The maximum replication factor is 1024.

Block size

Specifies the size of each block. HDFS breaks up a file into blocks of the size you specify here. For example, if you specify the default 64 MB, each file is broken up into 64 MB blocks. Each block is then replicated to the number of nodes in the cluster specified in the Replication factor field.

File permissions

Specifies the level of access to files written to the HDFS cluster by Spectrum Technology Platform. You can specify read and write permissions for each of these options:
Note: The Execute permission is not applicable to Spectrum Technology Platform.

User

This is the user specified above, either Server user or the user specified in the User name field.

Group

This refers to any group of which the user is a member. For example, if the user is john123, then Group permissions apply to any group of which john123 is a member.

Other

This refers to any other users as well as groups of which the specified user is not a member.
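
The arithmetic behind the advanced options above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the function names are hypothetical, the minimum replication factor of 1 is assumed, and the octal mode string uses the common POSIX-style notation (read = 4, write = 2), which this topic does not itself mention.

```python
MB = 1024 * 1024  # bytes in one megabyte (binary)


def block_count(file_size: int, block_size: int = 64 * MB) -> int:
    """Return how many blocks a file of file_size bytes is split into."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Integer ceiling division: the last block may be only partly filled.
    return (file_size + block_size - 1) // block_size


def stored_block_copies(file_size: int, block_size: int = 64 * MB,
                        replication_factor: int = 3) -> int:
    """Return the total number of block copies stored across the cluster."""
    # The upper bound of 1024 comes from the text; the lower bound is assumed.
    if not 1 <= replication_factor <= 1024:
        raise ValueError("replication_factor must be between 1 and 1024")
    return block_count(file_size, block_size) * replication_factor


def permission_mode(user: str, group: str, other: str) -> str:
    """Build an octal mode string from 'r'/'w' flags for each class."""
    weights = {"r": 4, "w": 2}  # execute (1) is left out; it does not apply
    digits = []
    for flags in (user, group, other):
        digits.append(str(sum(weights[flag] for flag in set(flags))))
    return "".join(digits)
```

With the defaults, a 200 MB file is split into 4 blocks (three full 64 MB blocks and one partial block), and a replication factor of 3 stores 12 block copies in total. Granting read and write to User, read to Group, and nothing to Other corresponds to permission_mode("rw", "r", ""), which yields "640".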
Use the file permissions and parameters descriptions below to define the Hadoop server properties so that the sorting and filtering features work as expected when the connection is used in a stage or activity. To add properties, complete one of these steps:
- Enter the properties and their respective values in the Property and Value fields.
- Upload your configuration XML file. The XML file should be similar to hdfs-site.xml, yarn-site.xml, or core-site.xml.
  Note: Place the configuration file on the server.
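
For the upload option, the file follows the usual Hadoop configuration layout: a configuration root element that contains one property element, with name and value children, per setting. The sketch below generates such a file from a dictionary. The build_hadoop_config function is a hypothetical helper, and the two property values are the example values used elsewhere in this topic; replace them with the values for your cluster.

```python
import xml.etree.ElementTree as ET


def build_hadoop_config(properties: dict) -> str:
    """Return Hadoop-style configuration XML for the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # Pretty-print when available (ElementTree.indent exists in Python 3.9+).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Example values taken from this topic; they are placeholders, not defaults.
example_xml = build_hadoop_config({
    "fs.defaultFS": "hdfs://152.144.226.224:9000",
    "yarn.resourcemanager.address": "152.144.226.224:8041",
})
```

Write the resulting string to an XML file, place that file on the server, and then upload it from the Connections page.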
File permissions and parameters - Hadoop 1.x
This section applies to this stage and activity:
- Stage - Read from Sequence File
- Activity - Run Hadoop Pig
fs.default.name

Specifies the node and port on which Hadoop runs. For example, hdfs://152.144.226.224:9000

mapred.job.tracker

Specifies the host name or IP address, and port on which the MapReduce job tracker runs. If the host name is entered as local, then jobs are run as a single map and reduce task. For example, 152.144.226.224:9001

dfs.namenode.name.dir

Specifies where on the local file system a DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy. For example, file:/home/hduser/Data/namenode

hadoop.tmp.dir

Specifies the base location for other temporary directories. For example, /home/hduser/Data/tmp
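
Before entering the Hadoop 1.x values above, it can help to check that they have the expected shape. The following is a minimal sketch that uses the example values from this section; the dictionary and the two checking functions are hypothetical helpers, not part of Spectrum Technology Platform or Hadoop.

```python
from urllib.parse import urlparse

# Example values copied from the descriptions above.
HADOOP_1X_PROPERTIES = {
    "fs.default.name": "hdfs://152.144.226.224:9000",
    "mapred.job.tracker": "152.144.226.224:9001",
    "dfs.namenode.name.dir": "file:/home/hduser/Data/namenode",
    "hadoop.tmp.dir": "/home/hduser/Data/tmp",
}


def check_fs_uri(value: str):
    """Return (host, port) for an hdfs://host:port value, or raise ValueError."""
    parsed = urlparse(value)
    if parsed.scheme != "hdfs" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected hdfs://host:port, got {value!r}")
    return parsed.hostname, parsed.port


def check_host_port(value: str):
    """Return (host, port) for a host:port value, or None for 'local'."""
    if value == "local":
        return None  # jobs run as a single map and reduce task
    host, separator, port = value.rpartition(":")
    if not separator or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    return host, int(port)
```

For example, check_fs_uri(HADOOP_1X_PROPERTIES["fs.default.name"]) returns the host 152.144.226.224 and the port 9000, while a value without a port raises ValueError.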
File permissions and parameters - Hadoop 2.x
This section applies to this stage and activity:
- Stage - Read from Sequence File
- Activity - Run Hadoop Pig
fs.defaultFS

Specifies the node and port on which Hadoop runs. For example, hdfs://152.144.226.224:9000

Note: For Spectrum versions 11.0 and earlier, you must use the parameter name fs.defaultfs. Note the case difference. For versions 11.0 SP1 and later, both fs.defaultfs and fs.defaultFS are valid. We recommend using fs.defaultFS for releases 11.0 SP1 and later.

yarn.resourcemanager.resource-tracker.address

Specifies the host name or IP address of the Resource Manager. For example, 152.144.226.224:8025

yarn.resourcemanager.scheduler.address

Specifies the address of the Scheduler Interface. For example, 152.144.226.224:8030

yarn.resourcemanager.address

Specifies the address of the Applications Manager interface that is contained in the Resource Manager. For example, 152.144.226.224:8041

mapreduce.jobhistory.address

Specifies the host name or IP address, and port on which the MapReduce Job History Server is running. For example, 152.144.226.224:10020

mapreduce.application.classpath

Specifies the CLASSPATH for MapReduce applications. This CLASSPATH denotes the location where classes related to MapReduce applications are found. The entries should be comma-separated.

For example:

$HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*

mapreduce.app-submission.cross-platform

Handles platform issues that arise when your Spectrum server runs on a Windows machine and you install Cloudera on it. If your Spectrum server and Cloudera run on different operating systems, set this parameter to true. Otherwise, set it to false.
Note: Cloudera does not support Windows clients. Configuring this parameter is a workaround, and not a solution to all resulting platform issues.
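
Two of the Hadoop 2.x values above are derived rather than copied: the comma-separated class path and the cross-platform flag. The sketch below shows one way to assemble them. The function names and the operating system labels passed to cross_platform_value are hypothetical; the class path entries are the ones from the example above.

```python
# Class path entries from the mapreduce.application.classpath example above.
CLASSPATH_ENTRIES = [
    "$HADOOP_CONF_DIR",
    "$HADOOP_COMMON_HOME/share/hadoop/common/*",
    "$HADOOP_COMMON_HOME/share/hadoop/common/lib/*",
    "$HADOOP_HDFS_HOME/share/hadoop/hdfs/*",
    "$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*",
    "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*",
    "$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*",
    "$HADOOP_YARN_HOME/share/hadoop/yarn/*",
    "$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*",
]


def classpath_value(entries) -> str:
    """Join class path entries into one comma-separated property value."""
    return ",".join(entry.strip() for entry in entries if entry.strip())


def cross_platform_value(spectrum_os: str, cluster_os: str) -> str:
    """Return 'true' when the two operating systems differ, else 'false'."""
    different = spectrum_os.strip().lower() != cluster_os.strip().lower()
    return "true" if different else "false"
```

For a Spectrum server on Windows and a cluster on Linux, cross_platform_value("Windows", "Linux") gives "true", which is the value to enter for mapreduce.app-submission.cross-platform.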
File permissions and parameters - Kerberos
This section applies to this stage and activity:
- Stage - Read from Sequence File
- Activity - Run Hadoop Pig
If you have selected the Kerberos check box, add these Kerberos configuration properties:

hadoop.security.authentication

The type of authentication security being used. Enter the value kerberos.

yarn.resourcemanager.principal

The Kerberos principal being used for the resource manager for your Hadoop YARN resource negotiator. For example: yarn/_HOST@HADOOP.COM

dfs.namenode.kerberos.principal

The Kerberos principal being used for the namenode of your Hadoop Distributed File System (HDFS). For example, hdfs/_HOST@HADOOP.COM

dfs.datanode.kerberos.principal

The Kerberos principal being used for the data node of your Hadoop Distributed File System (HDFS). For example, hdfs/_HOST@HADOOP.COM
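
The example principals above contain the _HOST placeholder, which Hadoop replaces with the host name of the service at run time. The following sketch collects the four Kerberos properties and mimics that substitution so that you can preview the resulting principal. The expand_principal helper and the host name namenode01.example.com are hypothetical and used for illustration only.

```python
# Kerberos properties with the example values from this section.
KERBEROS_PROPERTIES = {
    "hadoop.security.authentication": "kerberos",
    "yarn.resourcemanager.principal": "yarn/_HOST@HADOOP.COM",
    "dfs.namenode.kerberos.principal": "hdfs/_HOST@HADOOP.COM",
    "dfs.datanode.kerberos.principal": "hdfs/_HOST@HADOOP.COM",
}


def expand_principal(principal: str, hostname: str) -> str:
    """Preview a principal with the _HOST placeholder filled in."""
    return principal.replace("_HOST", hostname)


# Hypothetical host name, used only to show the shape of the result.
preview = expand_principal(
    KERBEROS_PROPERTIES["dfs.namenode.kerberos.principal"],
    "namenode01.example.com",
)
```

Enter the principals with the _HOST placeholder as shown in the examples; the expanded form is only a preview of what the cluster resolves for a given host.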
File permissions and parameters - Hadoop 1.x
This section applies to these stages:
- Stage - Read from File
- Stage - Write to File
- Stage - Read from Hive ORC File
- Stage - Write to Hive ORC File
fs.default.name

Specifies the node and port on which Hadoop runs. For example, hdfs://152.144.226.224:9000
File permissions and parameters - Hadoop 2.x
This section applies to these stages:
- Stage - Read from File
- Stage - Write to File
- Stage - Read from Hive ORC File
- Stage - Write to Hive ORC File
fs.defaultFS

Specifies the node and port on which Hadoop runs. For example, hdfs://152.144.226.224:9000

Note: For Spectrum versions 11.0 and earlier, you must use the parameter name fs.defaultfs. Note the case difference. For versions 11.0 SP1 and later, both fs.defaultfs and fs.defaultFS are valid. We recommend using fs.defaultFS for releases 11.0 SP1 and later.
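
The naming rule in the note above can be expressed as a short version check. This is a sketch only: the function name is hypothetical, and representing a Spectrum release as a (major, minor, service pack) tuple is an assumption made for the example.

```python
def default_fs_property_name(major: int, minor: int = 0, service_pack: int = 0) -> str:
    """Return the recommended default file system property name for a release."""
    # Releases 11.0 SP1 and later accept both spellings; fs.defaultFS is
    # the recommended one. Earlier releases require fs.defaultfs.
    if (major, minor, service_pack) >= (11, 0, 1):
        return "fs.defaultFS"
    return "fs.defaultfs"
```

For example, default_fs_property_name(11, 0, 0) selects fs.defaultfs, and default_fs_property_name(11, 0, 1) selects fs.defaultFS.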
To test the connection, click Test.
Click Save.

After you have defined a connection to an HDFS cluster, it becomes available in source and sink stages in Spectrum Enterprise Designer, such as Read from File and Write to File. When defining a file in a source or sink stage, click Remote Machine to select the HDFS cluster.