Connecting to Hadoop
Connect to the Hadoop system to use stages such as Read from Hadoop Sequence File, Write to Hadoop Sequence File, Read from File, Write to File, Read from XML, Write to XML, Read from Hive File, Write to Hive File, and Read from HL7 File in Spectrum Enterprise Designer.
Attention: Spectrum Technology Platform does not support Hadoop 2.x for Kerberos on Windows platforms.
Follow these steps to connect to the Hadoop system:
-
Access the Connections page using one of these:
- Spectrum Management Console:
- Access Spectrum Management Console using the URL: http://server:port/managementconsole, where server is the server name or IP address of your Spectrum Technology Platform server and port is the HTTP port used by Spectrum Technology Platform.
Note: By default, the HTTP port is 8080.
- Spectrum Discovery:
- Access Spectrum Discovery using the URL: http://server:port/discovery, where server is the server name or IP address of your Spectrum Technology Platform server and port is the HTTP port used by Spectrum Technology Platform.
Note: By default, the HTTP port is 8080.
- Click the Add connection button.
-
In the Connection Name box, enter a name for the connection. The name can be anything you choose.
Note: Once you save a connection you cannot change the name.
- In the Connection Type field, choose HDFS.
- In the Host field, enter the host name or IP address of the NameNode in the HDFS cluster.
- In the Port field, enter the network port number.
-
In User, select one of these options:
- Server user
- Choose this option if authentication is enabled in your HDFS cluster. This option will use the user credentials that the Spectrum Technology Platform server runs under to authenticate to HDFS.
- User name
- Choose this option if authentication is disabled in your HDFS cluster.
- Select Kerberos to enable Kerberos authentication for this HDFS file server connection.
-
If you enabled Kerberos authentication, enter the path to the keytab file in the Keytab file path field.
Note: Ensure the keytab file is placed on the Spectrum Technology Platform server.
-
In the Protocol field, select one of:
- WEBHDFS
- Select this option if the HDFS cluster is running HDFS 1.0 or later. This protocol supports both read and write operations.
- HFTP
- Select this option if the HDFS cluster is running a version older than HDFS 1.0, or if your organization does not allow the WEBHDFS protocol. This protocol only supports the read operation.
- HAR
- Select this option to access Hadoop archive files. If you choose this option, specify the path to the archive file in the Path field. This protocol only supports the read operation.
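As a rough illustration of what the WEBHDFS protocol looks like on the wire, the sketch below builds a WebHDFS REST URL for a directory listing. The host name, port 50070, path, and user name are placeholder assumptions, not values taken from this procedure; use the values your cluster administrator provides. Opening the resulting URL from the Spectrum Technology Platform server is one possible way to check that the NameNode accepts WebHDFS requests before you define the connection.

```python
from urllib.parse import quote, urlencode


def webhdfs_list_url(host, port, path, user=None):
    """Build a WebHDFS LISTSTATUS URL for the given HDFS path."""
    # WebHDFS exposes the file system under the /webhdfs/v1 prefix.
    query = {"op": "LISTSTATUS"}
    if user:
        # user.name is only honored when security is off on the cluster.
        query["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


# Hypothetical values for illustration only.
url = webhdfs_list_url("namenode.example.com", 50070, "/user/spectrum", "spectrum")
print(url)
```

A Kerberos-enabled cluster would not accept the user.name parameter, so this check only fits clusters where authentication is disabled.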
- Expand the Advanced options.
-
If you selected the WEBHDFS protocol, you can specify these advanced options as
required:
- Replication factor
- Specifies how many data nodes to replicate each block to. For example, the default setting of 3 replicates each block to three different nodes in the cluster. The maximum replication factor is 1024.
- Block size
- Specifies the size of each block. HDFS breaks up a file into blocks of the size you specify here. For example, if you specify the default 64 MB, each file is broken up into 64 MB blocks. Each block is then replicated to the number of nodes in the cluster specified in the Replication factor field.
- File permissions
- Specifies the level of access to files written to the HDFS cluster by Spectrum Technology Platform. You can specify read and write permissions for each of these options:
Note: The Execute permission is not applicable to Spectrum Technology Platform.
- User
- This is the user specified above, either Server user or the user specified in the User name field.
- Group
- This refers to any group of which the user is a member. For example, if the user is john123, then Group permissions apply to any group of which john123 is a member.
- Other
- This refers to any other users as well as groups of which the specified user is not a member.
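To make the effect of these advanced options more concrete, the sketch below works through the arithmetic for a single file and shows how read and write choices for User, Group, and Other map to the octal notation commonly used for HDFS permissions. The function names and the 200 MB file size are hypothetical, and the octal form is an assumption about how you might see the result reported by HDFS tools, not something the connection dialog displays.

```python
import math


def block_layout(file_size_mb, block_size_mb=64, replication=3):
    """Return (blocks, block_replicas, raw_storage_mb) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is copied to `replication` data nodes.
    block_replicas = blocks * replication
    # The last block only occupies the space it needs,
    # so raw usage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


def permission_mode(user="rw", group="r", other=""):
    """Convert read/write flags per class into an octal mode string."""
    bits = {"r": 4, "w": 2}  # Execute (1) is never set here.
    return "".join(
        str(sum(bits[flag] for flag in flags)) for flags in (user, group, other)
    )


print(block_layout(200))   # 200 MB file with the default block size and replication
print(permission_mode())   # user read/write, group read, other none
```

With the defaults, a 200 MB file is split into four blocks, stored as twelve block replicas, and a user read/write, group read, other none selection corresponds to mode 640.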
-
Use the File permissions and parameters descriptions below to define the server properties for Hadoop, so that the sorting and filtering features work as expected when the connection is used in a stage or activity. To add properties, complete one of these steps:
- Add the properties and their respective values in the Property and Value fields.
- Upload your configuration XML file. The XML file should be similar to hdfs-site.xml, yarn-site.xml, or core-site.xml.
Note: Place the configuration file on the server.
File permissions and parameters - Hadoop 1.x
This section applies to this stage and activity:
- Stage - Read from Sequence File
- Activity - Run Hadoop Pig
- fs.default.name
- Specifies the node and port on which Hadoop runs. For example, hdfs://152.144.226.224:9000
- mapred.job.tracker
- Specifies the host name or IP address, and port on which the MapReduce job tracker runs. If the host name is entered as local, then jobs are run as a single map and reduce task. For example, 152.144.226.224:9001
- dfs.namenode.name.dir
- Specifies where on the local file system a DFS name node should store the name table. If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy. For example, file:/home/hduser/Data/namenode
- hadoop.tmp.dir
- Specifies the base location for other temporary directories. For example, /home/hduser/Data/tmp
File permissions and parameters - Hadoop 2.x
This section applies to this stage and activity:
- Stage - Read from Sequence File
- Activity - Run Hadoop Pig
- fs.defaultFS
- Specifies the node and port on which Hadoop runs. For example, hdfs://152.144.226.224:9000.
- yarn.resourcemanager.resource-tracker.address
- Specifies the host name or IP address of the Resource Manager. For example, 152.144.226.224:8025
- yarn.resourcemanager.scheduler.address
- Specifies the address of the Scheduler Interface. For example, 152.144.226.224:8030
- yarn.resourcemanager.address
- Specifies the address of the Applications Manager interface that is contained in the Resource Manager. For example, 152.144.226.224:8041
- mapreduce.jobhistory.address
- Specifies the host name or IP address, and port on which the MapReduce Job History Server is running. For example, 152.144.226.224:10020
- mapreduce.application.classpath
- Specifies the CLASSPATH for MapReduce applications. This CLASSPATH denotes the location where classes related to MapReduce applications are found. The entries should be comma-separated.
- mapreduce.app-submission.cross-platform
- Handles various platform issues that arise if your Spectrum server runs on a Windows machine and you install Cloudera on it. If your Spectrum server and Cloudera run on different operating systems, set this parameter to true. Otherwise, set it to false.
Note: Cloudera does not support Windows clients. Configuring this parameter is a workaround, and not a solution to all resulting platform issues.
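If you prefer to upload a configuration XML file rather than enter each property by hand, a file in the usual Hadoop site-file layout (a configuration element containing property entries with name and value) can be generated with a short script. The sketch below is one way to do that; the function name is hypothetical, the addresses are the example values from the descriptions above, and the value chosen for mapreduce.app-submission.cross-platform is only an illustration.

```python
import xml.etree.ElementTree as ET


def build_hadoop_config(properties):
    """Serialize a dict of properties in the Hadoop *-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Example values copied from the property descriptions; replace with your own.
hadoop2_properties = {
    "fs.defaultFS": "hdfs://152.144.226.224:9000",
    "yarn.resourcemanager.resource-tracker.address": "152.144.226.224:8025",
    "yarn.resourcemanager.scheduler.address": "152.144.226.224:8030",
    "yarn.resourcemanager.address": "152.144.226.224:8041",
    "mapreduce.jobhistory.address": "152.144.226.224:10020",
    "mapreduce.app-submission.cross-platform": "true",  # illustrative choice
}

xml_text = build_hadoop_config(hadoop2_properties)
print(xml_text)
```

Save the printed text to a file on the Spectrum Technology Platform server before uploading it.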
File permissions and parameters - Kerberos
This section applies to this stage and activity:
- Stage - Read from Sequence File
- Activity - Run Hadoop Pig
- hadoop.security.authentication
- The type of authentication security being used. Enter the value kerberos.
- yarn.resourcemanager.principal
- The Kerberos principal being used for the resource manager for your Hadoop YARN resource negotiator. For example: yarn/_HOST@HADOOP.COM
- dfs.namenode.kerberos.principal
- The Kerberos principal being used for the namenode of your Hadoop Distributed File System (HDFS). For example, hdfs/_HOST@HADOOP.COM
- dfs.datanode.kerberos.principal
- The Kerberos principal being used for the data node of your Hadoop Distributed File System (HDFS). For example, hdfs/_HOST@HADOOP.COM
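The _HOST token in these example principals is a placeholder that Hadoop services replace with their own fully qualified host name at run time, so you normally enter it literally. The sketch below only illustrates that substitution; the function name and the host name nn1.example.com are hypothetical.

```python
def expand_principal(principal, hostname):
    """Show what a principal looks like once _HOST is substituted."""
    # Hadoop uses the lower-cased fully qualified host name.
    return principal.replace("_HOST", hostname.lower())


# Hypothetical NameNode host, for illustration only.
print(expand_principal("hdfs/_HOST@HADOOP.COM", "NN1.example.com"))
```

Comparing the expanded form against the principals in your keytab file can help when diagnosing authentication failures.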
File permissions and parameters - Hadoop 1.x
This section applies to these stages:
- Stage - Read from File
- Stage - Write to File
- Stage - Read from Hive ORC File
- Stage - Write to Hive ORC File
- fs.default.name
- Specifies the node and port on which Hadoop runs. For example, hdfs://152.144.226.224:9000
File permissions and parameters - Hadoop 2.x
This section applies to these stages:
- Stage - Read from File or Write to File
- Stage - Read from Hive ORC File or Write to Hive ORC File
- fs.defaultFS
- Specifies the node and port on which Hadoop runs. For example, hdfs://152.144.226.224:9000
- To test the connection, click Test.
- Click Save.
After you have defined a connection to an HDFS cluster, it becomes available in source and sink stages in Spectrum Enterprise Designer, such as Read from File and Write to File. To select the HDFS cluster, click Remote Machine when defining a file in a source or sink stage.