Write to Hive File

The Write to Hive File stage writes the dataflow input to the specified output Hive file.

You can select any of these supported Hive file formats for the output file: ORC, Parquet, and Avro.

The stage supports reading data from and writing data to HDFS 3.x. This support includes:
  • Connectivity to HDFS and Hive from Spectrum on Windows
  • Support for, and connectivity to, Hadoop 3.x from Spectrum with high availability
  • Kerberos-enabled HDFS connectivity through Windows
  • Support for the Datetime data type in the Parquet file format

See also Configuring HDFS Connection for HA Cluster and Best Practices for connecting to HDFS 3.x and Hive 2.1.1.

Related task:

Connecting to Hadoop: To use the Write to Hive File stage, you must first create a connection to the Hadoop file server. Once you do, the name under which you save the connection is displayed as the server name.

File Properties tab

Table 1. Common File Properties

Server name: Indicates that the file selected in the File name field is located on the Hadoop system. Once you select a file located on a Hadoop system, the Server name reflects the name of that file server, as specified in Spectrum Management Console.

File name: Click the ellipsis button (...) to browse to the output Hive file to be created on the defined Hadoop file server. The output data of this stage is written to the selected file.
Note: You must create a connection to the Hadoop file server in Spectrum Management Console before using it in this stage.

File type: Select one of these supported Hive file formats:
  • ORC
  • Parquet
  • Avro

Table 2. ORC File Properties

Buffer size: Defines the buffer size, in kilobytes, to be allocated while writing to an ORC file.
Note: The default buffer size is 256 KB.

Stripe size: Defines the size, in megabytes, of the stripes to be created while writing to an ORC file.
Note: The default stripe size is 64 MB.

Row index stride: Defines the number of rows to be written between two consecutive row index entries.
Note: The default row index stride is 10000 rows.

Compression type: Defines the compression type to be used while writing to an ORC file. The available compression types are ZLIB and SNAPPY.
Note: The default compression type is ZLIB.

Padding: Indicates whether stripes are padded to minimize stripes that cross HDFS block boundaries while writing to an ORC file.
Note: By default, the Padding checkbox is selected.

Preview: The first 50 records of the written file are fetched and displayed in the Preview grid, after the dataflow has been run at least once and the data has been written to the selected file.
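
The stage configures these options through its UI; for reference, they correspond closely to writer options in the Apache ORC Java library (orc-core). The sketch below is a minimal illustration of the defaults above, not Spectrum's actual implementation; the schema and output path are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.CompressionKind;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcDefaultsSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical schema and output path, for illustration only.
            TypeDescription schema =
                    TypeDescription.fromString("struct<id:int,name:string>");

            OrcFile.WriterOptions options = OrcFile.writerOptions(new Configuration())
                    .setSchema(schema)
                    .bufferSize(256 * 1024)          // Buffer size: 256 KB (stage default)
                    .stripeSize(64L * 1024 * 1024)   // Stripe size: 64 MB (stage default)
                    .rowIndexStride(10000)           // Row index stride: 10000 rows (stage default)
                    .compress(CompressionKind.ZLIB)  // Compression type: ZLIB (stage default)
                    .blockPadding(true);             // Padding: pad stripes at HDFS block boundaries

            Writer writer = OrcFile.createWriter(new Path("/tmp/output.orc"), options);
            writer.close(); // A real writer would add VectorizedRowBatch rows before closing.
        }
    }
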
Table 3. Parquet File Properties

Compression type: Defines the compression type to be used while writing to a Parquet file. The available compression types are UNCOMPRESSED, GZIP, and SNAPPY.
Note: The default compression type is UNCOMPRESSED.

Block size: Defines the size, in megabytes, of the blocks to be created while writing to a Parquet file.
Note: The default block size is 128 MB.

Page size: Defines the page size, in kilobytes. Pages are the unit of compression; when reading, each page can be decompressed independently.
Note: The default page size is 1024 KB.

Enable dictionary: Enables or disables dictionary encoding.
Attention: The Dictionary Page size field is enabled only when dictionary encoding is enabled.
Note: The default is true.

Dictionary Page size: When dictionary encoding is used, there is one dictionary page per column per row group. The dictionary page size, specified in kilobytes, works like the page size.
Note: The default dictionary page size is 1024 KB.

Writer version: Parquet supports two writer API versions: PARQUET_1_0 and PARQUET_2_0.
Note: The default is PARQUET_1_0.

Preview: The first 50 records of the written file are fetched and displayed in the Preview grid, after the dataflow has been run at least once and the data has been written to the selected file.
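
The stage exposes these settings in its UI; for reference, they correspond to builder options on the writer in the Apache parquet-mr library. The following minimal sketch, an illustration rather than Spectrum's actual implementation, applies the defaults above through the parquet-avro AvroParquetWriter; the record schema and output path are hypothetical.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetDefaultsSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical record schema, for illustration only.
            Schema schema = SchemaBuilder.record("Row").fields()
                    .requiredInt("id")
                    .requiredString("name")
                    .endRecord();

            ParquetWriter<GenericRecord> writer =
                    AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/output.parquet"))
                            .withSchema(schema)
                            .withCompressionCodec(CompressionCodecName.UNCOMPRESSED) // stage default
                            .withRowGroupSize(128 * 1024 * 1024)  // Block size: 128 MB (stage default)
                            .withPageSize(1024 * 1024)            // Page size: 1024 KB (stage default)
                            .withDictionaryEncoding(true)         // Enable dictionary: true (stage default)
                            .withDictionaryPageSize(1024 * 1024)  // Dictionary page size: 1024 KB (stage default)
                            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0) // stage default
                            .build();
            writer.close(); // A real writer would write GenericRecord rows before closing.
        }
    }
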
Table 4. Avro File Properties

Sync Interval (in Bytes): Specifies the approximate number of uncompressed bytes to be written in each block. Valid values range from 32 to 2^30; keeping the sync interval between 2 KB and 2 MB is suggested.
Note: The default sync interval is 16000 bytes.

Compression: Defines the compression type to be used while writing to an Avro file. The available compression types are NONE, SNAPPY, and DEFLATE. Choosing DEFLATE gives you the additional option of selecting a compression level (described below).
Note: The default compression type is NONE.

Compression level: This field is displayed only if you select DEFLATE in the Compression field. It accepts values from 0 to 9, where 0 denotes no compression; compression increases from level 1 to level 9, along with the time taken to compress the data.
Note: The default compression level is 1.

Preview: The first 50 records of the written file are fetched and displayed in this grid, after the dataflow has been run at least once and the data has been written to the selected file.
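
These settings correspond to options on the DataFileWriter in the Apache Avro Java library. The sketch below is a minimal illustration, not Spectrum's actual implementation; the record schema and output path are hypothetical, and it sets DEFLATE at level 1 to show the compression-level option (the stage default is NONE, which corresponds to CodecFactory.nullCodec()).

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroDefaultsSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical record schema, for illustration only.
            Schema schema = SchemaBuilder.record("Row").fields()
                    .requiredInt("id")
                    .requiredString("name")
                    .endRecord();

            DataFileWriter<GenericRecord> writer =
                    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
            writer.setSyncInterval(16000);                 // Sync interval: 16000 bytes (stage default)
            writer.setCodec(CodecFactory.deflateCodec(1)); // DEFLATE at compression level 1
            // CodecFactory.nullCodec() and CodecFactory.snappyCodec() map to NONE and SNAPPY.

            writer.create(schema, new File("/tmp/output.avro"));
            writer.close(); // A real writer would append GenericRecord rows before closing.
        }
    }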

Fields tab

The Fields tab defines the names and types of the fields present in the input to this stage and lets you select the fields to be written to the output file.

For more information, see Defining Fields for Writing to Hive File.

Runtime tab

The Runtime tab provides the option to Overwrite an existing file of the same name on the configured Hadoop file server. If you check the Overwrite checkbox, then when the dataflow runs, the new output Hive file overwrites any existing file of the same name on the same Hadoop file server.

By default, the Overwrite checkbox is unchecked.
Note: If you do not select Overwrite and the file to be written has the same name as an existing file on the same Hadoop file server, an exception is thrown when the dataflow runs.