Write to Hive File

The Write to Hive File stage writes the dataflow input to the specified output Hive file.

You can select any of these supported Hive file formats for the output file: ORC, Parquet, and Avro.

The stage supports reading data from and writing data to HDFS 3.x. This support includes:
  • Connectivity to HDFS and Hive from Spectrum on Windows
  • Support for, and connectivity to, Hadoop 3.x from Spectrum with high availability
  • Kerberos-enabled HDFS connectivity through Windows
  • Support for the Datetime data type in the Parquet file format

See also Configuring HDFS Connection for HA Cluster and Best Practices for connecting to HDFS 3.x and Hive 2.1.1.

Related task:

Connecting to Hadoop: To use the Write to Hive File stage, you must first create a connection to the Hadoop file server. Once you do, the name under which you save the connection is displayed as the server name.

File Properties tab

Table 1. Common File Properties

Server name: Indicates that the file selected in the File name field is located on the Hadoop system. Once you select a file located on a Hadoop system, the Server name reflects the name of that file server, as specified in Spectrum Management Console.

File name: Click the ellipsis button (...) to browse to the output Hive file to be created on the defined Hadoop file server. The output data of this stage is written to the selected file.
Note: You must create a connection to the Hadoop file server in Spectrum Management Console before using it in this stage.

File type: Select one of these supported Hive file formats:
  • ORC
  • Parquet
  • Avro

Table 2. ORC File Properties

Buffer size: Defines the buffer size, in kilobytes, to be allocated while writing to an ORC file.
Note: The default buffer size is 256 KB.

Stripe size: Defines the size, in megabytes, of the stripes to be created while writing to an ORC file.
Note: The default stripe size is 64 MB.

Row index stride: Defines the number of rows to be written between two consecutive row index entries.
Note: The default row index stride is 10000 rows.

Compression type: Defines the compression type to be used while writing to an ORC file. The available compression types are ZLIB and SNAPPY.
Note: The default compression type is ZLIB.

Padding: Indicates whether stripes are padded to minimize stripes that cross HDFS block boundaries while writing to an ORC file.
Note: By default, the Padding checkbox is selected.

Preview: The first 50 records of the written file are fetched and displayed in the Preview grid, after the dataflow has been run at least once and the data has been written to the selected file.
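
The stage configures these options through its UI; for reference, they correspond closely to writer options in the Apache ORC Java library (orc-core). The sketch below is a minimal illustration of the defaults above, not Spectrum's actual implementation; the schema and output path are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.CompressionKind;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcDefaultsSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical schema and output path, for illustration only.
            TypeDescription schema =
                    TypeDescription.fromString("struct<id:int,name:string>");

            OrcFile.WriterOptions options = OrcFile.writerOptions(new Configuration())
                    .setSchema(schema)
                    .bufferSize(256 * 1024)          // Buffer size: 256 KB (stage default)
                    .stripeSize(64L * 1024 * 1024)   // Stripe size: 64 MB (stage default)
                    .rowIndexStride(10000)           // Row index stride: 10000 rows (stage default)
                    .compress(CompressionKind.ZLIB)  // Compression type: ZLIB (stage default)
                    .blockPadding(true);             // Padding: pad stripes at HDFS block boundaries

            Writer writer = OrcFile.createWriter(new Path("/tmp/output.orc"), options);
            writer.close(); // A real writer would add VectorizedRowBatch rows before closing.
        }
    }
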
Table 3. Parquet File Properties

Compression type: Defines the compression type to be used while writing to a Parquet file. The available compression types are UNCOMPRESSED, GZIP, and SNAPPY.
Note: The default compression type is UNCOMPRESSED.

Block size: Defines the size, in megabytes, of the blocks to be created while writing to a Parquet file.
Note: The default block size is 128 MB.

Page size: Defines the page size, in kilobytes. Pages are the unit of compression; when reading, each page can be decompressed independently.
Note: The default page size is 1024 KB.

Enable dictionary: Enables or disables dictionary encoding.
Attention: The Dictionary Page size field is enabled only when dictionary encoding is enabled.
Note: The default is true.

Dictionary Page size: When dictionary encoding is used, there is one dictionary page per column per row group. The dictionary page size, specified in kilobytes, works like the page size.
Note: The default dictionary page size is 1024 KB.

Writer version: Parquet supports two writer API versions: PARQUET_1_0 and PARQUET_2_0.
Note: The default is PARQUET_1_0.

Preview: The first 50 records of the written file are fetched and displayed in the Preview grid, after the dataflow has been run at least once and the data has been written to the selected file.
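
The stage exposes these settings in its UI; for reference, they correspond to builder options on the writer in the Apache parquet-mr library. The following minimal sketch, an illustration rather than Spectrum's actual implementation, applies the defaults above through the parquet-avro AvroParquetWriter; the record schema and output path are hypothetical.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetDefaultsSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical record schema, for illustration only.
            Schema schema = SchemaBuilder.record("Row").fields()
                    .requiredInt("id")
                    .requiredString("name")
                    .endRecord();

            ParquetWriter<GenericRecord> writer =
                    AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/output.parquet"))
                            .withSchema(schema)
                            .withCompressionCodec(CompressionCodecName.UNCOMPRESSED) // stage default
                            .withRowGroupSize(128 * 1024 * 1024)  // Block size: 128 MB (stage default)
                            .withPageSize(1024 * 1024)            // Page size: 1024 KB (stage default)
                            .withDictionaryEncoding(true)         // Enable dictionary: true (stage default)
                            .withDictionaryPageSize(1024 * 1024)  // Dictionary page size: 1024 KB (stage default)
                            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0) // stage default
                            .build();
            writer.close(); // A real writer would write GenericRecord rows before closing.
        }
    }
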
Table 4. Avro File Properties

Sync Interval (in Bytes): Specifies the approximate number of uncompressed bytes to be written in each block. Valid values range from 32 to 2^30; keeping the sync interval between 2 KB and 2 MB is suggested.
Note: The default sync interval is 16000 bytes.

Compression: Defines the compression type to be used while writing to an Avro file. The available compression types are NONE, SNAPPY, and DEFLATE. Choosing DEFLATE gives you the additional option of selecting a compression level (described below).
Note: The default compression type is NONE.

Compression level: This field is displayed only if you select DEFLATE in the Compression field. It accepts values from 0 to 9, where 0 denotes no compression; compression increases from level 1 to level 9, along with the time taken to compress the data.
Note: The default compression level is 1.

Preview: The first 50 records of the written file are fetched and displayed in this grid, after the dataflow has been run at least once and the data has been written to the selected file.
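
These settings correspond to options on the DataFileWriter in the Apache Avro Java library. The sketch below is a minimal illustration, not Spectrum's actual implementation; the record schema and output path are hypothetical, and it sets DEFLATE at level 1 to show the compression-level option (the stage default is NONE, which corresponds to CodecFactory.nullCodec()).

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroDefaultsSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical record schema, for illustration only.
            Schema schema = SchemaBuilder.record("Row").fields()
                    .requiredInt("id")
                    .requiredString("name")
                    .endRecord();

            DataFileWriter<GenericRecord> writer =
                    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
            writer.setSyncInterval(16000);                 // Sync interval: 16000 bytes (stage default)
            writer.setCodec(CodecFactory.deflateCodec(1)); // DEFLATE at compression level 1
            // CodecFactory.nullCodec() and CodecFactory.snappyCodec() map to NONE and SNAPPY.

            writer.create(schema, new File("/tmp/output.avro"));
            writer.close(); // A real writer would append GenericRecord rows before closing.
        }
    }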

Fields tab

The Fields tab defines the names and types of the fields present in the input to this stage and lets you select the fields to be written to the output file.

For more information, see Defining Fields for Writing to Hive File.

Runtime tab

The Runtime tab provides the option to Overwrite an existing file of the same name on the configured Hadoop file server. If you check the Overwrite checkbox, then when the dataflow runs, the new output Hive file overwrites any existing file of the same name on the same Hadoop file server.

By default, the Overwrite checkbox is unchecked.
Note: If you do not select Overwrite and the file to be written has the same name as an existing file on the same Hadoop file server, an exception is thrown when the dataflow runs.