Write to Hive File
You can select any of these supported Hive file formats for the output file: ORC, Parquet, and Avro.
This stage supports:
- Connectivity to HDFS and Hive from Spectrum on Windows
- Support for and connectivity to Hadoop 3.x from Spectrum, with high availability
- Kerberos-enabled HDFS connectivity through Windows
- Support for the Datetime data type in the Parquet file format
Also see Configuring HDFS Connection for HA Cluster and Best Practices for connecting to HDFS 3.x and Hive 2.1.1.
Connecting to Hadoop: To use the Write to Hive File stage, you need to create a connection to the Hadoop file server. Once you do, the name under which you saved the connection is displayed as the server name.
File Properties tab
Fields | Description |
---|---|
Server name | Indicates that the file selected in the File name field is located on the Hadoop system. Once you select a file located on a Hadoop system, the Server name reflects the name of the respective file server, as specified in Spectrum Management Console. |
File name | Click the ellipses button (...) to browse to the output Hive file to be created on the defined Hadoop file server. The output data of this stage is written to the selected file. Note: You need to create a connection to the Hadoop file server in Spectrum Management Console before using it in this stage. |
File type | Select one of these supported Hive file formats: ORC, Parquet, or Avro. |
For the ORC file type:

Fields | Description |
---|---|
Buffer size | Defines the buffer size, in kilobytes, allocated while writing to an ORC file. Note: The default buffer size is 256 KB. |
Stripe size | Defines the size, in megabytes, of the stripes created while writing to an ORC file. Note: The default stripe size is 64 MB. |
Row index stride | Defines the number of rows written between two consecutive row index entries. Note: The default row index stride is 10000 rows. |
Compression type | Defines the compression type used while writing to an ORC file. The available compression types are ZLIB and SNAPPY. Note: The default compression type is ZLIB. |
Padding | Indicates whether stripes are padded to minimize stripes that cross HDFS block boundaries while writing to an ORC file. Note: By default, the Padding checkbox is selected. |
Preview | The first 50 records of the written file are fetched and displayed in the Preview grid after the dataflow is run at least once and the data has been written to the selected file. |
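
For context, these options map directly onto standard ORC writer settings. Below is a minimal sketch using the Apache ORC Java API, assuming a hypothetical schema and output path; it illustrates where each option applies, not how Spectrum implements the stage.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcOptionsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema and output path.
        TypeDescription schema =
                TypeDescription.fromString("struct<name:string,age:int>");
        Writer writer = OrcFile.createWriter(
                new Path("/tmp/output.orc"),
                OrcFile.writerOptions(new Configuration())
                        .setSchema(schema)
                        .bufferSize(256 * 1024)          // Buffer size in bytes: 256 KB default
                        .stripeSize(64L * 1024 * 1024)   // Stripe size in bytes: 64 MB default
                        .rowIndexStride(10000)           // Row index stride: 10000 rows default
                        .compress(CompressionKind.ZLIB)  // Compression type: ZLIB or SNAPPY
                        .blockPadding(true));            // Padding: pad stripes to HDFS block boundaries
        writer.close(); // produces an empty but valid ORC file
    }
}
```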
For the Parquet file type:

Fields | Description |
---|---|
Compression type | Defines the compression type used while writing to a Parquet file. The available compression types are UNCOMPRESSED, GZIP, and SNAPPY. Note: The default compression type is UNCOMPRESSED. |
Block size | Defines the size, in megabytes, of the blocks (row groups) created while writing to a Parquet file. Note: The default block size is 128 MB. |
Page size | Defines the page size, in kilobytes, used for compression. When reading, each page can be decompressed independently. Note: The default page size is 1024 KB. |
Enable dictionary | Enables or disables dictionary encoding. Attention: Dictionary encoding must be enabled for the Dictionary Page size field to be enabled. Note: The default is true. |
Dictionary Page size | When dictionary encoding is used, there is one dictionary page per column per row group. The dictionary page size works like the page size and is specified in kilobytes. Note: The default dictionary page size is 1024 KB. |
Writer version | Parquet supports two writer API versions: PARQUET_1_0 and PARQUET_2_0. Note: The default is PARQUET_1_0. |
Preview | The first 50 records of the written file are fetched and displayed in the Preview grid after the dataflow is run at least once and the data has been written to the selected file. |
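
These fields likewise mirror standard Parquet writer settings. A minimal sketch using the parquet-avro builder API follows, with a hypothetical schema, path, and values; it is illustrative only, not Spectrum's implementation.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetOptionsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Avro schema describing the records to write.
        Schema schema = SchemaBuilder.record("Row").fields()
                .requiredString("name")
                .requiredInt("age")
                .endRecord();
        try (ParquetWriter<GenericRecord> writer =
                AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/output.parquet"))
                        .withSchema(schema)
                        .withCompressionCodec(CompressionCodecName.UNCOMPRESSED) // or GZIP, SNAPPY
                        .withRowGroupSize(128 * 1024 * 1024)  // Block (row group) size: 128 MB default
                        .withPageSize(1024 * 1024)            // Page size: 1024 KB default
                        .withDictionaryEncoding(true)         // Enable dictionary: true by default
                        .withDictionaryPageSize(1024 * 1024)  // Dictionary page size: 1024 KB default
                        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0)
                        .build()) {
            // Records would be written here with writer.write(record).
        }
    }
}
```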
For the Avro file type:

Fields | Description |
---|---|
Sync Interval (in Bytes) | Specifies the approximate number of uncompressed bytes written in each block. Valid values range from 32 to 2^30, but keeping the sync interval between 2K and 2M is recommended. Note: The default sync interval is 16000. |
Compression | Defines the compression type used while writing to an Avro file. The available compression types are NONE, SNAPPY, and DEFLATE. Choosing DEFLATE gives you the additional option of selecting a compression level (described below). Note: The default compression type is NONE. |
Compression level | This field is displayed if you select DEFLATE in the Compression field. Valid values range from 0 to 9, where 0 denotes no compression. Compression increases from level 1 to 9, along with the time taken to compress the data. Note: The default compression level is 1. |
Preview | The first 50 records of the written file are fetched and displayed in this grid after the dataflow is run at least once and the data is written to the selected file. |
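
These fields correspond to standard Avro container-file settings. A minimal sketch using the Avro Java API follows, with a hypothetical schema, path, and values; it is illustrative only.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroOptionsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema for the output records.
        Schema schema = SchemaBuilder.record("Row").fields()
                .requiredString("name")
                .requiredInt("age")
                .endRecord();
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setSyncInterval(16000);                 // Sync interval: 16000 bytes default
            writer.setCodec(CodecFactory.deflateCodec(1)); // DEFLATE level 1; nullCodec() = NONE, snappyCodec() = SNAPPY
            writer.create(schema, new File("/tmp/output.avro"));

            GenericRecord record = new GenericData.Record(schema);
            record.put("name", "example");
            record.put("age", 42);
            writer.append(record);
        }
    }
}
```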
Fields tab
The Fields tab defines the names and types of the fields present in the source file of this stage, from which you select the fields to be written to the output file.
For more information, see Defining Fields for Writing to Hive File.
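
As a rough illustration of how field names pair with types in a Hive file schema, here is a sketch using the ORC TypeDescription API; the field names and types are hypothetical.

```java
import org.apache.orc.TypeDescription;

public class FieldsSketch {
    public static void main(String[] args) {
        // Hypothetical fields: each output column has a name and a type,
        // including a timestamp field (the Datetime data type is supported
        // in the Parquet file format).
        TypeDescription schema = TypeDescription.fromString(
                "struct<Name:string,Salary:double,JoinDate:timestamp>");
        System.out.println(schema); // prints the struct definition
    }
}
```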
Runtime tab
The Runtime tab provides the option to overwrite an existing file of the same name on the configured Hadoop file server. If you check the Overwrite checkbox, then when the dataflow runs, the new output Hive file overwrites any existing file of the same name on the same Hadoop file server.
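
This behavior parallels the overwrite flag on HDFS file creation. A minimal sketch with the Hadoop FileSystem API follows, using a hypothetical path; it illustrates the semantics, not how Spectrum implements the option.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OverwriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // With overwrite = true, an existing file of the same name is replaced;
        // with overwrite = false, create() fails if the file already exists.
        try (FSDataOutputStream out =
                fs.create(new Path("/tmp/output.orc"), /* overwrite */ true)) {
            out.writeUTF("example");
        }
    }
}
```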