Load to Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. Hive queries the underlying data source using its own query language, HiveQL.
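As a brief illustration, a HiveQL query reads much like standard SQL; the table and column names below are purely illustrative:

```sql
-- Illustrative HiveQL query (table and column names are examples only)
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
```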

Hive supports these Hadoop file formats:
  • TEXTFILE
  • SEQUENCEFILE
  • ORC
  • PARQUET
  • AVRO
    Note: The AVRO file format is supported in Hive version 0.14 and higher.
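As a sketch of how these formats are used, the storage format for a Hive table is selected with the STORED AS clause in the table DDL; the table and column names here are illustrative:

```sql
-- Create a Hive table stored in the ORC format (illustrative names)
CREATE TABLE sales (
  id INT,
  amount DOUBLE
)
STORED AS ORC;
```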

The Load to Hive activity loads data into a Hive table over a JDBC connection. Using this connection, data is read from a specified Hadoop file and loaded into either an existing table in the selected connection or a newly created table in that connection.
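Conceptually, the load performed over the JDBC connection corresponds to a HiveQL statement such as the following; the HDFS path and table name are illustrative:

```sql
-- Load a Hadoop file into an existing Hive table (illustrative path and name)
LOAD DATA INPATH '/user/hadoop/input/sales.txt'
INTO TABLE sales;
```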

To load data into a new table, you must define the table's schema. Note that Spectrum does not support hierarchical data, even though Hive does.
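Because hierarchical data is not supported, the schema defined for a new table must use flat, primitive column types rather than Hive's complex types (ARRAY, MAP, STRUCT). A minimal sketch, with illustrative names:

```sql
-- Flat schema only: primitive column types, no ARRAY, MAP, or STRUCT
CREATE TABLE customers (
  customer_id INT,
  name STRING,
  signup_date TIMESTAMP
);
```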

Note: The stage supports reading data from and writing data to HDFS 3.x and Hive 2.1.1. This support includes:
  • Connectivity to HDFS and Hive from Spectrum on Windows
  • Support and connectivity to Hadoop 3.x from Spectrum with high availability
  • Kerberos-enabled HDFS connectivity through Windows
  • Support and connectivity to Hive version 2.1.1 from Spectrum with high availability
  • Support of Datetime datatype in the Parquet file format
  • Support for reading from and writing to Hive DB (JDBC) via a Model Store connection

Also see Configuring HDFS Connection for HA Cluster and Best Practices for connecting to HDFS 3.x and Hive 2.1.1.