Defining Fields for Reading from Hive File
In the Fields tab of the Read from Hive File stage, the schema names, datatypes, positions, and the given names of the fields in the file are listed.
- Click Regenerate.
For ORC, Avro, and Parquet files, this generates the schema based on the metadata of the existing file.
The grid displays the columns Name, Type, Stage Field, and Include.
The Name column displays the field name, as derived from the header record of the file.
The Type column lists the data type of each field in the file.
The stage supports these data types:
- boolean
- A logical type with two values: true and false.
- date
- A data type that contains a month, day, and year. For example, 2012-01-30 or January 30, 2012. You can specify a default date format in Spectrum Management Console.
- datetime
- A data type that contains a month, day, year, and hours, minutes, and seconds. For example, 2012/01/30 6:15:00 PM.
Note: The datetime datatype in Spectrum maps to the timestamp datatype of Hive files.
- double
- A numeric data type that contains both negative and positive double precision numbers between 2^-1074 and (2 - 2^-52) × 2^1023. In E notation, the range of values is -1.79769313486232E+308 to 1.79769313486232E+308.
- bigdecimal
- A numeric data type that supports 38 decimal points of precision.
Use this data type for data that will be used in mathematical
calculations requiring a high degree of precision, especially those
involving financial data. The bigdecimal data type supports more
precise calculations than the double data type.
Note: For Avro and Parquet Hive files, fields of the decimal datatype in the input file are converted to the bigdecimal datatype.
- long
- A numeric data type that contains both negative and positive whole numbers between -2^63 (-9,223,372,036,854,775,808) and 2^63 - 1 (9,223,372,036,854,775,807).
Note: The long datatype in Spectrum maps to the bigint datatype of Hive files.
- integer
- A numeric data type that contains both negative and positive whole numbers between -2^31 (-2,147,483,648) and 2^31 - 1 (2,147,483,647).
- float
- A numeric data type that contains both negative and positive single precision numbers between 2^-149 and (2 - 2^-23) × 2^127. In E notation, the range of values is -3.402823E+38 to 3.402823E+38.
- string
- A sequence of characters.
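The numeric ranges in the list above can be checked with a short script. The following is a minimal Python sketch, not part of Spectrum: it recomputes the stated bounds from their power-of-two definitions and uses Python's decimal module as a stand-in for bigdecimal to illustrate why a decimal type is preferred over double for financial arithmetic. All names in it (for example round_trips_as_float32) are hypothetical and chosen for illustration only.

```python
import struct
import sys
from decimal import Context, Decimal

# Whole-number ranges as stated for the long and integer data types.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1
INT_MIN, INT_MAX = -2**31, 2**31 - 1

# Double precision: smallest positive value and largest finite value.
DOUBLE_MIN_POSITIVE = 2.0**-1074
DOUBLE_MAX = (2 - 2**-52) * 2.0**1023

# Single precision: the same two bounds for the float data type.
FLOAT_MIN_POSITIVE = 2.0**-149
FLOAT_MAX = (2 - 2**-23) * 2.0**127


def round_trips_as_float32(value):
    """Return True if value is exactly representable in single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0] == value


# Illustration only: a 38-digit decimal context standing in for bigdecimal.
ctx = Context(prec=38)
decimal_sum = ctx.add(Decimal("0.1"), Decimal("0.2"))  # exact decimal arithmetic
float_sum = 0.1 + 0.2                                  # binary floating point

if __name__ == "__main__":
    print("long range:   ", LONG_MIN, "to", LONG_MAX)
    print("integer range:", INT_MIN, "to", INT_MAX)
    print("double max matches platform max:", DOUBLE_MAX == sys.float_info.max)
    print("float max fits single precision:", round_trips_as_float32(FLOAT_MAX))
    print("decimal 0.1 + 0.2 == 0.3:", decimal_sum == Decimal("0.3"))
    print("double  0.1 + 0.2 == 0.3:", float_sum == 0.3)
```

The last two lines show the practical difference: the decimal sum compares equal to 0.3, while the double sum does not, because 0.1 and 0.2 have no exact binary representation.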
- In the Stage Field column, edit the existing field name to the desired name for each field. By default, this column displays the field names read from the file.
- In the Include column, select the checkboxes for the fields you wish to include in the output of the stage. By default, all the fields are selected in this column.
- Click OK.
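To make the effect of the grid concrete, the sketch below models one grid row per field and applies the Stage Field and Include settings to a single record. It is a conceptual illustration in Python under stated assumptions, not Spectrum's implementation or API: FieldRow, project, and HIVE_TO_STAGE_TYPE are hypothetical names, the sample field names and values are invented, and passing unlisted type names through unchanged is an assumption. Only the three type mappings given in the notes above are encoded.

```python
from dataclasses import dataclass

# Type mappings stated in the notes: timestamp -> datetime, bigint -> long,
# and (for Avro and Parquet files) decimal -> bigdecimal.
# Assumption: any other type name is passed through unchanged.
HIVE_TO_STAGE_TYPE = {
    "timestamp": "datetime",
    "bigint": "long",
    "decimal": "bigdecimal",
}


@dataclass
class FieldRow:
    """One row of the Fields grid (hypothetical model)."""
    name: str                # Name column: field name read from the file
    hive_type: str           # type recorded in the file's schema
    stage_field: str = ""    # Stage Field column; defaults to the file's name
    include: bool = True     # Include column; selected by default

    def __post_init__(self):
        if not self.stage_field:
            self.stage_field = self.name

    @property
    def stage_type(self):
        """Type column: the data type as shown in the stage."""
        return HIVE_TO_STAGE_TYPE.get(self.hive_type, self.hive_type)


def project(record, rows):
    """Rename fields to their Stage Field names and drop excluded fields."""
    return {row.stage_field: record[row.name] for row in rows if row.include}


# Invented sample data for illustration.
rows = [
    FieldRow("cust_id", "bigint", stage_field="CustomerID"),
    FieldRow("created", "timestamp"),
    FieldRow("notes", "string", include=False),
]
record = {"cust_id": 42, "created": "2012-01-30 18:15:00", "notes": "n/a"}
output = project(record, rows)

if __name__ == "__main__":
    for row in rows:
        print(row.name, row.stage_type, row.stage_field, row.include)
    print(output)
```

In this sketch, cust_id is renamed to CustomerID, created keeps its default name, and notes is dropped because its Include checkbox is cleared.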