Read From DB

The Read From DB stage reads data from a database table or view as input to a dataflow. The stage is available for jobs, services, and subflows but not for process flows.
Note: The stage supports reading data from and writing data to HDFS 3.x and Hive 2.1.1. The support includes:
  • Connectivity to Hive from Spectrum on Windows
  • Support for and connectivity to Hive version 2.1.1 from Spectrum with high availability
  • Support for reading from and writing to Hive DB (JDBC) via a Model Store connection

Also see Best Practices for connecting to HDFS 3.x and Hive 2.1.1.

Related task:

To use the Read from DB stage, you must first create a connection to the required database using the Connection Manager of Spectrum Management Console. For details, see Database Connection Manager in Write to DB.

General Tab

Field Name Description

Connection

Select the database connection you want to use. Your choices vary depending on what connections are defined in the Connection Manager of Spectrum Management Console. If you need to make a new database connection, or modify or delete an existing database connection, click Manage Connections.

If you are adding or modifying a database connection, complete these fields:
  • Connection name: Enter a name for the connection. The name can be anything you choose.
  • Database driver: Select the appropriate database type.
  • Connection options: Specify the host, port, instance, user name, and password to use to connect to the database.

SQL

Enter the SQL query that specifies the records to be read from the data source when the dataflow runs. You can type the SQL query manually in this field, or click Build SQL to construct the query using the Visual Query Builder.

The SQL query can include variables instead of actual column names. Using variables allows you to customize the query at runtime. For more information, see Query Variables.

A sample query that reads records from the public.Branch table, filtering on the BranchID and BranchType columns:
select * from "public"."Branch" where
 "BranchID" = #{ID} and "BranchType" = #{Type}
Note: Values for Integer type fields can be entered without quotes, but values for String type fields must be enclosed in single quotes.
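Spectrum substitutes query variables internally at runtime; the following Python sketch (the `substitute` helper is hypothetical, not part of Spectrum) only illustrates the quoting rule above: an Integer value such as #{ID} is inserted bare, while a String value such as #{Type} is wrapped in single quotes.

```python
def substitute(query: str, values: dict) -> str:
    """Replace #{name} placeholders; quote strings, leave integers bare."""
    for name, value in values.items():
        if isinstance(value, str):
            # String values go in single quotes, with embedded quotes doubled
            rendered = "'" + value.replace("'", "''") + "'"
        else:
            # Integer values need no quotes
            rendered = str(value)
        query = query.replace("#{" + name + "}", rendered)
    return query

sql = ('select * from "public"."Branch" where '
       '"BranchID" = #{ID} and "BranchType" = #{Type}')
print(substitute(sql, {"ID": 42, "Type": "Retail"}))
# BranchID is rendered as 42, BranchType as 'Retail'
```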
Build SQL To create a complex query by selecting multiple columns and creating joins and nested queries, click Build SQL. The Visual Query Builder opens. For more information, see Visual Query Builder.
Note: A query created using the Visual Query Builder is displayed with fully qualified names of columns and tables in the SQL field.
Regenerate Fields To see the schema of the data to be fetched by the query, click Regenerate Fields.

If you edit an existing query, click Regenerate Fields to fetch the modified schema.

Note: When you click Regenerate Fields, the entity names in the SQL query are retained; they are not replaced with their fully qualified names.
Preview To see a sample of the records fetched by the SQL query, click Preview.
Note: The Read From DB stage allows you to modify the type of an input field.
Note: The Read from DB stage reads all date data type values as String values. This is the behavior of the jTDS driver, which is the default driver used by Spectrum. To handle date data type values as dates, use Microsoft's JDBC driver.

Runtime Tab

Field name Description
Fetch size

Select this option to specify the number of records to read from the database table at a time. For example, if the Fetch size value is 100 and the total number of records to be read is 1000, it takes 10 trips to the database to read all the records.

Setting an optimum fetch size can improve performance significantly.
Note: You can calculate an optimum fetch size for your environment by testing the execution times between a Read from DB stage and a Write to Null stage. For more information, see Determining an Optimum Fetch Size.
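The round-trip arithmetic in the example above can be sketched as a one-line calculation (the function name is illustrative, not a Spectrum API):

```python
import math

def db_round_trips(total_records: int, fetch_size: int) -> int:
    """Number of trips to the database needed to read all records."""
    return math.ceil(total_records / fetch_size)

print(db_round_trips(1000, 100))  # 10 trips, as in the example above
```

A partial last batch still costs a full trip, which is why the result rounds up rather than down.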
Stage Options
This section lists the dataflow options used in the SQL query of this stage and allows you to provide a default value for each of them. The Name column lists the options; enter the default values in the corresponding Value column.
Note: The default value provided here is also displayed in the Map dataflow options to stages section of the Dataflow Options dialog box, where you can change it. If different default values are provided for an option through Stage Options, the Dataflow Options dialog box, and Job Executor, the order of precedence is: value provided through Job Executor > value defined in the Dataflow Options dialog box > value entered through Stage Options.
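The precedence rule above can be sketched as a simple first-match lookup (a hypothetical illustration, not how Spectrum implements it internally), checking the sources from highest to lowest precedence:

```python
def resolve_option(name, job_executor, dataflow_options, stage_options):
    """Return the effective value for a dataflow option.

    Precedence: Job Executor > Dataflow Options dialog box > Stage Options.
    """
    for source in (job_executor, dataflow_options, stage_options):
        if name in source:
            return source[name]
    return None

# The Job Executor value wins even though lower-precedence defaults exist.
value = resolve_option(
    "ID",
    job_executor={"ID": "3"},       # highest precedence
    dataflow_options={"ID": "2"},
    stage_options={"ID": "1"},      # lowest precedence (default set in this stage)
)
print(value)  # prints 3
```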