Read from Variable Format File

Read from Variable Format File reads data from a file containing records of varying layout. Each record is read in as a list field. You can specify the tag that indicates the parent record type, and all other record types will become list fields under the parent.

Variable format files have these characteristics:

  • Records in the file may have different fields, and different numbers of fields.
  • Each record must contain a tag (usually a number) identifying the type of record.
  • Hierarchical relationships are supported.

Example of a Variable Format File

This example shows a variable format file containing information about checking account activity for two customers, Joe Smith and Anne Johnson. In this example, the file is a delimited file that uses a comma as the field delimiter.

001   Joe,Smith,M,100 Main St,555-234-1290
100   CHK12904567,12/2/2007,6/1/2012,CHK
200   1000567,1/5/2012,Fashion Shoes,323.12
001   Anne,Johnson,F,1202 Lake St,555-222-4932
100   CHK238193875,1/21/2001,4/12/2012,CHK
200   1000232,3/5/2012,Blue Goose Grocery,132.11
200   1000232,3/8/2012,Trailway Bikes,540.00

The first field in each record contains the tag which identifies the type of record and therefore the record's format:

  • 001: Customer record
  • 100: Account record
  • 200: Account transaction record

For delimited files it is common for the tag value (001, 100, 200) to be in a fixed number of bytes at the start of the record as shown in the above example.

Each record has its own format:

  • 001: FirstName,LastName,Gender,Address,PhoneNumber
  • 100: AccountID,DateOpened,ExpirationDate,TypeOfAccount
  • 200: TransactionID,DateOfTransaction,Vendor,Amount

Record format 100 (account record) is a child of the previous 001 record, and record format 200 (account transaction record) is a child of the previous record 100 (account record). So in the example file, Joe Smith's account CHK12904567 had a transaction on 1/5/2012 in the amount of 323.12 at Fashion Shoes. Likewise, Anne Johnson's account CHK238193875 had two transactions, one on 3/5/2012 at Blue Goose Grocery and one on 3/8/2012 at Trailway Bikes.

File Properties Tab

Option Name Description

Server name

Indicates whether the file you select as input is located on the computer running Enterprise Designer or on the Spectrum™ Technology Platform server. If you select a file on the local computer, the server name will be My Computer. If you select a file on the server the server name will be Spectrum™ Technology Platform.

File name

Specifies the path to the file. Click the ellipses button (...) to go to the file you want.

You can read multiple files by using a wild card character to read data from multiple files in the directory. The wild card characters * and ? are supported. For example, you could specify *.csv to read in all files with a .csv extension located in the directory. In order to successfully read multiple files, each file must have the same layout (the same fields in the same positions). Any record that does not match the layout specified on the Fields tab will be treated as a malformed record.

While reading a file from an HDFS file server, the compression formats supported are:
  1. GZIP (.gz)
  2. BZIP2 (.bz2)
Note: The extension of the file indicates the compression format to be used to decompress the file.
Attention: If the Spectrum™ Technology Platform server is running on Unix or Linux, remember that file names and paths on these platforms are case sensitive.

Record type

The format of the records in the file. Select one of:

Line Sequential
A text file in which records are separated by an end-of-line (EOL) character such as a carriage return or line feed (CR or LF) and each field has a fixed starting and ending character position.
Fixed Width
A text file in which each record is a specific number of characters in length and each field has a fixed starting and ending character position.
Delimited
A text file in which records are separated by an end-of-line (EOL) character such as a carriage return or line feed (CR or LF) and each field is separated by a designated character such as a comma.

Character encoding

The text file's encoding. Select one of these:

UTF-8
Supports all Unicode characters and is backwards-compatible with ASCII. For more information about UTF, see unicode.org/faq/utf_bom.html.
UTF-16
Supports all Unicode characters but is not backwards-compatible with ASCII. For more information about UTF, see unicode.org/faq/utf_bom.html.
US-ASCII
A character encoding based on the order of the English alphabet.
UTF-16BE
UTF-16 encoding with big endian byte serialization (most significant byte first).
UTF-16LE
UTF-16 encoding with little endian byte serialization (least significant byte first).
ISO-8859-1
An ASCII character encoding typically used for Western European languages. Also known as Latin-1.
ISO-8859-3
An ASCII character encoding typically used for Southern European languages. Also known as Latin-3.
ISO-8859-9
An ASCII character encoding typically used for Turkish language. Also known as Latin-5.
CP850
An ASCII code page used to write Western European languages.
CP500
An EBCDIC code page used to write Western European languages.
Shift_JIS
A character encoding for the Japanese language.
MS932
A Microsoft's extension of Shift_JIS to include NEC special characters, NEC selection of IBM extensions, and IBM extensions.
CP1047
An EBCDIC code page with the full Latin-1 character set.

Record length

For fixed width files, specifies the exact number of characters in each record.

Field separator

Specifies the character used to separate fields in a delimited file.

For example, this record uses a pipe (|) as a field separator:

7200 13TH ST|MIAMI|FL|33144

These characters available to define as field separators are:

  • Space
  • Tab
  • Comma
  • Period
  • Semicolon
  • Pipe

If the file uses a different character as a field separator, click the ellipses button to select another character as a delimiter.

Tag separator Specifies the character placed after the tag field to demarcate the identifying field for each record in a delimited file. A tag separator must be a single character.

By default, these characters are available to be selected as tag separators:

  • Space
  • Tab
  • Comma
  • Period
  • Semicolon
  • Pipe

If the file uses a different character as a tag separator, click the ellipses button to add and select a custom tag separator.

Note: By default, the Record separator character is the same as the selected Field separator character. To enable this field and select a different character, uncheck the Same as Field separator checkbox.
Same as Field separator Indicates if the tag separator is the same as the field separator. Uncheck this to select a different character as the tag separator.
Note: By default, this checkbox is checked and the Tag separator field is disabled.

Text qualifier

The character used to surround text values in a delimited file.

For example, this record uses double quotes (") as a text qualifier.

"7200 13TH ST"|"MIAMI"|"FL"|"33144"

The characters available to define as text qualifiers are:

  • Single quote (')
  • Double quote (")

If the file uses a different text qualifier, click the ellipses button to select another character as a text qualifier.

Record separator

Specifies the character used to separate records in line a sequential or delimited file. This field is not available if you check the Use default EOL check box.

The record separator settings available are:

Unix (U+000A)
A line feed character separates the records. This is the standard record separator for Unix systems.
Macintosh (U+000D)
A carriage return character separates the records. This is the standard record separator for Macintosh systems.
Windows (U+000D U+000A)
A carriage return followed by a line feed separates the records. This is the standard record separator for Windows systems.

If your file uses a different record separator, click the ellipses button to select another character as a record separator.

Root tag name

The tag to use for records that are a parent of other record types. For example if you have three record types 001, 100, and 200, and record types 100 and 200 are children of record type 001, then 001 is the root tag.

Use fixed-width tags

Specifies whether to allocate a fixed amount of space at the beginning of each record in which to place the record tag. For example, the following shows a file with the tags 001, 100, and 200 in a fixed-width field:

001   Joe,Smith,M,100 Main St,555-234-1290
100   CHK12904567,12/2/2007,6/1/2012,CHK
200   1000567,1/5/2012,Mike's Shoes,323.12

Tag start position

If you check the Use fixed-width tags box, this option specifies the position in each record where the tag begins. For example, if the tag begins in the fourth character in the record, you would specify 4.

Tag width

If you check the Use fixed-width tags box, this option specifies the number of spaces to allocate for tags starting from the position specified in the Tag start position field. For example, if you specify 3 in the Tag start position field and you specify 7 in the Tag width field, then positions 4 though 10 would be considered the record tag. The value you specify must be large enough to include all the characters of the longest tag name.

The value in the Tag width field is automatically increased if you lengthen the tag name in the Root tag name field.

The maximum tag width is 1024.

Use default EOL

Specifies that the file's record separator is the default end of line (EOL) character used on the operating system on which the Spectrum™ Technology Platform server is running.

Do not select this option if the file uses an EOL character that is different from the default EOL character used on the server's operating system. For example, if the file uses a Windows EOL but the server is running on Linux, do not check this option. Instead, select the Windows option in the Record separator field.

Treat records with fewer fields than defined as malformed

If you enable this option, child records that contain fewer fields than a complete record are considered malformed. When a malformed record is encountered, processing advances to the next root tag, ignoring all child tags in between. An exception is written to the log containing information about the malformed child records along with a line number.

Note: Records are always considered malformed in the following situations, regardless of whether you enable this option.
  • The tag is unknown
  • The line is empty
  • There is a tag with no data
  • A record with a tag that is a child of another tag appears immediately after a record with a root tag

Fields Tab

The Fields tab specifies the characteristics of each field read in from the file.

Runtime Tab

Field Name Description

File name

Displays the file name selected in the first tab.

Starting record

If you want to skip records at the beginning of the file when reading records into the dataflow, specify the first record you want to read. For example, if you want to skip the first 50 records, in a file, specify 51. The 51st record will be the first record read into the dataflow.

All records

Select this option if you want to read all records starting from the record specified in the Starting record field to the end of the file.

Max records

Select this option if you want to only read in a certain number of records starting from the record specified in the Starting record field. For example, if you want to read the first 100 records, select this option and enter 100.