Output

The Read from Documents stage has two output ports. The first port carries the data that the stage read and returned based on the criteria you entered. This data can include plain text and metadata, such as author, language, and date created. You can connect this port to any stage that reads incoming data, such as Write to File or Write to XML, as well as to primary stages such as Validate Address or Write to Search Index. You can also connect it to the Information Extractor stage if you want to return information about certain entity types in the document. When you select the Document extraction type, the output contains flat data; when you select the Page, Selection, or Bookmarks extraction type, the output contains hierarchical data.

The other port collects any records that the dataflow did not process correctly. This is called the Error Port, and records that pass through this port into the sink are considered malformed. Capturing malformed records can help you identify the problem with those records. When you attach a sink to the Error Port, the resulting output file will contain all the fields from the malformed records. It will also contain a Reason field that specifies why the record failed.
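As a rough illustration of how the Reason field can help with troubleshooting, the following Python sketch counts how often each reason appears in an error output file. It assumes the sink attached to the Error Port writes a comma-delimited file with a header row; the helper name count_failure_reasons, the sample file names, and the sample reason strings are hypothetical and are not messages produced by the stage.

```python
import csv
import io
from collections import Counter


def count_failure_reasons(lines, delimiter=","):
    """Count how often each Reason value appears in error-port output.

    `lines` is any iterable of text lines (an open file or a StringIO).
    Assumes the first line is a header row that includes a Reason column.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    return Counter(row.get("Reason") or "(no reason given)" for row in reader)


# Hypothetical sample data; real output carries all fields of the failed
# records plus the Reason field, and the reason text will differ.
sample = io.StringIO(
    "ResourceName,Reason\n"
    "a.pdf,Example reason A\n"
    "b.pdf,Example reason B\n"
    "c.pdf,Example reason A\n"
)

reason_counts = count_failure_reasons(sample)
for reason, count in reason_counts.most_common():
    print(f"{count}\t{reason}")
```

To run this against an actual output file, pass an open file object instead of the in-memory sample and set the delimiter to match the sink's configuration.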

Table 1. Unstructured Reader Output

Field Name

Description / Valid Values

Author

Typically contains the name of the person who created or updated the document. This information is part of the document's metadata.

Bookmark

Contains all the bookmarks from the PDF input file. For Bookmarks extraction types only.

BookmarkNo

Indicates the number of the bookmark within the PDF input file. For Bookmarks extraction types only.

ContentLength

Indicates the length of the document. This value varies depending on the extraction type selected:

Document: The number of pages in the document.
Page: "1", to represent the single page of content.

Contents

Varies based on extraction type. For example, Document extraction types will output the entire document as flat data. Page, Selection, and Bookmarks extraction types will output hierarchical data.

ContentType

Indicates the type of document that was read, such as PDF or .txt.

Creator

Typically contains the name of the person who created the document. This information is part of the document's metadata.

Date

Indicates the date the document was created or last updated.

Keywords

Contains any keywords that were provided in the document's metadata.

Language

Indicates the language in which the document was written.

NPages

Indicates the number of pages in the document.

PageContents

Contains the contents of the selected page(s). For Page extraction types only.

PageNo

Contains the page number for the bookmark. For Page extraction types only.

Parent

Contains the path of the bookmark, similar to XPath of an XML file. For Bookmarks extraction types only.

ResourceName

Indicates the file name of the document.

SectionContents

Contains the contents of the selected section. For Selection extraction types only.

SectionNo

Indicates the number of that section within the document. For Selection extraction types only.

Subject

Contains the subject of the document that was provided in the document's metadata.

Title

Contains the title of the document that was provided in the document's metadata.
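
To make the difference between flat and hierarchical output more concrete, the following Python sketch models one record of each kind using field names from Table 1. The exact nesting that the stage produces is not specified here, so placing the per-page entries under Contents is an assumption, as are all of the sample values and the helper name flatten_pages.

```python
# Hypothetical record shapes. Field names come from Table 1; the values
# and the nesting under "Contents" are assumptions for illustration only.
flat_record = {
    "ResourceName": "report.pdf",
    "ContentType": "PDF",
    "NPages": 2,
    # Document extraction type: the whole document as one flat value.
    "Contents": "Text of page one. Text of page two.",
}

hierarchical_record = {
    "ResourceName": "report.pdf",
    "ContentType": "PDF",
    "NPages": 2,
    # Page extraction type (assumed shape): one child entry per page.
    "Contents": [
        {"PageNo": 1, "PageContents": "Text of page one."},
        {"PageNo": 2, "PageContents": "Text of page two."},
    ],
}


def flatten_pages(record):
    """Yield one flat row per page, repeating the document-level fields."""
    parent = {key: value for key, value in record.items() if key != "Contents"}
    for page in record["Contents"]:
        yield {**parent, **page}


rows = list(flatten_pages(hierarchical_record))
for row in rows:
    print(row["ResourceName"], row["PageNo"], row["PageContents"])
```

A helper like this is one way to turn hierarchical records into rows when a downstream consumer expects flat data.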