Output

The Read from Documents stage has two outgoing ports. One port captures the data that the stage read and returned based on the criteria you entered. This data can include plain text or metadata (such as author, language, and date created). This port can be connected to any stage that reads incoming data, such as Write to File or Write to XML, as well as primary stages such as Validate Address or Write to Search Index. It can also be connected to the Information Extractor stage if you want to return information about certain entity types that are in the document. When you select the Document extraction type, the output will contain flat data; when you select the Page or Selection extraction type, the output will contain hierarchical data.

The other port collects any records that the dataflow did not process correctly. This is called the Error Port, and records that pass through this port into the sink are considered malformed. Capturing malformed records can help you identify the problem with those records. When you attach a sink to the Error Port, the resulting output file will contain all the fields from the malformed records. It will also contain a Reason field that specifies why the record failed.
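
As a minimal sketch of how captured malformed records might be reviewed, the following Python snippet tallies the Reason values in an Error Port output file. It assumes the sink was configured to write a comma-delimited file with a header row. The file layout, the sample rows, and the function name count_failure_reasons are illustrative assumptions, not part of the product.

```python
import csv
import io
from collections import Counter


def count_failure_reasons(lines):
    """Tally the Reason field across malformed records.

    `lines` is any iterable of text lines (an open file or a StringIO)
    holding delimited records with a header row that includes "Reason".
    """
    reader = csv.DictReader(lines)
    return Counter(row["Reason"] for row in reader)


# Hypothetical sample of an Error Port output file; real files will
# contain whichever fields the malformed records carried.
sample = io.StringIO(
    "ResourceName,Reason\n"
    "report.pdf,Unable to read file\n"
    "notes.txt,Unsupported encoding\n"
    "scan.pdf,Unable to read file\n"
)

# Print the most frequent failure reasons first.
for reason, count in count_failure_reasons(sample).most_common():
    print(f"{count}  {reason}")
```

Grouping by Reason in this way can show whether failures share one cause or are spread across several.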

Table 1. Unstructured Reader Output
Field Name	Description / Valid Values
Author	Typically contains the name of the person who created or updated the document. This information is part of the document's metadata.
Bookmark	Contains all the bookmarks from the PDF input file. For Bookmarks extraction types only.
BookmarkNo	Indicates the number of the bookmark within the document. For Bookmarks extraction types only.
ContentLength	Indicates the length of the document. This value varies depending on the extraction type selected. Document: the number of pages in the document. Page: "1", to represent the single page of content.
Contents	Varies based on extraction type. For example, Document extraction types will output the entire document as flat data. Page, Selection, and Bookmarks extraction types will output hierarchical data.
ContentType	Indicates the type of document that was read, such as PDF or .txt.
Creator	Typically contains the name of the person who created the document. This information is part of the document's metadata.
Date	Indicates the date the document was created or last updated.
Keywords	Contains any keywords that were provided in the document's metadata.
Language	Indicates the language in which the document was written.
NPages	Indicates the number of pages in the document.
PageContents	Contains the contents of the selected page(s). For Page extraction types only.
PageNo	Contains the page number for the bookmark. For Page extraction types only.
Parent	Contains the path of the bookmark, similar to XPath of an XML file. For Bookmarks extraction types only.
ResourceName	Indicates the file name of the document.
SectionContents	Contains the contents of the selected section. For Selection extraction types only.
SectionNo	Indicates the number of that section within the document. For Selection extraction types only.
Subject	Contains the subject of the document that was provided in the document's metadata.
Title	Contains the title of the document that was provided in the document's metadata.
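
To make the difference between flat and hierarchical output more concrete, the sketch below models the two shapes as plain Python dictionaries using field names from Table 1. The nesting shown for the Page extraction type, the sample values, and the helper flatten_pages are assumptions for illustration only; the actual structure the stage emits may differ.

```python
# Hypothetical flat record, as the Document extraction type might produce:
# every field holds a single value.
document_record = {
    "ResourceName": "report.pdf",
    "NPages": 2,
    "Contents": "Full text of page one. Full text of page two.",
}

# Hypothetical hierarchical record, as the Page extraction type might
# produce: Contents holds a list of child records, one per page.
page_record = {
    "ResourceName": "report.pdf",
    "NPages": 2,
    "Contents": [
        {"PageNo": 1, "PageContents": "Full text of page one."},
        {"PageNo": 2, "PageContents": "Full text of page two."},
    ],
}


def flatten_pages(record):
    """Turn one hierarchical record into one flat row per page."""
    # Copy the document-level fields, leaving out the nested list.
    parent = {k: v for k, v in record.items() if k != "Contents"}
    # Merge the document-level fields into each page's own fields.
    return [{**parent, **page} for page in record["Contents"]]


for row in flatten_pages(page_record):
    print(row["ResourceName"], row["PageNo"], row["PageContents"])
```

A downstream stage that expects flat rows would need a step along these lines before it could consume hierarchical output.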