Output
The Read from Documents stage has two outgoing ports. The first port carries the data that the stage read and returned based on the criteria you entered. This data can include plain text or metadata, such as author, language, and date created. You can connect this port to any stage that reads incoming data, such as Write to File or Write to XML, as well as to primary stages such as Validate Address or Write to Search Index. You can also connect it to the Information Extractor stage if you want to return information about certain entity types in the document. When you select the Document extraction type, the output contains flat data; when you select the Page, Selection, or Bookmarks extraction type, the output contains hierarchical data.
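To make the difference between flat and hierarchical output concrete, the following Python sketch models one record of each shape as plain dictionaries. The field names come from the table below. The sample values are made up, and the nesting of per-page rows under `Contents` is an assumption for illustration, not the product's documented layout. The helper names `is_hierarchical` and `flatten_pages` are hypothetical.

```python
# Illustrative record shapes only; every value here is invented.
flat_record = {
    "ResourceName": "report.pdf",
    "ContentType": "PDF",
    "NPages": 2,
    "Contents": "Full text of the document...",
}

# Assumption: with the Page extraction type, per-page rows nest under Contents.
hierarchical_record = {
    "ResourceName": "report.pdf",
    "ContentType": "PDF",
    "NPages": 2,
    "Contents": [
        {"PageNo": 1, "PageContents": "Text of page 1..."},
        {"PageNo": 2, "PageContents": "Text of page 2..."},
    ],
}


def is_hierarchical(record: dict) -> bool:
    """Return True when Contents holds nested child rows instead of one string."""
    return isinstance(record.get("Contents"), list)


def flatten_pages(record: dict) -> list[dict]:
    """Copy the parent fields onto each child row so each page becomes one flat row."""
    if not is_hierarchical(record):
        return [record]
    parent = {k: v for k, v in record.items() if k != "Contents"}
    return [{**parent, **child} for child in record["Contents"]]
```

Calling `flatten_pages(hierarchical_record)` yields one dictionary per page, each carrying the document-level fields alongside `PageNo` and `PageContents`. A flat record passes through unchanged.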
The other port collects any records that the dataflow did not process correctly. This is called the Error Port, and records that pass through this port into the sink are considered malformed. Capturing malformed records can help you identify the problem with those records. When you attach a sink to the Error Port, the resulting output file will contain all the fields from the malformed records. It will also contain a Reason field that specifies why the record failed.
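As a rough illustration of how the Reason field can help with triage, the sketch below tallies the reasons found in an error output file. It assumes the sink attached to the Error Port writes a comma-delimited file with a header row that includes `Reason`. The function name `count_failure_reasons` and the sample rows are hypothetical, and actual Reason text will differ.

```python
import csv
import io
from collections import Counter


def count_failure_reasons(lines) -> Counter:
    """Tally the Reason column of an error-port output file.

    Assumes comma-delimited input with a header row containing 'Reason'.
    Accepts any iterable of text lines, such as an open file object.
    """
    reader = csv.DictReader(lines)
    return Counter(row["Reason"] for row in reader)


# Hypothetical sample standing in for a real error output file.
sample = io.StringIO(
    "ResourceName,Reason\n"
    "a.pdf,Example reason A\n"
    "b.pdf,Example reason B\n"
    "c.pdf,Example reason A\n"
)

reasons = count_failure_reasons(sample)
for reason, count in reasons.most_common():
    print(f"{count}  {reason}")
```

To use it on a real file, pass an open file object, for example `count_failure_reasons(open(path, newline=""))`, where `path` points to the file your sink wrote.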
Field Name | Description / Valid Values |
---|---|
Author | Typically contains the name of the person who created or updated the document. This information is part of the document's metadata. |
Bookmark | Contains all the bookmarks from the PDF input file. For Bookmarks extraction types only. |
BookmarkNo | Contains all the bookmarks from the PDF input file. For Bookmarks extraction types only. |
ContentLength | Indicates the length of the document. This value varies depending on the extraction type selected. |
Contents | Varies based on extraction type. For example, Document extraction types output the entire document as flat data. Page, Selection, and Bookmarks extraction types output hierarchical data. |
ContentType | Indicates the type of document that was read, such as PDF or .txt. |
Creator | Typically contains the name of the person who created the document. This information is part of the document's metadata. |
Date | Indicates the date the document was created or last updated. |
Keywords | Contains any keywords that were provided in the document's metadata. |
Language | Indicates the language in which the document was written. |
NPages | Indicates the number of pages in the document. |
PageContents | Contains the contents of the selected page(s). For Page extraction types only. |
PageNo | Contains the page number for the bookmark. For Page extraction types only. |
Parent | Contains the path of the bookmark, similar to an XPath in an XML file. For Bookmarks extraction types only. |
ResourceName | Indicates the file name of the document. |
SectionContents | Contains the contents of the selected section. For Selection extraction types only. |
SectionNo | Indicates the number of that section within the document. For Selection extraction types only. |
Subject | Contains the subject of the document that was provided in the document's metadata. |
Title | Contains the title of the document that was provided in the document's metadata. |