DataFlux Data Management Studio 2.5: User Guide
The following reference materials are available for the Document Conversion node:
The following table lists the input document formats that the node recognizes and from which it can extract text:
Document Type | Extensions | Description |
---|---|---|
HTML | HTML, HTM | Hypertext Markup Language files used for web documents |
XML | XML, ODF | Extensible Markup Language formatted files, which are used for many different kinds of documents |
Excel | XLS, XLSX | Microsoft Excel Documents |
Word | DOC, DOCX | Microsoft Word Documents |
PowerPoint | PPT | Microsoft PowerPoint Documents |
Publisher | PUB | Microsoft Publisher Documents |
Visio | VSD | Microsoft Visio Documents |
Embedded Objects | | Documents embedded within other documents, such as an Excel document within a Word document |
Open Document Formats | ODT, ODS, ODP, ODG, ODF | Open Document Format (ODF) formatted files and types, such as text (odt), spreadsheets (ods), presentations (odp), and graphics (odg) |
Apple iWorks Formats | PAGES, KEY, NUMBERS | Apple iWorks formats, such as Numbers, Pages, and Keynote, used by Apple's iWorks Office Suite |
Portable Document Format | PDF | Portable Document Format (PDF) documents |
Electronic Publication Format | EPUB | Electronic Publication Format, which is used for many digital books |
Rich Text Format | RTF | Rich Text Format (RTF), which is used for formatted text |
Other formats are supported for metadata extraction only and do not allow text extraction. These include image and audio formats such as JPG, PNG, and WAV.
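As an illustration of how the format list above might be used outside the node, the following Python sketch checks a file name against the listed extensions before the file is sent for conversion. The names `TEXT_EXTRACTION_FORMATS`, `METADATA_ONLY`, and `classify` are hypothetical and are not part of DataFlux Data Management Studio. The `pdf` entry is assumed from the format name, and the metadata-only set holds just the three examples named above.

```python
from pathlib import Path

# Extensions taken from the format table; the keys mirror its first column.
# "pdf" is an assumption based on the format name.
TEXT_EXTRACTION_FORMATS = {
    "HTML": {"html", "htm"},
    "XML": {"xml", "odf"},
    "Excel": {"xls", "xlsx"},
    "Word": {"doc", "docx"},
    "PowerPoint": {"ppt"},
    "Publisher": {"pub"},
    "Visio": {"vsd"},
    "Open Document Formats": {"odt", "ods", "odp", "odg", "odf"},
    "Apple iWorks Formats": {"pages", "key", "numbers"},
    "Portable Document Format": {"pdf"},
    "Electronic Publication Format": {"epub"},
    "Rich Text Format": {"rtf"},
}

# Only the examples the guide names; the full list is not given there.
METADATA_ONLY = {"jpg", "png", "wav"}


def classify(path):
    """Return the document type for a path, "metadata only", or None."""
    ext = Path(path).suffix.lower().lstrip(".")
    for doc_type, extensions in TEXT_EXTRACTION_FORMATS.items():
        if ext in extensions:
            return doc_type  # first matching row of the table wins
    if ext in METADATA_ONLY:
        return "metadata only"
    return None  # extension not listed in this reference
```

A file such as `report.docx` would be classified as `Word`, while a `.txt` file returns `None` because this reference does not list it.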
The following table lists text fields that can be extracted from input files:
Field Name | Type | Description | Example(s)/Value(s) |
---|---|---|---|
FILENAME | String | The name of the document that the text in this row originated from. When a directory is read, each file's name is recorded in this field. | C:\Word.doc |
TEXT | String | The text extracted from the input document. The size of this field limits each row to 64K of text. | The quick brown fox jumped over the lazy dog's back. |
ROW_NUM | Integer | The number of the row of text within this file. This value is used when a file contains more than 64K of text and the row order needs to be maintained. | 1 |
When a document contains no text, or no text can be determined, the node outputs a NULL row for it. This may not be the case if OUTPUT_NULL_ROWS is set to false.
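Because a long document is split across several rows, downstream code may need to put the text back together. The following Python sketch shows one way to do that, assuming the rows have already been exported as dictionaries keyed by the field names above. The function `reassemble_text` is hypothetical, and two further assumptions are made: a NULL row carries `None` in TEXT, and consecutive chunks can be joined without a separator.

```python
from collections import defaultdict


def reassemble_text(rows):
    """Join TEXT chunks per FILENAME, ordered by ROW_NUM."""
    chunks = defaultdict(list)
    for row in rows:
        name = row["FILENAME"]
        text = row.get("TEXT")
        if text is None:
            # Assumed shape of a NULL row: keep the file with empty text.
            chunks[name]  # touching the key registers the file
            continue
        chunks[name].append((int(row["ROW_NUM"]), text))
    return {
        name: "".join(text for _, text in sorted(parts, key=lambda p: p[0]))
        for name, parts in chunks.items()
    }
```

Sorting on ROW_NUM rather than relying on arrival order keeps the result stable even if the rows were shuffled by an intermediate step.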
The following table lists metadata fields that can be extracted from input files:
Field Name | Type | Description | Example(s)/Value(s) |
---|---|---|---|
FILENAME | String | The name of the document that this row of data originated from. When a directory is read, each file's name is recorded in this field. | C:\Word.doc |
METADATA_KEY | String | The key or metadata property that is populated for this file. | AUTHOR |
METADATA_VALUE | String | The value of the metadata property. A property can have multiple values, in which case multiple rows may be output for a single key. | JG |
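Since a multi-value property can produce several rows with the same key, it can be convenient to collect the rows into one structure per file. The following Python sketch does this under the assumption that the metadata rows are available as dictionaries keyed by the field names above. The function `group_metadata` is hypothetical and not part of the product.

```python
from collections import defaultdict


def group_metadata(rows):
    """Collect metadata rows into {FILENAME: {METADATA_KEY: [values]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Each value is appended, so repeated keys keep all their values.
        grouped[row["FILENAME"]][row["METADATA_KEY"]].append(row["METADATA_VALUE"])
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {name: dict(keys) for name, keys in grouped.items()}
```

Every key maps to a list, even when only one value is present, so calling code does not need to distinguish single-value from multi-value properties.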
Doc ID: dfDMStd_PFInput_DocConverRef.html