DataFlux Data Management Studio 2.6: User Guide

Document Conversion Reference

The following reference materials are available for the Document Conversion node:

Input Document Formats

The following table lists the input document formats from which the node can extract text:

Document Type | Extensions | Description
HTML | HTML, HTM | Hypertext Markup Language files, which are used for web documents
XML | XML, ODF | Extensible Markup Language files, which are used for many different kinds of files
Excel | XLS, XLSX | Microsoft Excel documents
Word | DOC, DOCX | Microsoft Word documents
PowerPoint | PPT | Microsoft PowerPoint documents
Publisher | PUB | Microsoft Publisher documents
Visio | VSD | Microsoft Visio documents
Embedded Objects | (none listed) | Documents embedded within other documents, such as an Excel document within a Word document
Open Document Formats | ODT, ODS, ODP, ODG, ODF | Open Document Format (ODF) files, such as text (ODT), spreadsheets (ODS), presentations (ODP), and graphics (ODG)
Apple iWorks Formats | PAGES, KEY, NUMBERS | Apple iWorks formats (Pages, Keynote, and Numbers), which are used by Apple's iWorks office suite
Portable Document Format | PDF | Portable Document Format (PDF) documents
Electronic Publication Format | EPUB | Electronic Publication Format files, which are used for many digital books
Rich Text Format | RTF | Rich Text Format (RTF) files, which are used for formatted text

Other formats are supported for metadata extraction only and do not allow text extraction. These include image and audio formats such as JPG, PNG, and WAV.
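
As an illustration only, the format table above can be expressed as a lookup from file extension to document type. This is a minimal sketch and not part of Data Management Studio: the function name `classify_extension`, the dictionary layout, and the return labels "metadata only" and "unknown" are assumptions made for this example. The metadata-only set contains just the three example formats named above.

```python
from pathlib import PureWindowsPath

# Extension -> document type, transcribed from the format table.
# ODF appears in both the XML row and the Open Document Formats row;
# this sketch arbitrarily files it under Open Document Formats.
TEXT_EXTRACTION_TYPES = {
    "HTML": "HTML", "HTM": "HTML",
    "XML": "XML",
    "XLS": "Excel", "XLSX": "Excel",
    "DOC": "Word", "DOCX": "Word",
    "PPT": "PowerPoint",
    "PUB": "Publisher",
    "VSD": "Visio",
    "ODT": "Open Document Formats", "ODS": "Open Document Formats",
    "ODP": "Open Document Formats", "ODG": "Open Document Formats",
    "ODF": "Open Document Formats",
    "PAGES": "Apple iWorks Formats", "KEY": "Apple iWorks Formats",
    "NUMBERS": "Apple iWorks Formats",
    "PDF": "Portable Document Format",
    "EPUB": "Electronic Publication Format",
    "RTF": "Rich Text Format",
}

# Only the examples the text names; the real list is longer.
METADATA_ONLY_EXTENSIONS = {"JPG", "PNG", "WAV"}


def classify_extension(filename: str) -> str:
    """Return the document type for a file name (hypothetical helper)."""
    # PureWindowsPath accepts both backslash and forward-slash separators.
    ext = PureWindowsPath(filename).suffix.lstrip(".").upper()
    if ext in TEXT_EXTRACTION_TYPES:
        return TEXT_EXTRACTION_TYPES[ext]
    if ext in METADATA_ONLY_EXTENSIONS:
        return "metadata only"
    return "unknown"
```

A lookup like this could be used to pre-filter a directory before it is passed to the node, so that files without extractable text are routed to metadata extraction only.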

File Text Extraction Fields

The following table lists text fields that can be extracted from input files:

Field Name | Type | Description | Example(s)/Value(s)
FILENAME | String | The name of the document from which the text in this row originated. When a directory is read, each file name appears in this field. | C:\Word.doc
TEXT | String | The text extracted from the input document. The size of this field limits each row to 64K of text. | The quick brown fox jumped over the lazy dog's back.
ROW_NUM | Integer | The sequence number of this row of text within the file. This value is used to maintain row order when a file contains more than 64K of text. | 1

When a file contains no text, or no text can be determined, a NULL row is output. If OUTPUT_NULL_ROWS is set to false, this NULL row might not be output.
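
The row layout described above can be sketched as follows. This is a hedged illustration, not the node's implementation. It assumes that "64K" means 65,536 characters, that ROW_NUM starts at 1 (as in the example value), that a NULL row carries a NULL TEXT value with ROW_NUM 1, and that setting OUTPUT_NULL_ROWS to false suppresses that row entirely. The names `split_into_rows`, `reassemble`, and `MAX_ROW_CHARS` are hypothetical.

```python
from collections import defaultdict

MAX_ROW_CHARS = 64 * 1024  # assumption: "64K" means 65,536 characters


def split_into_rows(filename, text, output_null_rows=True):
    """Emulate the output rows: one row per chunk of at most MAX_ROW_CHARS."""
    if not text:
        # No extractable text: emit one NULL row, or nothing at all.
        # The shape of the NULL row (ROW_NUM 1, TEXT None) is an assumption.
        if output_null_rows:
            return [{"FILENAME": filename, "TEXT": None, "ROW_NUM": 1}]
        return []
    return [
        {
            "FILENAME": filename,
            "TEXT": text[start:start + MAX_ROW_CHARS],
            "ROW_NUM": row_num,
        }
        for row_num, start in enumerate(
            range(0, len(text), MAX_ROW_CHARS), start=1
        )
    ]


def reassemble(rows):
    """Rebuild each file's text by joining its rows in ROW_NUM order."""
    by_file = defaultdict(list)
    for row in rows:
        by_file[row["FILENAME"]].append(row)
    return {
        name: "".join(
            row["TEXT"] or ""  # a NULL row contributes no text
            for row in sorted(file_rows, key=lambda r: r["ROW_NUM"])
        )
        for name, file_rows in by_file.items()
    }
```

Sorting on ROW_NUM before joining is what makes the result independent of the order in which rows arrive downstream, which is the purpose the field description gives for ROW_NUM.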

File Metadata Extraction Fields

The following table lists metadata fields that can be extracted from input files:

Field Name | Type | Description | Example(s)/Value(s)
FILENAME | String | The name of the document from which the metadata in this row originated. When a directory is read, each file name appears in this field. | C:\Word.doc
METADATA_KEY | String | The key, or metadata property, that is populated for this file. | AUTHOR
METADATA_VALUE | String | The value of the metadata property. Metadata properties can have multiple values, in which case multiple rows are output for a single key. | JG
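
Because a multi-value property produces several rows with the same key, downstream code usually needs to group rows before using them. The following is a minimal sketch under stated assumptions: rows are represented as Python dictionaries keyed by the field names above, and `group_metadata` is a hypothetical helper name. In the sample data, only the AUTHOR value "JG" comes from the table; the other values are invented for illustration.

```python
from collections import defaultdict


def group_metadata(rows):
    """Collect METADATA_VALUE entries per file and per key, keeping row order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["FILENAME"]][row["METADATA_KEY"]].append(
            row["METADATA_VALUE"]
        )
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {name: dict(keys) for name, keys in grouped.items()}


# Sample rows: the second AUTHOR value and the TITLE row are hypothetical.
sample_rows = [
    {"FILENAME": "C:\\Word.doc", "METADATA_KEY": "AUTHOR", "METADATA_VALUE": "JG"},
    {"FILENAME": "C:\\Word.doc", "METADATA_KEY": "AUTHOR", "METADATA_VALUE": "AB"},
    {"FILENAME": "C:\\Word.doc", "METADATA_KEY": "TITLE", "METADATA_VALUE": "Draft"},
]

metadata = group_metadata(sample_rows)
```

Every key maps to a list, even when only one value is present, so callers do not need to special-case single-value properties.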

Related Topics

Document Conversion Node

Documentation Feedback: yourturn@sas.com
Note: Always include the Doc ID when providing documentation feedback.

Doc ID: dfDMStd_PFInput_DocConverRef.html