Document Conversion Reference

DataFlux Data Management Studio 2.6: User Guide

The following reference materials are available for the Document Conversion node:

Input Document Formats
File Text Extraction Fields
File Metadata Extraction Fields

Input Document Formats

The following table lists the input document formats that the Document Conversion node recognizes and from which it can extract text:

Document Type	Extensions	Description
HTML	HTML, HTM	Hypertext Markup Language files used for web documents
XML	XML, ODF	Extensible Markup Language (XML) files, a format that is used for many different kinds of files
Excel	XLS, XLSX	Microsoft Excel Documents
Word	DOC, DOCX	Microsoft Word Documents
PowerPoint	PPT	Microsoft PowerPoint Documents
Publisher	PUB	Microsoft Publisher Documents
Visio	VSD	Microsoft Visio Documents
Embedded Objects		Documents embedded within other documents, such as an Excel document within a Word document
Open Document Formats	ODT, ODS, ODP, ODG, ODF	Open Document Format (ODF) formatted files and types, such as text (odt), spreadsheets (ods), presentations (odp), and graphics (odg)
Apple iWork Formats	PAGES, KEY, NUMBERS	Apple iWork formats, such as Numbers, Pages, and Keynote, used by Apple’s iWork office suite
Portable Document Format	PDF	Portable Document Format (PDF) documents
Electronic Publication Format	EPUB	Electronic Publication Format, which is used for many digital books
Rich Text Format	RTF	Rich Text Format (RTF), which is used for formatted text

Other formats support metadata extraction but not text extraction. These include image and audio formats such as JPG, PNG, and WAV.
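As a rough illustration of the table above, the following Python sketch checks whether a file name carries one of the listed extensions. The helper name supports_text_extraction and the set TEXT_EXTENSIONS are hypothetical and are not part of DataFlux. The sketch also assumes that matching on the extension alone is sufficient, which may not reflect how the Document Conversion node actually identifies formats.

```python
from pathlib import PurePath

# Extensions copied from the Input Document Formats table.
# The set name is illustrative only; it is not a DataFlux setting.
TEXT_EXTENSIONS = {
    "HTML", "HTM", "XML", "ODF", "XLS", "XLSX", "DOC", "DOCX", "PPT", "PUB",
    "VSD", "ODT", "ODS", "ODP", "ODG", "PAGES", "KEY", "NUMBERS", "PDF",
    "EPUB", "RTF",
}


def supports_text_extraction(filename):
    """Return True if the file extension appears in the table above.

    Assumption: the extension alone decides the outcome. A file with no
    extension, or an image or audio extension such as JPG, returns False.
    """
    extension = PurePath(filename).suffix.lstrip(".").upper()
    return extension in TEXT_EXTENSIONS
```

A check like this could be used to pre-filter a directory listing before conversion, although embedded objects have no extension of their own and are not covered by it.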

File Text Extraction Fields

The following table lists text fields that can be extracted from input files:

Field Name	Type	Description	Example(s)/Value(s)
FILENAME	String	The name of the document that the text in this row of data originated from. When a directory is read, the name of each file appears in this field for its rows.	C:\Word.doc
TEXT	String	The text extracted from the input document. The size of this field limits each row to 64K of text.	The quick brown fox jumped over the lazy dogs back.
ROW_NUM	Integer	The number of this row of text within the file named in FILENAME. This value is used to maintain row order when a file contains more than 64K of text.	1

When a file contains no text, or no text can be determined, a NULL row is output for that file. This might not be the case if OUTPUT_NULL_ROWS is set to false.
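Because text longer than 64K is spread over several rows, a downstream consumer may need to put the pieces back together. The following Python sketch shows one way to do that, assuming the node's output has been read into a list of dictionaries keyed by the field names above. The function name reassemble_text is hypothetical, and the sketch assumes that the chunks can be joined without a separator and that a NULL row has a TEXT value of None.

```python
from collections import defaultdict


def reassemble_text(rows):
    """Join the TEXT chunks of each file in ROW_NUM order.

    rows: iterable of dicts with FILENAME, TEXT, and ROW_NUM keys.
    A NULL row (TEXT is None) yields an empty string for its file.
    """
    chunks = defaultdict(list)
    for row in rows:
        parts = chunks[row["FILENAME"]]  # registers the file even for a NULL row
        if row["TEXT"] is not None:
            parts.append((row["ROW_NUM"], row["TEXT"]))

    documents = {}
    for filename, parts in chunks.items():
        parts.sort(key=lambda part: part[0])  # restore the original row order
        documents[filename] = "".join(text for _, text in parts)
    return documents
```

Sorting on ROW_NUM rather than relying on arrival order keeps the result stable even if intermediate steps reorder the rows.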

File Metadata Extraction Fields

The following table lists metadata fields that can be extracted from input files:

Field Name	Type	Description	Example(s)/Value(s)
FILENAME	String	The name of the document that this row of data originated from. When a directory is read, the name of each file appears in this field for its rows.	C:\Word.doc
METADATA_KEY	String	The key or metadata property that is populated for this file.	AUTHOR
METADATA_VALUE	String	The value of the metadata property. A property can have multiple values, in which case multiple rows can be output for a single key.	JG
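Since a multi-value property produces several rows with the same key, it can be convenient to collect the values per file and per key. The following Python sketch does this under the assumption that the output rows are available as dictionaries keyed by the field names above. The function name group_metadata is hypothetical and is not part of DataFlux.

```python
from collections import defaultdict


def group_metadata(rows):
    """Collect METADATA_VALUE entries into lists, per file and per key.

    rows: iterable of dicts with FILENAME, METADATA_KEY, and
    METADATA_VALUE keys. Values keep the order in which they arrive.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["FILENAME"]][row["METADATA_KEY"]].append(row["METADATA_VALUE"])
    # Convert to plain dicts so missing keys raise instead of being created.
    return {filename: dict(keys) for filename, keys in grouped.items()}
```

Storing every value in a list, even for single-value properties, means callers do not need to distinguish the two cases.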