DataFlux Data Management Studio 2.5: User Guide

Converting and Extracting a Document

The information that you need to process is not always found in traditional databases. For example, you might need to take data from a Microsoft Word file or an HTML file and convert it into a format that you can process in a DataFlux Data Management Studio job. You can address this problem with a job that converts the original source document into UTF-8 text, extracts the text, and inserts it into a text file that can be consumed later. Perform the following tasks:
  • Select a Source Document
  • Convert and Extract the Text
  • Review the Text

Select a Source Document

You can use the Document Conversion node to convert a variety of formats into UTF-8 text. The formats listed in the following table are supported:

Document Type                  Extensions               Description
HTML                           HTML, HTM                Hypertext Markup Language files used for web documents
XML                            XML, ODF                 Extensible Markup Language formatted files, which are used for many different kinds of files
Excel                          XLS, XLSX                Microsoft Excel documents
Word                           DOC, DOCX                Microsoft Word documents
PowerPoint                     PPT                      Microsoft PowerPoint documents later than 2003
Publisher                      PUB                      Microsoft Publisher documents
Visio                          VSD                      Microsoft Visio documents
Embedded Objects                                        Documents embedded within other documents, such as an Excel document within a Word document
Open Document Formats          ODT, ODS, ODP, ODG, ODF  Open Document Format (ODF) formatted files and types, such as text (odt), spreadsheets (ods), presentations (odp), and graphics (odg)
Apple iWork Formats            PAGES, KEY, NUMBERS      Apple iWork formats, such as Numbers, Pages, and Keynote, used by Apple's iWork office suite
Portable Document Format       PDF                      Portable Document Format (PDF) documents
Electronic Publication Format  EPUB                     Electronic Publication Format, which is used for many digital books
Rich Text Format               RTF                      Rich Text Format (RTF), which is used for formatted text

You can also extract only the metadata from a selected source file. Some additional formats support metadata extraction only and do not allow text extraction. These include image and audio formats such as JPG, PNG, and WAV.
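
Before you build a job, it can help to check which category a candidate source file falls into. The following Python sketch is not part of DataFlux Data Management Studio. It simply encodes the extensions listed in this section in a lookup, and the names TEXT_FORMATS, METADATA_ONLY, and classify are hypothetical. The metadata-only set contains just the three examples named above, so it is probably incomplete.

```python
from pathlib import PureWindowsPath

# Extensions copied from the supported-formats table in this section.
# ODF is listed under two document types in the table, so a file can match both.
TEXT_FORMATS = {
    "HTML": ["HTML", "HTM"],
    "XML": ["XML", "ODF"],
    "Excel": ["XLS", "XLSX"],
    "Word": ["DOC", "DOCX"],
    "PowerPoint": ["PPT"],
    "Publisher": ["PUB"],
    "Visio": ["VSD"],
    "Open Document Formats": ["ODT", "ODS", "ODP", "ODG", "ODF"],
    "Apple iWork Formats": ["PAGES", "KEY", "NUMBERS"],
    "Portable Document Format": ["PDF"],
    "Electronic Publication Format": ["EPUB"],
    "Rich Text Format": ["RTF"],
}

# Only the examples named in the text; the real list is assumed to be longer.
METADATA_ONLY = {"JPG", "PNG", "WAV"}


def classify(path):
    """Return (category, matching document types) for a file path.

    category is "text" (text and metadata extraction), "metadata-only",
    or "unsupported" (not mentioned in this section).
    """
    # PureWindowsPath accepts both backslash and forward-slash separators.
    ext = PureWindowsPath(path).suffix.lstrip(".").upper()
    types = [name for name, exts in TEXT_FORMATS.items() if ext in exts]
    if types:
        return ("text", types)
    if ext in METADATA_ONLY:
        return ("metadata-only", [])
    return ("unsupported", [])
```

For example, classify(r"C:\Data\Staff_List.xlsx") reports the file as a text-extractable Excel document, while a JPG file is reported as metadata-only.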

Convert and Extract the Text

After you select a source, you can create a data job to run the conversion and extraction. For example, a data job can convert a Microsoft Excel file that contains staff information into a UTF-8 file. Then, it extracts the data from the UTF-8 file and inserts it into a text file, which can be used for further processing in the data job or another DataFlux Data Management Studio application. The Data Flow tab in the job is shown in the following display:

Note that the job contains the Document Conversion, Document Extraction, and Text File Output nodes. The following table lists the main configuration settings for these nodes:

Data Flow Node Setting
Document Conversion
  • Input file: C:\Data\Staff_List.xlsx
Document Extraction
  • Source field: TEXT
  • Language: English
Text File Output
  • Output file: C:\Data\Staff_out.txt
  • Encoding: System Default
  • Display file after job runs

Additional settings are available to control how the input is converted and extracted and how the output is displayed.
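
The three nodes form a simple convert, extract, and write pipeline. The following Python sketch mirrors that flow outside of DataFlux Data Management Studio, using only the standard library and an HTML source instead of Excel to keep it short. It is an illustration of the data flow, not the product's implementation. All function and class names are hypothetical, and the extraction step is a naive stand-in that keeps every text node, including script and style content.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def convert_to_utf8(raw, source_encoding):
    """Stand-in for Document Conversion: re-encode source bytes as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def extract_text(utf8_bytes):
    """Stand-in for Document Extraction: pull plain text out of the markup."""
    collector = TextCollector()
    collector.feed(utf8_bytes.decode("utf-8"))
    collector.close()
    return "\n".join(collector.chunks)


def write_text_file(text, output_path, encoding=None):
    """Stand-in for Text File Output; encoding=None uses the system default."""
    Path(output_path).write_text(text, encoding=encoding)


def run_job(raw, source_encoding, output_path, encoding=None):
    """Run the three stages in order and return the extracted text."""
    text = extract_text(convert_to_utf8(raw, source_encoding))
    write_text_file(text, output_path, encoding)
    return text
```

Leaving encoding as None corresponds to the System Default setting in the table above. Pass an explicit encoding such as "utf-8" when the output must be portable across machines.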

Review the Text

The following display shows the original Excel data:

In contrast, the following display shows the converted and extracted data in a text file:
