DataFlux Data Management Studio 2.6: User Guide
The information that you need to process is not always found in traditional databases. For example, you might need to take data from a Microsoft Word file or an HTML file and convert it into a format that you can process in a DataFlux Data Management Studio job. You can address this problem with a job that converts the original source document into UTF-8 text, extracts the text, and inserts it into a text file that can be consumed later. Perform the following tasks:
You can use the Document Conversion node to convert a variety of formats into UTF-8 text. The formats listed in the following table are supported:
Document Type | Extensions | Description |
---|---|---|
HTML | HTML, HTM | Hypertext Markup Language files used for web documents |
XML | XML, ODF | Extensible Markup Language formatted files, which are used for many different kinds of files |
Excel | XLS, XLSX | Microsoft Excel Documents |
Word | DOC, DOCX | Microsoft Word Documents |
PowerPoint | PPT | Microsoft PowerPoint Documents later than 2003 |
Publisher | PUB | Microsoft Publisher Documents |
Visio | VSD | Microsoft Visio Documents |
Embedded Objects | | Documents embedded within other documents, such as an Excel document within a Word document |
Open Document Formats | ODT, ODS, ODP, ODG, ODF | Open Document Format (ODF) formatted files and types, such as text (odt), spreadsheets (ods), presentations (odp), and graphics (odg) |
Apple iWorks Formats | PAGES, KEY, NUMBERS | Apple iWorks formats, such as Numbers, Pages, and Keynote used by Apple's iWorks Office Suite |
Portable Document Format | PDF | Portable Document Format (PDF) documents |
Electronic Publication Format | EPUB | Electronic Publication Format, which is used for many digital books |
Rich Text Format | RTF | Rich Text Format (RTF), which is used for formatted text |
You can also extract only the metadata from a selected source file. There are other formats supported for metadata extraction only that do not allow for textual extraction. These formats include image and audio formats such as JPG, PNG, and WAV.
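The distinction between formats that support text extraction and formats that support metadata extraction only can be illustrated with a small lookup. The following Python sketch is not part of DataFlux Data Management Studio. The function name `support_level` and the set names are hypothetical, `pdf` is assumed as the extension for Portable Document Format, and the metadata-only set contains only the three examples named above.

```python
from pathlib import Path

# Extensions taken from the supported-formats table (text and metadata).
# "pdf" is an assumption for the Portable Document Format row.
TEXT_AND_METADATA = {
    "html", "htm", "xml", "odf", "xls", "xlsx", "doc", "docx", "ppt",
    "pub", "vsd", "odt", "ods", "odp", "odg", "pages", "key", "numbers",
    "pdf", "epub", "rtf",
}

# Only the examples named in the text; the full product list may be longer.
METADATA_ONLY = {"jpg", "png", "wav"}


def support_level(filename: str) -> str:
    """Classify a file by its extension, ignoring case."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext in TEXT_AND_METADATA:
        return "text and metadata"
    if ext in METADATA_ONLY:
        return "metadata only"
    return "not listed"
```

For example, `support_level("staff.XLSX")` falls in the first group, while `support_level("logo.png")` falls in the metadata-only group.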
Once you select a source, you can create a data job to run the conversion and extraction. For example, a data job can convert a Microsoft Excel file that contains staff information into a UTF-8 file. Then, it extracts the data from the UTF-8 file and inserts it into a text file, which can be used for further processing in the data job or another DataFlux Data Management Studio application. The Data Flow tab in the job is shown in the following display:
Note that the job contains the Document Conversion, Document Extraction, and Text File Output nodes. The following table lists the main configuration settings for these nodes:
Data Job Node | Setting |
---|---|
Document Conversion | |
Document Extraction | |
Text File Output | |
Additional settings are available to control how the input is converted and extracted and how the output is displayed.
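The three-stage flow of conversion, extraction, and text file output can also be sketched outside the product. The following Python example is only an analogy built on the standard library. It handles HTML input rather than Excel, and the names `convert_to_utf8_text`, `extract_text`, and `write_text_file` are hypothetical; they do not correspond to settings or APIs of the DataFlux nodes.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects the text between tags, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def convert_to_utf8_text(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Stage 1 (conversion): decode the source bytes into text."""
    return raw.decode(source_encoding, errors="replace")


def extract_text(markup: str) -> list[str]:
    """Stage 2 (extraction): pull the readable text out of the markup."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


def write_text_file(lines: list[str], path: Path) -> None:
    """Stage 3 (output): write one extracted chunk per line, encoded as UTF-8."""
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Calling the three functions in order on the bytes of an HTML file yields a UTF-8 text file with one extracted text chunk per line, which mirrors the order of the nodes in the data job.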
The following display shows the original Excel data:
In contrast, the following display shows the converted and extracted data in a text file: