Other Applications That Connect to Data

SAS Data Loader for Hadoop

SAS Data Loader for Hadoop provides a point-and-click interface for moving, cleansing, and analyzing data in Hadoop. It enables business users and data scientists to perform self-service data preparation on a Hadoop cluster. Its Copy Data to Hadoop, Import a File, and Copy Data from Hadoop directives enable you to move data from your file system or from database management systems into and out of Hadoop. In particular, the Import a File directive helps you import miscellaneous files from your file system into Hadoop as columnar tables.
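The directives themselves are point-and-click, but it can help to picture the data movement that they perform. The following is a minimal Base SAS sketch of copying a table into Hive through SAS/ACCESS Interface to Hadoop; the server, port, and schema values are hypothetical placeholders, not settings taken from SAS Data Loader for Hadoop.

  /* Hypothetical sketch: copy a SAS table into Hive.          */
  /* The server, port, and schema values are placeholders.     */
  libname hdp hadoop server="hive.example.com" port=10000 schema=default;

  data hdp.customers;      /* creates a Hive table named CUSTOMERS */
     set work.customers;   /* from a SAS table in the WORK library */
  run;

  libname hdp clear;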

SAS Data Integration Studio

SAS Data Integration Studio supports data access with the following components:
  • external file tools
  • tools for accessing web services, message queues, and XML data
  • SAS table and relational database table tools
  • high-performance data tools
  • Hadoop and SAS LASR Analytic Server tools
SAS Data Integration Studio enables you to import and manage external files. An external file, sometimes called a flat file or a raw data file, is a plain text file that often contains one record per line. Within each record, the fields can have a fixed length or they can be separated by delimiters, such as commas. Like SAS or DBMS tables, external files can be used as inputs and outputs in SAS Data Integration Studio jobs.
Unlike SAS or DBMS tables, which are accessed with SAS LIBNAME engines, external files are accessed with SAS INFILE and FILE statements. Accordingly, external files have their own registration wizards for processing delimited external files, fixed-width external files, and external files with user-written code. These registration wizards enable you to create metadata for external files. The metadata is saved to a SAS Metadata Server, where SAS Data Integration Studio can access it.
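For example, the metadata for a comma-delimited external file corresponds to a DATA step that reads the file with an INFILE statement. The following is a minimal sketch; the path and columns are hypothetical.

  /* Read a comma-delimited external file with an INFILE statement. */
  /* The path and columns are hypothetical examples.                */
  data work.orders;
     infile '/data/orders.csv' dlm=',' dsd firstobs=2;  /* skip the header row */
     input order_id customer :$40. amount;
  run;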
The following display shows the Column Definitions page from the fixed-width external file wizard:
Fixed-Width Column Definitions
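The column definitions that you enter on this page correspond to fixed starting and ending positions in each record. In Base SAS terms, reading such a file resembles the following sketch; the path and column positions are hypothetical.

  /* Read a fixed-width external file by column position. */
  /* The path and column positions are hypothetical.      */
  data work.accounts;
     infile '/data/accounts.dat';
     input name    $ 1-20    /* columns 1-20  */
           region  $ 21-28   /* columns 21-28 */
           balance   29-38;  /* columns 29-38 */
  run;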
You can use the REST and SOAP transformations to access third-party web services, the queue writer and queue reader transformations to work with Microsoft MSMQ and IBM WebSphere MQ message queues, and the XML Reader and XML Writer transformations to work with XML data. All of these transformations enable you to use these data sources in SAS Data Integration Studio jobs.
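For comparison, a REST call issued from Base SAS with PROC HTTP looks like the following sketch. The URL is a hypothetical placeholder; the REST transformation assembles comparable requests for you inside a job.

  /* Hypothetical sketch: issue a GET request against a REST endpoint. */
  filename resp temp;

  proc http
     url="https://api.example.com/v1/orders"   /* placeholder URL */
     method="GET"
     out=resp;
  run;

  /* Print the raw response to the log. */
  data _null_;
     infile resp;
     input;
     put _infile_;
  run;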

DataFlux Data Management Studio

DataFlux Data Management Studio provides connection definitions that you can use to connect to a wide variety of data sources.
The following connection types are supported:
  • ODBC connections
  • domain-enabled ODBC connections
  • ODBC connections for Excel
  • ODBC connections for Hadoop
  • SAS data set connections
  • Federation Server connections
  • Custom SAP connections
  • Custom SQLite connections
  • SQL Queries
  • XML files
  • Java Message Queues
  • documents from third-party applications such as Adobe Acrobat, Microsoft PowerPoint, and Microsoft Visio
Most types of data connections in DataFlux Data Management Studio are set up in the ODBC Data Source Administrator window. You can scroll the list of data types to select the type of connection that you need, and then use the tabs in the window to set options and parameters for the connection. For the current list of available databases, see the “Supported Databases for Data Storage” section of the “Working with Databases” topic, which is located in the “Data Riser Bar” chapter of the DataFlux Data Management Studio: User’s Guide for your version of DataFlux Data Management Studio.
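Any DSN that is defined in this window can also be reached by other ODBC clients. As a rough illustration only, a hypothetical Base SAS connection to the same DSN through SAS/ACCESS Interface to ODBC looks like the following; the DSN name and credentials are placeholders, and DataFlux Data Management Studio itself reads the DSN through its own connection definitions, not through a LIBNAME.

  /* Hypothetical sketch: connect to an ODBC DSN from Base SAS. */
  libname srcdb odbc dsn="SalesDW" user=analyst password="XXXXXXXX";

  proc print data=srcdb.customers(obs=5);  /* peek at the first rows */
  run;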
You can create domain-enabled ODBC connections that reference an ODBC connection and an appropriate authentication server domain, so that users do not have to repeatedly authenticate to the connection.
Domain-enabled ODBC connections are based on the following prerequisites:
  • a standard ODBC DSN for the data source that you want to access
  • an authentication domain, user, and login for this ODBC DSN
Note: Domain-enabled connections cannot use shared logins.
Specialized ODBC connections for Excel and Hadoop simplify the process of accessing these data types. The Excel process enables you to select the appropriate driver for your version of Excel and create an ODBC DSN to read named ranges in an Excel spreadsheet. The Hadoop process enables you to select the appropriate DataFlux Apache Hive Wire Protocol driver or the DataFlux Impala Wire Protocol driver for your site. Then, you can click System DSN to create a connection to a data source that all users on the machine can access.
SAS data sets can be accessed through the SAS Data Set Connection window. The connection points to a folder on the file system that contains one or more SAS data sets, and the data is accessed directly on disk without mediation by a SAS Application Server.
The host that executes the connection must be able to access the folder that contains the SAS data sets. For example, DataFlux Data Management Studio runs on a Windows host. If the SAS data sets are on a UNIX host, you need a networking protocol such as Samba (SMB/CIFS) or a network file system (NFS) that exposes the UNIX file system as a Windows directory.
You configure these connections in the same window. For example, you can specify an access level and whether the data should be compressed, and you can set options for features such as table locking and encryption. Finally, you can inspect the connection string to verify that the appropriate options, such as the encoding, have been selected for a given connection.
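These settings have close analogs in Base SAS. As a rough analogy only (the path is a hypothetical placeholder), a read-only connection with compression enabled corresponds to a LIBNAME statement like this:

  /* Rough Base SAS analogy for a SAS data set connection:       */
  /* point at a folder of data sets, read-only, with compression */
  /* requested for any output. The path is a placeholder.        */
  libname dfdata 'C:\data\sasdatasets' access=readonly compress=yes;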
If a SAS Federation Server is available on your site, you can use the Data riser to connect to that server and access the DSN connections that it manages. The Federation Server Connection window enables you to specify a server and port for the connection. It also supports compression and credentials settings, and you can test the connection to the server.
You can add a user-defined connection to an SAP system. This connection could be used as the data source for the SAP Remote Function Call node, a data job node. SAP libraries (DLLs) must be installed on all computers where this custom connection is used.
You can also add a user-defined connection for an SQLite database file. For example, an SQLite connection can be used in the definition for an Address Update repository.
DataFlux Data Management Studio contains a set of data job nodes that enable you to use XML data and XML column data as inputs and outputs in data jobs. It also supports the Java Message Service (JMS), a Java API that enables applications to create, send, receive, and read messages; JMS messages are handled with data job and process job reader and writer nodes. Web services are supported with the Web Service and HTTP Request data job nodes. Finally, you can use the Document Extraction data job node to extract information from sources that are not traditional databases, such as a Microsoft Word file or an HTML file, and convert it into a format that you can process in a DataFlux Data Management Studio job.
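The kind of XML that maps cleanly onto a table has one repeating element per row. For comparison, Base SAS can read such rowset-style XML with the XMLV2 LIBNAME engine, as in the following sketch; the file path and element names are hypothetical.

  /* Hypothetical sketch: read rowset-style XML, where each        */
  /* <order> element becomes one row of a table named ORDER, e.g.: */
  /*   <orders>                                                    */
  /*     <order><id>1</id><amount>99.50</amount></order>           */
  /*   </orders>                                                   */
  filename src '/data/orders.xml';
  libname src xmlv2;          /* libref matches the fileref */

  proc print data=src.order;  /* table name comes from the element name */
  run;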