Establishing Connectivity to Hadoop :: SAS(R) 9.3 Intelligence Platform: Data Administration Guide, Second Edition

Overview of Establishing Connectivity to Hadoop

The following figure provides a logical view of using the SAS/ACCESS Interface to Hadoop to access a Hive Server. The Hive Server is shown running on the same machine as the Hadoop NameNode.

[Figure: Establishing Connectivity to Hadoop Servers. Topology graphic for Hadoop via Hive Library.]

Setting up a connection from SAS to a Hadoop Server is a two-stage process:

Register the Hadoop Server.
Register the Hadoop via Hive library.

This example shows the process for establishing a SAS connection to a Hive Server. For the SAS/ACCESS Interface to connect to the Hive Server, the machine that hosts the SAS Workspace Server must be configured with several JAR files. These JAR files are used to make a JDBC connection to the Hive Server. This example assumes that the following prerequisites have been satisfied:

installation of SAS/ACCESS Interface to Hadoop. For configuration information, see the Install Center at http://support.sas.com/documentation/installcenter/93 and use the operating system and SAS version to locate the appropriate SAS Foundation Configuration Guide.
installation of the Hadoop JAR files required by SAS.
setting the SAS_HADOOP_JAR_PATH environment variable.
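
Before you register the server, it can help to confirm the last two prerequisites on the SAS Workspace Server machine. The following Python sketch is not part of SAS; it assumes that SAS_HADOOP_JAR_PATH names a single directory, and the function name `list_hadoop_jars` is hypothetical. It reports only which JAR files are present, because the set of required JAR files depends on your Hadoop distribution.

```python
import os
from pathlib import Path


def list_hadoop_jars(environ=os.environ):
    """Return the sorted JAR file names found in SAS_HADOOP_JAR_PATH.

    Raises RuntimeError if the variable is unset or does not name a directory.
    Assumption: the variable holds one directory, not a list of paths.
    """
    jar_path = environ.get("SAS_HADOOP_JAR_PATH")
    if not jar_path:
        raise RuntimeError("SAS_HADOOP_JAR_PATH is not set")
    directory = Path(jar_path)
    if not directory.is_dir():
        raise RuntimeError(f"{jar_path} is not a directory")
    # Report names only; this check cannot tell whether the set is complete.
    return sorted(p.name for p in directory.glob("*.jar"))
```

Calling `list_hadoop_jars()` in the same environment that starts the SAS Workspace Server returns the JAR file names, or raises an error if the variable is missing.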

This section describes the steps that are used to access data in Hadoop as tables through a Hive Server. In addition, SAS Data Integration Studio offers a series of transformations that can be used to access the Hadoop Distributed File System (HDFS), submit Pig code, and submit MapReduce jobs.

Stage 1: Register the Hadoop Server

To register the Hadoop Server, perform the following steps:

Open the SAS Management Console application.
Right-click Server Manager and select the New Server option to access the New Server wizard.
Select Hadoop Server from the Cloud Servers list. Then, click Next.
Enter an appropriate server name in the Name field (for example, Hadoop Server). You can supply an optional description. Click Next.

Enter the following server properties:

Server Properties

Field	Sample Value
Major Version Number	`0`
Minor Version Number	`20`
Software Version	`204`
Vendor	`Apache`
Associated Machine	Select the host name for the Hadoop NameNode from the menu or click New and specify the host name.

Click Next.

Enter the following connection properties:

Connection Properties

Field	Sample Value
Hadoop NameNode	Specify the host name of the machine that is running the Hadoop NameNode.
Install Location	For deployments that use SAS Visual Analytics Hadoop, specify the path to TKGrid on the machines in the cluster.
Database	Use the default value of `DEFAULT`.
Port Number	Specify the network port number for the Hive Server.
Authentication type	`User/Password`
Authentication domain	`HadoopAuth` (You might need to create a new authentication domain. To do so, click New to access the New Authentication Domain dialog box. Then enter the appropriate value in the Name field and click OK to save the setting. For more information, see How to Store Passwords for a Third-Party Server in SAS Intelligence Platform: Security Administration Guide.)
Configuration	Specify the values for the fs.default.name and mapred.job.tracker properties that are used in the Hadoop cluster.
DFS Http Address	Specify the host name and port for the Hadoop NameNode HTTP address.
DFS Secondary Http Address	Specify the host name and port for the Hadoop SecondaryNameNode HTTP address.
Job Tracker Http Address	Specify the host name and port for the Job Tracker HTTP address.

Examine the final page of the wizard to ensure that the proper values have been entered. Click Finish to save the wizard settings.
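
As an illustration of the Configuration field in the connection properties, the following Python sketch builds the two properties in the property XML layout that Hadoop uses in its own configuration files. It assumes that the field accepts that layout; the host name, the ports, and the function name `build_configuration` are hypothetical, so substitute the values that your Hadoop administrator provides.

```python
import xml.etree.ElementTree as ET


def build_configuration(namenode_host, namenode_port, jobtracker_host, jobtracker_port):
    """Build Hadoop-style configuration XML for the two properties named above."""
    properties = {
        "fs.default.name": f"hdfs://{namenode_host}:{namenode_port}",
        "mapred.job.tracker": f"{jobtracker_host}:{jobtracker_port}",
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        # Each setting is a <property> element with <name> and <value> children.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical host name and ports; use the values from your cluster.
print(build_configuration("namenode.example.com", 8020, "namenode.example.com", 8021))
```

The sketch prints a single `<configuration>` element that contains one `<property>` element for each setting.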

Stage 2: Register the Hadoop via Hive Library

After you have registered the Hadoop Server, register the library. To register the Hadoop via Hive library, perform the following steps:

In SAS Management Console, expand Data Library Manager. Right-click Libraries. Then, select the New Library option to access the New Library wizard.
Select Hadoop via Hive Library from the Database Data list. Click Next.
Enter an appropriate library name in the Name field (for example, Hive Library). You can supply an optional description. Click Next.
Select a SAS server from the list and use the right arrow to assign the SAS server. This step makes the library available to the server and makes the library visible to users of the server. Click Next.

Enter the following library properties:

Library Properties

Field	Sample Value
Libref	`HIVEREF`
Engine	`HADOOP`

You can also click Advanced Options to perform tasks such as pre-assignment and optimization. Click Next to access the next page of the wizard.

Enter the following settings:

Server and Connection Information

Field	Sample Value
Database Server	`Hadoop Server` (Use the Hadoop Server that you created in the New Server wizard.)
Database Schema Name	See your Hadoop administrator for the correct value.
Connection	Use the default value of Connection: server_name.
Default Login	Use the default value of `(None)`.

Click Next.

Examine the final page of the wizard to ensure that the proper values have been entered. Click Finish to save the library settings. At this point, register tables as explained in Registering and Verifying Tables.
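
To see how the wizard fields relate to SAS code, the following Python sketch assembles the text of a LIBNAME statement for the HADOOP engine from the libref, server, port, and schema values. It is an approximation rather than the exact statement that the metadata library generates. The function name `build_libname` and the host name are hypothetical, and credentials are omitted because this example stores them in the authentication domain.

```python
def build_libname(libref, server, port, schema):
    """Return the text of a SAS LIBNAME statement for the HADOOP engine."""
    if len(libref) > 8:
        raise ValueError("a SAS libref is limited to 8 characters")
    # SERVER=, PORT=, and SCHEMA= are connection options of the HADOOP engine.
    return f'libname {libref} hadoop server="{server}" port={port} schema={schema};'


# Hypothetical host name; 10000 is the usual default port for a Hive Server.
print(build_libname("HIVEREF", "namenode.example.com", 10000, "default"))
```

The libref matches the Libref field, the server and port match the Hadoop Server connection properties, and the schema matches the Database Schema Name field.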

Special Considerations for Hadoop via Hive Tables

Hadoop via Hive tables can be registered in metadata with clients such as SAS Management Console and SAS Data Integration Studio. However, table metadata cannot be updated after the table is registered in metadata.