Using SPD Server with Hadoop

Overview

Hadoop is an open-source software framework under the Apache Software Foundation. You use Hadoop to store and process big data in a distributed fashion on large clusters of commodity hardware. Hadoop is an attractive technology for a number of reasons:
  • Cost-effective alternative for data storage
  • Scalable architecture through distributed data across multiple machines
  • Data replication provides failure resiliency
Beginning with SPD Server 5.2, SPD Server can read, write, and update tables in the Hadoop environment. SPD Server support for Hadoop is enabled by the SPD Server administrator. Changes to the SPD Server configuration must be made in the LIBNAME parameter file and the server parameter file. New SPD Server Hadoop options have also been introduced to specify Hadoop operational parameters. SPD Server 5.2 supports Hadoop access only on the Linux platform.

Pre-installation Checklist for Interfacing with Hadoop

A good understanding of your Hadoop environment is critical to a successful installation of an SPD Server that interfaces with Hadoop. Before you install SPD Server software that interfaces with Hadoop, it is recommended that you verify your Hadoop environment by becoming familiar with the following items:
  • Gain working knowledge of the Hadoop distribution that you are using (for example, Cloudera). You also need working knowledge of the Hadoop Distributed File System (HDFS) and of the MapReduce 1, MapReduce 2, and YARN services. For more information, see the Apache website or the vendor’s website.
  • Ensure that the HDFS, MapReduce, and YARN services are running on the Hadoop cluster.
  • Know the location of the MapReduce home.
  • Know the host name of the NameNode.
  • Determine where the HDFS server is running.
  • Understand and verify your Hadoop user authentication.
  • Understand and verify your security setup. It is highly recommended that you enable Kerberos for data security.
  • Verify that you can connect to your Hadoop cluster (HDFS) from your client machine outside of the SPD Server environment with your defined security protocol (for example, with a simple HDFS listing, as shown below).
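If the Hadoop client software is installed on the client machine, one simple connectivity check is an HDFS directory listing. This is only a sketch; the directory that you list depends on your site, and if the cluster is secured with Kerberos, you must obtain a ticket with kinit first.
hadoop fs -ls /user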

Steps to Configure SPD Server for Use with Hadoop

  1. Verify that all SPD Server Hadoop prerequisites have been satisfied. This step ensures that you understand your Hadoop environment. For more information, see Hadoop Prerequisites for SPD Server.
  2. Make Hadoop JAR files available to the SPD Server machine by using the SPD Server HADOOPJAR= option (in the server parameter file or the LIBNAME parameter file) to specify the location of the files. This step involves copying a set of JAR files to the SPD Server machine that accesses Hadoop. For more information, see Make Required Hadoop JAR Files Available to SPD Server.
  3. Make Hadoop configuration files available to the SPD Server machine. For more information, see Make Required Hadoop Cluster Configuration Files Available to SPD Server.
  4. Run basic tests to confirm that your Hadoop connections are working. For more information, see Validating the SPD Server to Hadoop Connection.

Hadoop Prerequisites for SPD Server

Install SPD Server 5.2

Install SPD Server 5.2 by following the standard installation instructions. For more information about installing SPD Server, see Installing SAS Scalable Performance Data (SPD) Server on UNIX.

Supported Hadoop Distributions for SPD Server

SPD Server 5.2 supports the following distributions:
  • Cloudera CDH 5.2
  • Hortonworks Data Platform (HDP) 2.1

Hadoop Configuration File and JRE Requirements

The Hadoop configuration file must include the properties to run MapReduce 1 (MR1) or MapReduce 2 (MR2) and YARN. The JRE version for the Hadoop cluster must be either 1.6 (the default configuration) or 1.7. If the JRE version for the Hadoop cluster is 1.7, use the HADOOPACCELJVER= server parameter file setting to specify the updated JRE version.
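For example, if the Hadoop cluster runs JRE 1.7, the following server parameter file setting specifies the updated version:
HADOOPACCELJVER=1.7;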

Make Required Hadoop JAR Files Available to SPD Server

In order to interoperate with a Hadoop environment, you must make certain Hadoop JAR files available to the SPD Server machine. In the context of Hadoop, SPD Server is a client to the Hadoop cluster. To make the required JAR files available, you must define the HADOOPJAR= option in the SPD Server parameter file and specify HADOOP=YES for each domain definition in the LIBNAME parameter file that you want to be a Hadoop domain.
Alternatively, you can use a HADOOPJAR= statement in your LIBNAME parameter file. If you specify any HADOOP option within a domain definition, SPD Server automatically assumes the setting HADOOP=YES for that domain.
To specify the location of the required Hadoop JAR files:
  1. Create a directory that is accessible to the SPD Server machine.
  2. From the Hadoop cluster, copy the required JAR files for the particular Hadoop distribution to the directory that you created in Step 1.
  3. Use the SPD Server HADOOPJAR= server parameter file option to specify the path to the Hadoop JAR files.
For example, if the JAR files are copied to the location /u/hadoop/hdist/cdh/cdh52, then the following syntax sets the option appropriately.
HADOOPJAR=/u/hadoop/hdist/cdh/cdh52;
If multiple paths are required, you can use the colon operator : to concatenate pathnames:
HADOOPJAR=/u/hadoop/hdist/cdh/cdh52:/u/myjars/cdh/cdh52;
The Java version of the JAR files in your HADOOPJAR folder is important. The Java version of all Hadoop JAR files must exactly match the Java version that the corresponding Hadoop environment is using. If you have multiple Hadoop servers that are running different Java versions, then you must create a different SPD Server HADOOPJAR= folder for each unique Java version that you use in your Hadoop environments.
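As a sketch, if you maintain two Hadoop environments that run different Java versions, the LIBNAME parameter file could point each Hadoop domain at its own JAR folder. The domain names and paths below are illustrative only:
libname=hdomj6 pathname=/user/spds/hdomj6
  hadoopcfg=/u/hadoop/conf/clusterj6
  hadoopjar=/u/hadoop/jars/java6
  hadoop=yes;
libname=hdomj7 pathname=/user/spds/hdomj7
  hadoopcfg=/u/hadoop/conf/clusterj7
  hadoopjar=/u/hadoop/jars/java7
  hadoop=yes;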
SPD Server 5.2 requires the following JAR files for Hadoop domains that use Cloudera:
  • commons-beanutils-1.7.0.jar
  • commons-cli-1.2.jar
  • commons-codec-1.4.jar
  • commons-collections-3.2.1.jar
  • commons-compress-1.4.1.jar
  • commons-configuration-1.6.jar
  • commons-daemon-1.0.13.jar
  • commons-digester-1.8.jar
  • commons-el-1.0.jar
  • commons-httpclient-3.1.jar
  • commons-lang-2.6.jar
  • commons-logging-1.1.3.jar
  • guava-12.0.1.jar
  • hadoop-auth-2.5.0-cdh5.2.0.jar
  • hadoop-common-2.5.0-cdh5.2.0.jar
  • hadoop-hdfs-2.5.0-cdh5.2.0.jar
  • hadoop-mapreduce-client-app-2.5.0-cdh5.2.0.jar
  • hadoop-mapreduce-client-common-2.5.0-cdh5.2.0.jar
  • hadoop-mapreduce-client-core-2.5.0-cdh5.2.0.jar
  • hadoop-mapreduce-client-jobclient-2.5.0-cdh5.2.0.jar
  • hadoop-mapreduce-client-shuffle-2.5.0-cdh5.2.0.jar
  • hadoop-yarn-api-2.5.0-cdh5.2.0.jar
  • hadoop-yarn-client-2.5.0-cdh5.2.0.jar
  • hadoop-yarn-common-2.5.0-cdh5.2.0.jar
  • hadoop-yarn-server-common-2.5.0-cdh5.2.0.jar
  • httpclient-4.2.5.jar
  • httpcore-4.2.5.jar
  • jackson-core-asl-1.8.8.jar
  • jackson-jaxrs-1.8.8.jar
  • jackson-mapper-asl-1.8.8.jar
  • jackson-xc-1.8.8.jar
  • libfb303-0.9.0.jar
  • log4j-1.2.17.jar
  • protobuf-java-2.5.0.jar
  • slf4j-api-1.7.5.jar
  • slf4j-log4j12.jar
SPD Server 5.2 requires the following JAR files for Hadoop domains that use HDP 2.1:
  • commons-beanutils-1.7.0.jar
  • commons-cli-1.2.jar
  • commons-codec-1.4.jar
  • commons-collections-3.2.1.jar
  • commons-configuration-1.6.jar
  • commons-lang-2.6.jar
  • commons-logging-1.1.3.jar
  • guava-11.0.2.jar
  • hadoop-auth-2.4.0.2.1.7.0-784.jar
  • hadoop-common-2.4.0.2.1.7.0-784.jar
  • hadoop-hdfs-2.4.0.2.1.7.0-784.jar
  • hadoop-mapreduce-client-app-2.4.0.2.1.7.0-784.jar
  • hadoop-mapreduce-client-common-2.4.0.2.1.7.0-784.jar
  • hadoop-mapreduce-client-core-2.4.0.2.1.7.0-784.jar
  • hadoop-mapreduce-client-jobclient-2.4.0.2.1.7.0-784.jar
  • hadoop-mapreduce-client-shuffle-2.4.0.2.1.7.0-784.jar
  • hadoop-yarn-api-2.4.0.2.1.7.0-784.jar
  • hadoop-yarn-client-2.4.0.2.1.7.0-784.jar
  • hadoop-yarn-common-2.4.0.2.1.7.0-784.jar
  • hadoop-yarn-server-common-2.4.0.2.1.7.0-784.jar
  • httpclient-4.2.5.jar
  • httpcore-4.2.5.jar
  • jackson-core-asl-1.8.8.jar
  • jackson-jaxrs-1.8.8.jar
  • jackson-mapper-asl-1.8.8.jar
  • jackson-xc-1.8.8.jar
  • jline-0.9.94.jar
  • libfb303-0.9.0.jar
  • log4j-1.2.17.jar
  • protobuf-java-2.5.0.jar
  • slf4j-api-1.7.5.jar
  • slf4j-log4j12-1.7.5.jar

Make Required Hadoop Cluster Configuration Files Available to SPD Server

Before you can connect SPD Server to a Hadoop server, you must first make the Hadoop cluster configuration files available to the SPD Server machine. To make the Hadoop cluster configuration files available, you must define the HADOOPCFG= option in your SPD Server parameter file, and then specify HADOOP=YES for each Hadoop domain definition.
If you specify any Hadoop option (such as HADOOPCFG= or HADOOPJAR=) as part of a domain definition, SPD Server automatically assumes a default option setting of HADOOP=YES.
To specify the location of your Hadoop configuration files:
  1. Create a directory that is accessible to the SPD Server machine.
  2. From the specific Hadoop cluster, copy the following configuration files to the directory that you created in step 1:
    • core-site.xml
    • hdfs-site.xml
    • mapred-site.xml
    • yarn-site.xml
  3. Use the HADOOPCFG= server parameter file option to specify the path to the directory containing the Hadoop configuration files.
For example, if the cluster configuration files are copied to the location /u/hadoop/hdist/cdh/confdir, then the following syntax sets the option appropriately:
HADOOPCFG=/u/hadoop/hdist/cdh/confdir;

Kerberos Security

To access Hadoop clusters that are secured with Kerberos, you must provide HADOOPREALM= and HADOOPKEYTAB= option values in the SPD Server parameter file. Specifying these options in the server parameter file means that any Hadoop domain that is defined in the LIBNAME parameter file is treated as a domain that is secured with Kerberos. If you have a Hadoop LIBNAME domain definition that you do not want to be treated as a domain that is secured with Kerberos, you can include the option HADOOPKERBEROS=NO as part of the Hadoop LIBNAME domain definition in the parameter file.
Alternatively, you can use the HADOOPREALM= and/or HADOOPKEYTAB= options on individual Hadoop domain definitions in the LIBNAME parameter file to specify that you want those domains to be treated as domains that are secured with Kerberos.
For example, if your Kerberos realm name is FOO.ORG, and your keytab file is in /<keytab-path>/userid.keytab, then the following LIBNAME parameter file syntax sets the HADOOPREALM= and HADOOPKEYTAB= options appropriately:
HADOOPREALM=FOO.ORG;
HADOOPKEYTAB=/<keytab-path>/userid.keytab;
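As a sketch of the domain-level override, with HADOOPREALM= and HADOOPKEYTAB= set in the server parameter file, a Hadoop domain that should not be treated as secured with Kerberos can be declared in the LIBNAME parameter file with HADOOPKERBEROS=NO. The domain name and path below are illustrative:
libname=opendom pathname=/user/userlname/opendom
  hadoop=yes
  hadoopkerberos=no;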

SPD Server ACLs for Hadoop Domains

SPD Server ACLs for SPD Server resources are typically created in the root pathname of each SPD Server domain. Storing SPD Server ACLs in this location does not work for Hadoop domains, for the following reasons:
  • SPD Server ACLs are small, and are updated frequently. Updating data in the Hadoop Distributed File System (HDFS) is very slow. This would represent a significant performance degradation if the SPD Server ACLs were stored in HDFS.
  • HDFS does not support the type of locking required for SPD Server ACL processing.
Because of the above restrictions, SPD Server ACLs for Hadoop domains must be stored in the local file system. The SPD Server parameter file option, HADOOPACLPATH=, specifies a local file system directory where ACLs for the Hadoop environment will be created and stored.
If the HADOOPACLPATH= option is not defined, the default location for SPD Server Hadoop ACLs is the location that is specified by the ACLDIR= SPD Server start-up option in the rc.spds file. If you define an SPD Server LIBNAME domain that contains a HADOOP=YES setting, SPD Server creates a directory using the following schema:
<HADOOPACLPATH>/HADOOPACLS/<domain_name>.
The <HADOOPACLPATH> value is either the HADOOPACLPATH= parameter file option value or the ACLDIR= start-up option value in the rc.spds file. The specified location contains the ACLs for the declared libref.
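For example, assuming the illustrative server parameter file setting below and a Hadoop domain named MYDOMAIN, SPD Server stores the ACLs for that domain in the local directory shown after the setting:
HADOOPACLPATH=/u/mylocal/acls;
/u/mylocal/acls/HADOOPACLS/MYDOMAIN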

SPD Server User Password File

By default, the SPD Server user password file is located in the InstallDir/site directory. If you do not keep the file in this location, you need to specify its location by using the ACLDIR= start-up option (a minimal sketch follows the list below). This location must be on the local file system, not HDFS. This restriction exists for the following reasons:
  • SPD Server password files are small, and are updated frequently. Updating data in HDFS is very slow. This would represent a significant performance degradation if the SPD Server password files were stored in HDFS.
  • HDFS does not support the type of locking required for SPD Server password file processing.
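A minimal sketch of the start-up option as it might appear in the spdsserv invocation in the rc.spds script. The path is illustrative, and the remaining start-up options at your site will differ:
spdsserv -acldir /opt/spds/site other-start-up-options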

Validating the SPD Server to Hadoop Connection

To validate that your SPD Server is correctly configured to access Hadoop files:
  • Ensure that you specified values for HADOOPCFG= and HADOOPJAR= option settings in either your SPD Server parameter file, or within specific LIBNAME definitions in your LIBNAME parameter file.
  • If you want one or more of your Hadoop domains to access Hadoop files that are secured with Kerberos, make sure that you specified values for HADOOPREALM= and HADOOPKEYTAB= option settings for the domains in the SPD Server parameter file (or within specific LIBNAME definitions in the LIBNAME parameter file.)
  • Add MYDOMAIN to your LIBNAME parameter file as a HADOOP=YES libref with a path located in your Hadoop file system.
  • After you start your SPD Server session, check the SPD Server log file to verify that your Hadoop domains were added to the SPD Server Name Server. Hadoop option information is logged when each Hadoop domain is added to the SPD Server Name Server. The SPD Server log entry should resemble the following:
    Libname MYDOMAIN added to Name Server (HADOOP=yes)
    HADOOPCFG=/u/hadoop/hdist/cdh/confdir;
    HADOOPJAR=/u/hadoop/hdist/cdh/cdh52;
    HADOOPACLPATH=/u/mylocal/acls/HADOOPACLS/MYDOMAIN;
  • Submit a LIBNAME assignment to the Hadoop SPD Server domain that resembles the following:
    LIBNAME foo sasspds 'mydomain' server=myhost.5400 user="anonymous";
    If the LIBNAME assignment is successful, then your Hadoop connection has been made.

Example of Using a LIBNAME Statement to Validate a Hadoop Environment

The example code below shows a LIBNAME statement, a DATA step, and several PROC steps that validate whether your Hadoop environment is set up correctly.
LIBNAME foo sasspds 'mydomain' server=myhost.5400 user="anonymous";

NOTE: This is a SPD 5.2  Engine
      executing SAS (r) 9.4 (TS1M2) on the Linux platform.
NOTE: User anonymous(ACL Group ) connected to SPD(LAX) 5.2 server at
      10.24.7.79.
NOTE: Libref FOO was successfully assigned as follows:
      Engine:        SASSPDS
      Physical Name: :23107/my_Hadoop_domain_path/

data foo.atable; 
x=1;
run;

NOTE: The data set FOO.ATABLE has 1 observations and 1 variables. 
NOTE: DATA statement used (Total process time):
      real time		3.69 seconds
      cpu time    0.06 seconds


proc datasets lib=foo; run;

                                  Directory

    Libref               foo
    Engine               SASSPDS
    Physical Name        :23107/my_Hadoop_domain_path/
    Local Host Name      myhost
    Local Host IP addr   10.24.7.79
    Server Hostname      N/A
    Server IP addr       10.24.7.79
    Server Portno        55120
    Free Space (Kbytes)  9.0071993E15
    Metapath             '/my_Hadoop_domain_path/'
    Indexpath            '/my_Hadoop_domain_path/'
    Datapath             '/my_Hadoop_domain_path/'


                                          Member
                              #  Name     Type

                              1  ATABLE   DATA
                    
NOTE: PROCEDURE DATASETS used (Total process time):
      real time           2:02.19
      cpu time            0.16 seconds


  PROC CONTENTS data=foo.atable; run;

  PROC PRINT data=foo.atable; run;


                            The CONTENTS Procedure

    Data Set Name        FOO.ATABLE               Observations          1
    Member Type          DATA                     Variables             1
    Engine               SASSPDS                  Indexes               0
    Created              01/28/2015 11:01:58      Observation Length    8
    Last Modified        01/28/2015 11:01:58      Deleted Observations  0
    Protection                                    Compressed            NO
    Data Set Type                                 Sorted                NO
    Label
    Data Representation  Default
    Encoding             latin1  Western (ISO)


                      Engine/Host Dependent Information

                    Blocking Factor (obs/block)  131072
                    ACL Entry                    NO


                            The CONTENTS Procedure

                      Engine/Host Dependent Information

                    ACL User Access(R,W,A,C)     (Y,Y,Y,Y)
                    ACL UserName                 ANONYMOU
                    ACL OwnerName                ANONYMOU
                    Data set is Ranged           NO
                    Data set is a Cluster        NO


                  Alphabetic List of Variables and Attributes

                         #    Variable    Type    Len

                         1    x           Num       8

NOTE: PROCEDURE CONTENTS used (Total process time):
      real time           0.50 seconds
      cpu time            0.04 seconds

SPD Server Hadoop LIBNAME Parameter and Server Parameter File Options

You can specify the Hadoop options in either the LIBNAME parameter file (as part of an individual domain definition) or in the SPD Server parameter file. Hadoop options that you specify in the LIBNAME parameter file take precedence over the Hadoop options that you specify in the SPD Server parameter file. This allows the SPD Server administrator to define common Hadoop options in the SPD Server parameter file, and to use LIBNAME parameter file options to override server parameter file settings when required.
The required options to access Hadoop are HADOOPCFG= and HADOOPJAR= in either parameter file, and the HADOOP=YES option in the LIBNAME parameter file.
When you specify HADOOPCFG= and HADOOPJAR= option settings in the SPD Server parameter file, these settings become the default settings for any domain that specifies HADOOP=YES.
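As a sketch of this precedence, the server parameter file below supplies global settings, and the second LIBNAME domain overrides only the JAR path while inheriting the configuration path. The domain names and the override path are illustrative:
In the SPD Server parameter file:
HADOOPCFG=/u/hadoop/hdist/cdh/confdir;
HADOOPJAR=/u/hadoop/hdist/cdh/cdh52;
In the LIBNAME parameter file:
libname=dom1 pathname=/user/userlname/dom1 hadoop=yes;
libname=dom2 pathname=/user/userlname/dom2
  hadoopjar=/u/myjars/cdh/cdh52
  hadoop=yes;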

HADOOP= YES|NO

specifies whether a domain can access data in the Hadoop file system.

YES

specifies that a domain can access data in the Hadoop file system.

NO

specifies that a domain cannot access data in a Hadoop file system.

Valid in LIBNAME parameter file
Default NO

HADOOPACCELJVER= Java-version

specifies the Java version used in the Hadoop environment when performing SPD Server WHERE processing optimization with MapReduce.

Valid in SPD Server parameter file
Interactions The HADOOPACCELJVER= option affects only domains that specify HADOOP=YES or domains that specify any other SPD Server HADOOP* option.
SPD Server submits the Java class to the Hadoop cluster as a component in a MapReduce program. By requesting that data subsetting be performed in the Hadoop cluster, performance might be improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SPD Server client. Performance is often improved with large data sets when the WHERE expression qualifies only a relatively small subset. The default value is 1.6.
See SPD Server Hadoop WHERE Processing Optimization Data Set and SAS Code Requirements for more information about SPD Server WHERE optimization with MapReduce.
Example
HADOOPACCELJVER=1.7;

HADOOPACCELWH= YES|NO

specifies whether to perform SPD Server WHERE processing optimization with MapReduce by performing data subsetting in the Hadoop cluster.

Valid in SPD Server parameter file
Interactions The HADOOPACCELWH= option affects only domains that specify HADOOP=YES or domains that specify any other SPD Server HADOOP* option.
By requesting that data subsetting be performed in the Hadoop cluster, performance might be improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SPD Server client. Performance is often improved with large data sets when the WHERE expression qualifies only a relatively small subset.
See SPD Server Hadoop WHERE Processing Optimization Data Set and SAS Code Requirements for more information about SPD Server WHERE optimization with MapReduce.
Example
HADOOPACCELWH=YES;

HADOOPACLPATH= local-file-system-directory-for-Hadoop-ACL-files

specifies the local file system directory to store Hadoop ACL files.

Valid in SPD Server parameter file
Default The ACL files are created in the location identified in the -acldir spdsserv process start-up parameter. This is the same location as the psmgr database.
Interaction If you specify a value for the -hadoopaclpath option with the SPDSCLEAN utility, that value overrides any other HADOOPACLPATH= settings that are specified in the SPD Server parameter file.

HADOOPCFG=local-path-to-Hadoop-configuration-file-directory

specifies the local path to the Hadoop configuration files.

Valid in LIBNAME parameter file
SPD Server parameter file
Interactions When you specify this option in the SPD Server parameter file, this option affects only domains that specify HADOOP=YES or domains that have any other SPD Server HADOOP* option specified.
If you specify any SPD Server HADOOP* domain option, SPD Server assumes that the HADOOP=YES option is set for that domain.
Example HADOOPCFG=/u/hadoop/hdist/cdh/confdir;

HADOOPJAR=local-path-to-Hadoop-JAR-file-directory

specifies the local path to the Hadoop JAR files.

Valid in LIBNAME parameter file
SPD Server parameter file
Interactions When you specify this option in the SPD Server parameter file, this option affects only domains that specify HADOOP=YES or domains that have any other SPD Server HADOOP* option specified.
If you specify any SPD Server HADOOP* domain option, SPD Server assumes that the HADOOP=YES option is set for that domain.
Example HADOOPJAR=/u/hadoop/hdist/cdh/cdh52;

HADOOPKERBEROS= YES|NO

specifies whether a Hadoop cluster requires authentication using Kerberos.

YES

specifies that a Hadoop cluster requires authentication using Kerberos.

NO

specifies that a Hadoop cluster does not require authentication using Kerberos.

Valid in LIBNAME parameter file
Interactions To access a Hadoop cluster that is secured with Kerberos, the HADOOPREALM= and the HADOOPKEYTAB= options must be specified in either the LIBNAME parameter file or the SPD Server parameter file.
If either the HADOOPREALM= or the HADOOPKEYTAB= option is specified in the SPD Server parameter file, then a domain that has HADOOP=YES specified is treated as secured with Kerberos unless HADOOPKERBEROS=NO is specified for that domain.

HADOOPKEYTAB=path-to-Kerberos-keytab-file

specifies the path to a Kerberos keytab file when accessing a Hadoop cluster that is secured with Kerberos. If you specify any SPD Server HADOOP* domain option, then SPD Server assumes that HADOOP=YES is set for that domain.

Valid in LIBNAME parameter file
SPD Server parameter file
Interactions When you specify this option in the SPD Server parameter file, this option affects only domains that specify HADOOP=YES or domains that have any other SPD Server HADOOP* option specified.
If you specify any SPD Server HADOOP* domain option, SPD Server assumes that the HADOOP=YES option is set for that domain.

HADOOPREALM=Kerberos-realm

specifies the Kerberos realm to use to access a Hadoop cluster that is secured with Kerberos.

Valid in LIBNAME parameter file
SPD Server parameter file
Interactions When you specify this option in the SPD Server parameter file, this option affects only domains that specify HADOOP=YES or domains that have any other SPD Server HADOOP* option specified.
If you specify any SPD Server HADOOP* domain option, SPD Server assumes that the HADOOP=YES option is set for that domain.

HADOOPWORKPATH=path-to-temporary-MapReduce-output

specifies the path to the directory in the Hadoop Distributed File System that stores the temporary results of the MapReduce output.

Valid in SPD Server parameter file
Default /tmp
Interaction This option affects only domains that specify HADOOP=YES or domains that specify any other SPD Server HADOOP* option.
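Example (the HDFS path shown is illustrative; the directory that you specify must already exist in the Hadoop file system)
HADOOPWORKPATH=/user/spds/mrtemp;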

Using SPDSCLEAN with Hadoop

You can use the SPD Server SPDSCLEAN utility to clean Hadoop domains. There are a few considerations when you use SPDSCLEAN on Hadoop domains. You must first provide the Hadoop configuration and JAR resource information for the target domain. The spdsclean utility is located in the site directory of your SPD Server installation location.

Hadoop SPDSCLEAN Configuration Information

To use the SPDSCLEAN functionality within a Hadoop domain, you must provide the Hadoop configuration path and the Hadoop JAR information, in the same way that you must configure SPD Server to access Hadoop environments. SPD Server provides several methods that you can use to invoke the SPDSCLEAN function within a Hadoop domain.
Use the -hadoopaclpath option of the SPDSCLEAN command to specify the location of the Hadoop ACL files that you want to clean. If you specify a value for the -hadoopaclpath option, that value overrides any HADOOPACLPATH= setting that is declared in the SPD Server parameter file. SPDSCLEAN needs the Hadoop ACL path in order to access and clean the SPD Server ACL files in the specified Hadoop domain. You must also specify the SPDSCLEAN -acl or -all option in order for SPDSCLEAN to clean up any SPD Server ACL files.
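A sketch of such an invocation, combining the ACL path with the Hadoop configuration and JAR options described in the following topics; the paths are illustrative:
spdsclean -hadoopcfg /u/fedadmin/hadoopcfg/cdh52p1
          -hadoopjar /u/fedadmin/hadoopjars/cdh52
          -hadoopaclpath /u/mylocal/acls
          -acl other-options;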

Configure Hadoop SPDSCLEAN via Environment Variables

You can use environment variables to specify the Hadoop CONFIG_PATH and JAR_PATH configuration parameters for SPDSCLEAN. You can specify values for the parameters at the UNIX command prompt before submitting the spdsclean script, or you can incorporate the export statements within the spdsclean script itself.
export SAS_HADOOP_CONFIG_PATH=/u/fedadmin/hadoopcfg/cdh52p1
export SAS_HADOOP_JAR_PATH=/u/fedadmin/hadoopjars/cdh52

Configure Hadoop SPDSCLEAN via SPDSCLEAN Options

You can modify your spdsclean script to use the -hadoopcfg and -hadoopjar option settings to specify the values for your Hadoop config path and JAR path.
spdsclean -hadoopcfg /u/fedadmin/hadoopcfg/cdh52p1
          -hadoopjar /u/fedadmin/hadoopjars/cdh52
          other-options;

Configure Hadoop for SPDSCLEAN via -libnamefile and -parmfile Options in spdsclean Script

You can modify your spdsclean script to reference Hadoop configuration information that is defined in your LIBNAME parameter file, your SPD Server parameter file, or both.
Typical content for the spdsclean script:
spdsclean -libnamefile libnames.parm -parmfile spdsserv.parm
Typical content for the LIBNAME parameter file:
libname=Stuff1
 pathname=/user/userlname
 hadoopcfg=/u/fedadmin/hadoopcfg/cdh52p1
 hadoopjar=/u/fedadmin/hadoopjars/cdh52
 hadoop=yes;
Note: It is not necessary to specify Hadoop configuration information in both the LIBNAME parameter file and the SPD Server parameter file. You can choose either method. Examples of both methods are provided for convenience.

Specify Hadoop SPDSCLEAN Configuration Information for Domains Secured with Kerberos

To use the SPDSCLEAN utility on Hadoop domains that are secured with Kerberos, submit the following kinit command before invoking SPDSCLEAN:
$ kinit -kt <full-path-to-keytab-file> <UserID>

SPD Server Hadoop Configuration Examples

The following sections document different methods that you can use to configure SPD Server to access Hadoop clusters.

Configure SPD Server for Hadoop Access via LIBNAME Domain Statements

One way to set up access to a Hadoop cluster (without Kerberos) is to specify a domain in the SPD Server LIBNAME parameter file, and then add the HADOOPCFG= and HADOOPJAR= options.
In the libnames.parm file:
libname=foo 
  pathname=/user/userlname/mydomain 
  hadoopcfg=/u/hadoop/hdist/cdh/confdir/hdp20p1
  hadoopjar=/u/hadoop/hdist/cdh/sas_cdh20u;

Configure SPD Server for Hadoop Access via spdsserv.parm Global Options

Another simple way to access a Hadoop cluster (without Kerberos) is to specify the HADOOPCFG= and HADOOPJAR= options in your SPD Server parameter file. Then, you can specify multiple Hadoop domains in your LIBNAME parameter file. Each of those domains gets its Hadoop configuration and JAR file information from the global options that are specified in the SPD Server parameter file.
In the SPD Server parameter file:
HADOOPCFG=/u/hadoop/hdist/cdh/confdir/hdp20p1;
HADOOPJAR=/u/hadoop/hdist/cdh/sas_cdh20u;
  
In the LIBNAME parameter file, you need to specify only the Hadoop libname and pathname, and HADOOP=YES.
libname=Stuff1 pathname=/user/userlname/mydomain1 hadoop=yes;
libname=Stuff2 pathname=/user/userlname/mydomain2 hadoop=yes;
To use the Hadoop domains defined in the LIBNAME parameter file, specify them in the same way as any other SPD Server domain in client code:
libname Stuff1 sasspds 'Stuff1' server=lax94d01.14526 user="anonymous";

Configure SPD Server for Hadoop WHERE Processing Optimization with MapReduce

SPD Server WHERE processing enables you to conditionally select a subset of observations, so the software processes only the observations that meet specified conditions. To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster. Then, when you submit SPD Server code that includes a WHERE expression (that defines the condition that selected observations must satisfy), SPD Server instantiates the WHERE expression as a Java class. The software submits the Java class to the Hadoop cluster as a component in a MapReduce program. By requesting that data subsetting be performed in the Hadoop cluster, performance might be improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SPD Server client. Performance is often improved with large data sets when the WHERE expression qualifies only a relatively small subset.
The action of pushing WHERE-based data subsetting to the Hadoop cluster is sometimes called a WHERE pushdown.
If you want additional details about an SPD Server MapReduce job, you can include the macro SPDSWEB=YES in your code to determine whether the optimization occurred.
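The following sketch shows the kind of request that is eligible for pushdown. It assumes the Hadoop domain named mydomain from the earlier validation example and a hypothetical large table BIGTABLE; the WHERE expression uses only simple comparisons, so the subsetting can be performed by MapReduce in the cluster:
libname foo sasspds 'mydomain' server=myhost.5400 user="anonymous";

data work.subset;
   set foo.bigtable;                           /* large table stored in the Hadoop domain */
   where amount > 1000 and region = 'WEST';    /* simple comparisons only */
run;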

SPD Server Hadoop WHERE Pushdown Server Parameter Options

By default, data subsetting is performed by SPD Server on the SPD Server host. To request that data subsetting be performed in the Hadoop cluster, you must specify the following in your SPD Server parameter file:
  • Specify HADOOPACCELWH=YES to enable the Java MapReduce data subsetting. If not specified, HADOOPACCELWH= defaults to NO. Users can override an unspecified HADOOPACCELWH= setting via the SPDSACWH macro variable or the ACCELWHERE= table option.
  • Use the SPD Server parameter file option HADOOPACCELJVER=Java-version to specify the Java version running on the HDFS cluster. If not specified, the Java version defaults to 1.6.
  • Use the SPD Server parameter file option HADOOPWORKPATH=Hadoop-file-system-path to specify the path to the directory that will contain the temporary results of the MapReduce output. If HADOOPWORKPATH= is not specified, the files are written to /tmp by default. The path that you specify must already exist in the Hadoop file system. If you specify a HADOOPWORKPATH= directory that does not exist, the MapReduce job fails. (A combined sketch of these server parameter file settings follows this list.)
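A sketch of the server parameter file settings described above; the work path is illustrative and must already exist in HDFS:
HADOOPACCELWH=YES;
HADOOPACCELJVER=1.7;
HADOOPWORKPATH=/user/spds/mrtemp;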

SPD Server Hadoop WHERE Processing Optimization Overrides

SPD Server users might want to experiment with SPD Server WHERE processing optimization when the SPD Server site administrator has not yet specified a setting for the HADOOPACCELWH= server parameter file option. SPD Server provides user override settings so that users can try WHERE processing optimization in select cases when the feature is not globally enabled.
  • If HADOOPACCELWH= is set to NO in the server parameter file by the SPD Server administrator, no WHERE processing optimization is allowed. User overrides are not available.
  • If HADOOPACCELWH= is set to YES in the server parameter file by the SPD Server administrator, SPD Server users can override and disable WHERE processing optimization by setting the SPDSACWH macro variable to NO or by issuing the ACCELWHERE=NO table option.
  • If no HADOOPACCELWH= setting is defined in your installation’s server parameter file, WHERE processing optimization is disabled by default. However, SPD Server users can override and enable WHERE processing optimization by setting the SPDSACWH macro variable to YES or by issuing the ACCELWHERE=YES table option in a statement (see the sketch below).
For more information about the SPDSACWH= macro variable, see Variables to Enhance Performance in SAS Scalable Performance Data Server: User's Guide. For more detailed information about the ACCELWHERE= table option, see Options to Enhance Performance in SAS Scalable Performance Data Server: User's Guide.
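A sketch of the user-level override, assuming HADOOPACCELWH= is not set in the server parameter file and reusing the hypothetical table from the earlier sketch:
%let SPDSACWH=YES;                        /* session-level request for WHERE pushdown */

data work.subset;
   set foo.bigtable;
   where amount > 1000;
run;

/* Alternatively, request pushdown for a single table with the ACCELWHERE= table option: */
data work.subset2;
   set foo.bigtable (accelwhere=yes);
   where amount > 1000;
run;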

SPD Server Hadoop WHERE Processing Optimization Data Set and SAS Code Requirements

To perform the data subsetting in the Hadoop cluster for WHERE processing optimization, the following data set and SAS code requirements must be met. If any of these requirements are not met, data subsetting is not performed by MapReduce in the Hadoop cluster.
  • The data set cannot be encrypted.
  • The data set cannot be compressed.
  • The data set must be larger than the HDFS block size.
  • The submitted SAS code cannot request BY-group processing.
  • The submitted SAS code cannot include the STARTOBS= or ENDOBS= options.
  • The data set must not be an SPD Server cluster.
To perform data subsetting in the Hadoop cluster for WHERE processing optimization, the submitted WHERE expression cannot include any of the following syntax (the example after this list contrasts an eligible expression with an ineligible one):
  • a variable as an operand, such as WHERE lastname;
  • variable-to-variable comparison
  • SAS functions such as SUBSTR, TODAY, UPCASE, and PUT
  • arithmetic operators *, /, +, -, and **
  • IS NULL or IS MISSING and IS NOT NULL or IS NOT MISSING operators
  • concatenation operators, such as || (two OR symbols) or !! (two exclamation marks)
  • negative prefix operator, such as WHERE z = -(x+y);
  • pattern matching operators LIKE and CONTAINS
  • sounds-like operator SOUNDEX (=*)
  • truncated comparison operator using the colon modifier, such as WHERE lastname=: 'S';
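For example, assuming a DATA step that reads a table in a Hadoop domain, the first WHERE statement below is eligible for pushdown, and the second is not because it calls the SUBSTR function:
where region = 'WEST' and amount > 1000;    /* eligible: simple comparisons only */
where substr(lastname,1,1) = 'S';           /* not eligible: subsetting runs on the SPD Server host */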

SPD Server Hadoop Restrictions

Pass-Through LOAD and COPY Commands

Some SPD Server commands require local direct access to the source and destination tables from the machine that the server is running on. These commands do not work if the tables are in a Hadoop environment. The Hadoop environment currently does not support the direct open method for tables.
As a result, the SPD Server pass-through SQL commands for COPY and LOAD are not available for tables that reside in a Hadoop domain.

Implicit Pass-Through and Hadoop Tables

A Hadoop table cannot be replaced by using implicit pass-through.
An implicit pass-through of a CREATE TABLE AS SELECT statement to SPD Server does not support creating a Hadoop table that already exists. A best practice is to drop the table before you create a table with the same name. If the table already exists, SPD Server fails the query with a Member Lock error. PROC SQL then plans and executes the CREATE TABLE AS SELECT query itself: the selected rows are read from SPD Server into SAS and are then returned to SPD Server to create the table.
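A sketch of the recommended drop-then-create pattern, assuming a Hadoop domain libref FOO and a hypothetical source table BIGTABLE:
proc sql;
   drop table foo.atable;        /* remove the existing Hadoop table first */
   create table foo.atable as
      select * from foo.bigtable where amount > 1000;
quit;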

SPD Server User-Defined Formats

SPD Server does not support user-defined formats in Hadoop. User-defined formats require a record-level locking proxy, but the Hadoop file system locking mechanisms do not support this operation. As a result, the SPD Server formats domain does not support the HADOOP=YES option.

Librefs with BACKUP=YES and DYNLOCK=YES Options

SPD Server does not support the BACKUP=YES or DYNLOCK=YES LIBNAME option settings for domains in a Hadoop environment. The Hadoop file system locking mechanisms do not support these operations. If either option is combined with HADOOP=YES in a domain definition, the server does not start, and an error is displayed in the SAS log.

Librefs with LOCKING=YES Options

SPD Server does not support the LOCKING=YES LIBNAME option for a HADOOP=YES domain.