This section describes
how to install and configure the in-database deployment package for
Hadoop (SAS Embedded Process).
The in-database deployment
package for Hadoop must be installed and configured before you can
perform the following tasks:
- Run a scoring model in Hadoop Distributed File System (HDFS) using
  the SAS Scoring Accelerator for Hadoop.
- Run DATA step scoring programs in Hadoop.
- Run DS2 threaded programs in Hadoop using the SAS In-Database Code
  Accelerator for Hadoop.
- Read and write data to HDFS in parallel for SAS High-Performance
  Analytics.
  Note: For deployments that use SAS High-Performance Deployment of
  Hadoop for the co-located data provider, and access SASHDAT tables
  exclusively, SAS/ACCESS and SAS Embedded Process are not needed.
  Note: If you are installing the SAS High-Performance Analytics
  environment, you must perform additional steps after you install the
  SAS Embedded Process. For more information, see SAS High-Performance
  Analytics Infrastructure: Installation and Configuration Guide.
- Transform data in Hadoop and extract transformed data out of Hadoop
  for analysis in SAS with the SAS Data Loader for Hadoop. For more
  information, see SAS Data Loader for Hadoop: User’s Guide.
- Perform data quality operations in Hadoop using the SAS Data Loader
  for Hadoop. For more information, see SAS Data Loader for Hadoop:
  User’s Guide.
The in-database deployment
package for Hadoop includes the SAS Embedded Process and two SAS Hadoop
MapReduce JAR files. The SAS Embedded Process is a SAS server process
that runs within Hadoop to read and write data. The SAS Embedded Process
contains macros, run-time libraries, and other software that is installed
on your Hadoop system.
The SAS Embedded Process
must be installed on all nodes capable of executing either MapReduce
1 or MapReduce 2 tasks. For MapReduce 1, these are the nodes where
a TaskTracker is running. For MapReduce 2, these are the nodes where
a NodeManager is running. Usually, every DataNode node has a YARN
NodeManager or a MapReduce 1 TaskTracker running. By default, the
SAS Embedded Process install script (sasep-servers.sh) discovers the
cluster topology and installs the SAS Embedded Process on all DataNode
nodes, including the host node from which you run the script (the
Hadoop master NameNode). The host node is included even if no DataNode
is present on it. To limit the nodes on which the SAS Embedded Process
is installed, run the sasep-servers.sh script with the -host <hosts>
option. The SAS Hadoop MapReduce JAR files must be installed on all
nodes of a Hadoop cluster.
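
The node-selection rule described above can be illustrated with a
short Python sketch. It is not part of the SAS deployment package. The
inventory dictionary, the host names, and both helper functions are
hypothetical. The sketch also assumes that the -host option takes the
hosts as a single space-separated argument, which this section does not
state, and it leaves out any other options that sasep-servers.sh might
require.

```python
from typing import Dict, List, Set

# Services that execute MapReduce tasks, per the paragraph above:
# TaskTracker for MapReduce 1, NodeManager for MapReduce 2.
TASK_SERVICES: Set[str] = {"TaskTracker", "NodeManager"}


def nodes_needing_embedded_process(inventory: Dict[str, Set[str]]) -> List[str]:
    """Return, in sorted order, the hosts that run a task-executing service."""
    return sorted(
        host for host, services in inventory.items()
        if services & TASK_SERVICES
    )


def build_install_command(hosts: List[str]) -> List[str]:
    """Build an argument list for sasep-servers.sh restricted to `hosts`.

    Assumption: the host list is passed as one space-separated argument.
    Other options the script may require are deliberately left out.
    """
    if not hosts:
        raise ValueError("no hosts selected")
    return ["./sasep-servers.sh", "-host", " ".join(hosts)]


# Hypothetical inventory: host name -> services running on that host.
inventory = {
    "master01": {"NameNode", "ResourceManager"},
    "worker01": {"DataNode", "NodeManager"},
    "worker02": {"DataNode", "NodeManager"},
    "edge01": {"Gateway"},
}

selected = nodes_needing_embedded_process(inventory)
command = build_install_command(selected)
print(command)
```

The sketch only builds the argument list and does not run the script.
Check the option syntax against the sasep-servers.sh documentation for
your release before you use a command like this on a cluster.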