SAS Data Loader for Hadoop is a software offering that
includes SAS Data Loader, SAS/ACCESS Interface to Hadoop, SAS In-Database
Code Accelerator for Hadoop, and SAS Data Quality Accelerator for Hadoop.
The following diagram illustrates an installed configuration.
SAS Data Loader for
Hadoop runs inside a virtual machine, or vApp. The vApp is a complete
and isolated operating environment that is accessed through a web
browser. Each instance of SAS Data Loader for Hadoop is accessed by
a single user. The vApp is started and stopped by a hypervisor application
such as VMware Player Pro. SAS Data Loader for Hadoop uses SAS software
in the vApp and on the Hadoop cluster to manage data within Hadoop.
The hypervisor provides
a web (HTTP) address that you enter into a web browser. The web address
opens the SAS Data Loader: Information Center. The Information Center
does the following:
- starts SAS Data Loader in a new browser tab.
- provides a Settings window to configure the vApp connection to Hadoop.
- checks for available vApp software updates and installs them.
All of the
files that are accessed by the vApp reside in the shared folder. The
shared folder is the only location on the user host that is accessed
by the vApp. The shared folder contains the JDBC drivers needed to
connect to external databases, and the Hadoop JAR files that were
copied to the client from the Hadoop cluster.
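As an illustration of what the shared folder holds, the following minimal Python sketch lists the JAR files found under a shared folder, grouped by subfolder. It is not part of SAS Data Loader for Hadoop. The function name `inventory_shared_folder` and any subfolder names that you pass to it are hypothetical, and the actual layout of the shared folder might differ in your installation.

```python
from pathlib import Path


def inventory_shared_folder(shared_folder: Path) -> dict[str, list[str]]:
    """Group the JAR files under a shared folder by their parent subfolder.

    Hypothetical helper for illustration only; the real shared folder
    layout is defined by your vApp installation.
    """
    inventory: dict[str, list[str]] = {}
    # rglob walks the folder recursively; sorting keeps the output stable.
    for jar in sorted(shared_folder.rglob("*.jar")):
        # The subfolder path relative to the shared folder ("." for the top level).
        group = jar.parent.relative_to(shared_folder).as_posix()
        inventory.setdefault(group, []).append(jar.name)
    return inventory
```

A listing of this kind can help you confirm that the JDBC drivers and the Hadoop JAR files are present in the shared folder before you start the vApp.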
When you create a job
using a directive, the web application generates code that is then
sent to the Hadoop cluster for execution. When the job is complete,
the Hadoop cluster writes data to the target file and delivers log
and status information to the vApp. Saved directives are stored in
a database within the vApp.
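The job flow described above can be modeled with a short Python sketch. This is a conceptual toy, not the SAS Data Loader API: the names `Directive`, `JobResult`, `run_job`, and `fake_cluster` are all hypothetical, the generated "code" is a placeholder string, and the cluster is simulated by a local function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobResult:
    """Log and status information returned to the vApp (hypothetical shape)."""
    status: str
    log: list[str]


@dataclass
class Directive:
    """A hypothetical stand-in for a directive and its target."""
    name: str
    target: str

    def generate_code(self) -> str:
        # Placeholder text only; real directives generate code
        # whose form is not shown in this document.
        return f"-- generated by directive {self.name!r}; write to {self.target}"


def run_job(directive: Directive, submit: Callable[[str], JobResult]) -> JobResult:
    """Generate code in the web application, then send it to the cluster."""
    code = directive.generate_code()  # step 1: code is generated in the vApp
    return submit(code)               # step 2: the cluster executes it and reports back


def fake_cluster(code: str) -> JobResult:
    """Simulated cluster: accepts code and returns status and log information."""
    return JobResult(status="completed", log=[f"received {len(code)} characters of code"])
```

For example, `run_job(Directive("copy_table", "/tmp/out"), fake_cluster)` returns a `JobResult` whose `status` is `"completed"`. The point of the sketch is the separation of roles: code generation happens in the vApp, execution happens on the cluster, and only log and status information travels back.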
The SAS In-Database
Technologies for Hadoop software is deployed to each node in the Hadoop
cluster. The in-database technologies consist of the following components:
- SAS Quality Knowledge Base, which provides data cleansing definitions for reference.
- SAS Embedded Process software, which provides code acceleration.
- SAS Data Quality Accelerator software, which supports SAS DS2 methods that pertain to data cleansing.
- SAS Data Management Accelerator for Spark, which enables you to execute data integration and data quality tasks in Apache Spark on a Hadoop cluster.