Hadoop is a general-purpose
data storage and computing platform that includes database-like tools,
such as Hive and HiveServer2. SAS/ACCESS Interface to Hadoop lets
you work with your data using SQL constructs through Hive and HiveServer2.
It also lets you access data directly from the underlying data storage
layer, the Hadoop Distributed File System (HDFS). This differs from
the traditional SAS/ACCESS engine behavior, which exclusively uses
database SQL to read and write data.
Cloudera Impala is an
open-source, massively parallel processing (MPP) query engine that
runs natively on Apache Hadoop. You can use it to issue SQL queries
to data stored in HDFS and Apache HBase without moving or transforming
data. Similar to other SAS/ACCESS engines, SAS/ACCESS Interface to
Impala lets you run SAS procedures against data that is stored in
Impala and returns the results to SAS.
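A minimal sketch of this pattern, assuming an Impala daemon at a placeholder host (the server name, port, schema, credentials, and table name below are illustrative only):

```
/* Assign a libref through SAS/ACCESS Interface to Impala.       */
/* Server, port, schema, and credentials are placeholders.       */
libname myimp impala server="impala.example.com" port=21050
        schema=default user=sasuser password="XXXX";

/* Run a standard SAS procedure against a table stored in Impala; */
/* eligible query work is passed down to the Impala engine.       */
proc means data=myimp.sales_facts;
   var revenue;
run;
```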
With SAS/ACCESS Interface to Hadoop and SAS/ACCESS Interface
to Impala, you can read and write data to and from Hadoop as
if it were any other relational data source to which SAS can
connect. SAS/ACCESS Interface to Hadoop for Hive and HiveServer2
provides fast, efficient access to data stored in Hadoop through
HiveQL. You can access Hive tables as if
they were native SAS data sets and then analyze them using SAS.
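The access pattern can be sketched as follows, assuming a HiveServer2 instance at a placeholder host (the server name, credentials, and table and column names are illustrative only):

```
/* Assign a libref through SAS/ACCESS Interface to Hadoop (Hive). */
/* Server, port, schema, and credentials are placeholders.        */
libname myhdp hadoop server="hive.example.com" port=10000
        schema=default user=sasuser password="XXXX";

/* Hive tables now appear as SAS data sets under the libref. */
proc print data=myhdp.web_logs (obs=10);
run;

/* Analyze Hive data with standard SAS procedures. */
proc freq data=myhdp.web_logs;
   tables status_code;
run;
```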
You can also get virtual
views of Hadoop data without physically moving it. SAS Federation
Server offers simplified data access, administration, security, and
performance by creating a virtual data layer without physically moving
data. This frees business users from the complexities of the Hadoop
environment. They can view data in Hadoop and virtually blend it with
other database systems such as SAP HANA, IBM DB2, Oracle, or Teradata.
Improved security and governance features, such as dynamic data masking,
ensure that the right users have access to the right data.
The HADOOP procedure
enables SAS to interact with Hadoop data by running Apache Hadoop
code. Apache Hadoop is an open-source framework that is written in
Java and provides distributed data storage and processing of large
amounts of data.
The HADOOP procedure
interfaces with the Hadoop JobTracker, the service within
Hadoop that assigns MapReduce tasks to specific nodes in the cluster.
PROC HADOOP enables
you to submit the following:
- Hadoop Distributed File System (HDFS) commands
- MapReduce programs
- Pig language code
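A short sketch of submitting HDFS commands with PROC HADOOP; the configuration file path, user name, and HDFS paths are placeholders:

```
/* Point PROC HADOOP at the cluster configuration file.        */
/* The file path and credentials below are illustrative only.  */
proc hadoop options="/u/sasuser/hadoop-site.xml"
            username="sasuser" password="XXXX" verbose;

   /* HDFS statement: create a directory and copy a local file. */
   hdfs mkdir="/user/sasuser/testdir";
   hdfs copyfromlocal="/u/sasuser/local.txt"
        out="/user/sasuser/testdir/local.txt";
run;
```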