Hadoop is a general-purpose
data storage and computing platform that includes database-like tools,
such as Hive and HiveServer2. SAS/ACCESS Interface to Hadoop lets
you work with your data using SQL constructs through Hive and HiveServer2.
It also lets you access data directly from the underlying data storage
layer, the Hadoop Distributed File System (HDFS). This differs from
the traditional SAS/ACCESS engine behavior, which exclusively uses
database SQL to read and write data.
Cloudera Impala is an
open-source, massively parallel processing (MPP) query engine that
runs natively on Apache Hadoop. You can use it to issue SQL queries
to data stored in HDFS and Apache HBase without moving or transforming
data. Similar to other SAS/ACCESS engines, SAS/ACCESS Interface to
Impala lets you run SAS procedures against data that is stored in
Impala and returns the results to SAS.
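A minimal sketch of this pattern, assuming an Impala daemon at a placeholder host (the server name, port, schema, credentials, and table name below are illustrative only):

```
/* Assign a libref through SAS/ACCESS Interface to Impala.       */
/* Server, port, schema, and credentials are placeholders.       */
libname myimp impala server="impala.example.com" port=21050
        schema=default user=sasuser password="XXXX";

/* Run a standard SAS procedure against a table stored in Impala; */
/* eligible query work is passed down to the Impala engine.       */
proc means data=myimp.sales_facts;
   var revenue;
run;
```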
With SAS/ACCESS Interface to Hadoop and SAS/ACCESS Interface
to Impala, you can read and write data to and from Hadoop as
if it were any other relational data source to which SAS can
connect. SAS/ACCESS Interface to Hadoop for Hive and HiveServer2
provides fast, efficient access to data stored in Hadoop through
HiveQL. You can access Hive tables as if
they were native SAS data sets and then analyze them using SAS.
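The access pattern can be sketched as follows, assuming a HiveServer2 instance at a placeholder host (the server name, credentials, and table and column names are illustrative only):

```
/* Assign a libref through SAS/ACCESS Interface to Hadoop (Hive). */
/* Server, port, schema, and credentials are placeholders.        */
libname myhdp hadoop server="hive.example.com" port=10000
        schema=default user=sasuser password="XXXX";

/* Hive tables now appear as SAS data sets under the libref. */
proc print data=myhdp.web_logs (obs=10);
run;

/* Analyze Hive data with standard SAS procedures. */
proc freq data=myhdp.web_logs;
   tables status_code;
run;
```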
You can also get virtual
views of Hadoop data without physically moving it. SAS Federation
Server offers simplified data access, administration, security, and
performance by creating a virtual data layer without physically moving
data. This frees business users from the complexities of the Hadoop
environment. They can view data in Hadoop and virtually blend it with
other database systems such as SAP HANA, IBM DB2, Oracle, or Teradata.
Improved security and governance features, such as dynamic data masking,
ensure that the right users have access to the right data.
The HADOOP procedure
enables SAS to interact with Hadoop data by running Apache Hadoop
code. Apache Hadoop is an open-source framework that is written in
Java and provides distributed data storage and processing of large
amounts of data.
The HADOOP procedure
interfaces with the Hadoop JobTracker, the service within
Hadoop that assigns MapReduce tasks to specific nodes in the cluster.
PROC HADOOP enables
you to submit the following:
- Hadoop Distributed File System (HDFS) commands
- MapReduce programs
- Pig language code
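A short sketch of submitting HDFS commands with PROC HADOOP; the configuration file path, user name, and HDFS paths are placeholders:

```
/* Point PROC HADOOP at the cluster configuration file.        */
/* The file path and credentials below are illustrative only.  */
proc hadoop options="/u/sasuser/hadoop-site.xml"
            username="sasuser" password="XXXX" verbose;

   /* HDFS statement: create a directory and copy a local file. */
   hdfs mkdir="/user/sasuser/testdir";
   hdfs copyfromlocal="/u/sasuser/local.txt"
        out="/user/sasuser/testdir/local.txt";
run;
```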