HADOOP Procedure

Concepts: HADOOP Procedure

Using PROC HADOOP

The HADOOP procedure enables you to submit HDFS commands, MapReduce programs, and Pig language code against Hadoop data. To connect to the Hadoop server, you must supply a Hadoop configuration file that specifies the NameNode and JobTracker addresses for that server. In the file, fs.default.name is the URI (protocol specifier, host name, and port) that identifies the NameNode for the cluster, and mapred.job.tracker is the host name and port of the JobTracker. Here is an example of a Hadoop configuration file:
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://xxx.us.company.com:8020</value>
   </property>
   <property>
      <name>mapred.job.tracker</name>
      <value>xxx.us.company.com:8021</value>
   </property>
</configuration>
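Before you point PROC HADOOP at a configuration file, it can help to confirm that the file is well-formed and defines both addresses. The following is a minimal sketch in Python, run outside of SAS, and is not part of the HADOOP procedure. The names SAMPLE_CONFIG, REQUIRED, read_hadoop_properties, and missing_properties are hypothetical and exist only for this illustration. The sample text is the configuration file shown above.

```python
import xml.etree.ElementTree as ET

# Sample configuration text, copied from the example configuration file.
SAMPLE_CONFIG = """<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://xxx.us.company.com:8020</value>
   </property>
   <property>
      <name>mapred.job.tracker</name>
      <value>xxx.us.company.com:8021</value>
   </property>
</configuration>"""

# Assumption: these are the two properties that the connection needs.
REQUIRED = ("fs.default.name", "mapred.job.tracker")


def read_hadoop_properties(xml_text):
    """Return a dict that maps each property name to its value."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def missing_properties(props):
    """List the required properties that are absent or empty."""
    return [name for name in REQUIRED if not props.get(name)]


props = read_hadoop_properties(SAMPLE_CONFIG)
for name in REQUIRED:
    print(name, "=", props.get(name))
print("missing:", missing_properties(props))
```

An empty list from missing_properties suggests that the file names both the NameNode and the JobTracker. The check does not verify that either address is reachable.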
The PROC HADOOP statement supports several options that control access to the Hadoop server. For example, you can identify the Hadoop configuration file and specify the user ID and password for the Hadoop server. You can specify the server options on all PROC HADOOP statements. For the list of server options, see Hadoop Server Options.

Submitting Hadoop Distributed File System Commands

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system for the Hadoop framework. HDFS is designed to hold large amounts of data and to make that data accessible to many clients that are distributed across a network.
The PROC HADOOP HDFS statement submits HDFS commands to the Hadoop server. These commands are similar to the Hadoop shell commands that interact with HDFS and manipulate files. For the list of HDFS commands, see HDFS Statement.

Submitting MapReduce Programs

MapReduce is a parallel processing framework that enables developers to write programs that process vast amounts of data. The framework has two key functions: the Map function, which processes independent chunks of the input data in parallel and emits intermediate key-value pairs, and the Reduce function, which aggregates those intermediate results to produce the final analysis.
The PROC HADOOP MAPREDUCE statement submits a MapReduce program to a Hadoop cluster. See MAPREDUCE Statement.
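
The division of work between the Map and Reduce functions can be illustrated with a word count. The following is a minimal single-process sketch in Python of the programming model only. It is not a Hadoop program and it is not what the MAPREDUCE statement submits. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the explicit shuffle step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values that were emitted for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run the map, shuffle, and reduce steps over a list of chunks."""
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk is mapped in parallel
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["pig hadoop pig", "hadoop hdfs"])
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread across the nodes of a cluster.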

Submitting Pig Language Code

The Apache Pig language is a high-level programming language whose code is compiled into MapReduce programs that run on Hadoop.
The PROC HADOOP PIG statement submits Pig language code to a Hadoop cluster. See PIG Statement.