Configure SAS High-Performance Deployment for Hadoop

Alternative: Commercial Distributions

If your site uses a commercial distribution of Hadoop, refer to the product documentation instead of the following sections.

Install a Java Runtime Environment

A Java Runtime Environment or Java Development Kit is necessary to run SAS High-Performance Deployment for Hadoop on the additional machines. Make sure that it is installed in the same path that is used on the existing machines.
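For example, you can check the Java location and version on one of the existing machines and then install the same release at the same path on each additional machine. The host name in this sketch is an example; substitute one of your own machines:

ssh grid101.example.com 'readlink -f "$(which java)"; java -version'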

Install Hadoop on the Additional Machines

Install Hadoop on the additional machines with the same package, sashadoop.tar.gz, that is installed on the existing machines. Install the software with the same user account and use the same installation path as the existing machines.
Typically, you create a text file with the host names for all the machines in the cluster and supply it to the installation program. In this case, create a text file with the host names for the new machines only. Use a filename such as ~/addhosts.txt. When you run the installation program, hadoopInstall, supply the fully qualified path to the addhosts.txt file.
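For example, the following sketch creates the file for two hypothetical new machines; substitute your own host names:

cat > ~/addhosts.txt <<'EOF'
grid107.example.com
grid108.example.com
EOF

Remember to pass the fully qualified path (for example, /home/hadoop/addhosts.txt rather than ~/addhosts.txt) to the installation program.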
The previous tasks result in a new cluster that is independent of the existing cluster. When the configuration files are overwritten in the next step, the additional machines no longer belong to their own cluster. They become part of the existing cluster.
When you run the installation program on the new machines, if you are unsure of the directory paths to specify, you can view the following files on an existing machine:
$HADOOP_HOME/etc/hadoop/hdfs-site.xml
Look for the values of the dfs.name.dir and dfs.data.dir properties.
$HADOOP_HOME/etc/hadoop/mapred-site.xml
Look for the values of the mapred.system.dir and mapred.local.dir properties.
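For example, assuming that each property's <name> and <value> elements appear on adjacent lines in the XML files, you can list the relevant values quickly with grep:

grep -A 1 -E 'dfs.name.dir|dfs.data.dir' $HADOOP_HOME/etc/hadoop/hdfs-site.xml
grep -A 1 -E 'mapred.system.dir|mapred.local.dir' $HADOOP_HOME/etc/hadoop/mapred-site.xml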

Update the Hadoop Configuration Files

As the user ID that runs HDFS, modify the $HADOOP_HOME/etc/hadoop/slaves file on the existing machine that is used for the NameNode. Add the host names of the additional machines to the file.
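For example, the following sketch appends two hypothetical host names; substitute the names of your new machines:

echo "grid107.example.com" >> $HADOOP_HOME/etc/hadoop/slaves
echo "grid108.example.com" >> $HADOOP_HOME/etc/hadoop/slaves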
You can use the simcp command to copy the file and other configuration files to the new machines:
/opt/webmin/utilbin/simcp -g newnodes $HADOOP_HOME/etc/hadoop/slaves $HADOOP_HOME/etc/hadoop/

/opt/webmin/utilbin/simcp -g newnodes $HADOOP_HOME/etc/hadoop/master $HADOOP_HOME/etc/hadoop/

/opt/webmin/utilbin/simcp -g newnodes $HADOOP_HOME/etc/hadoop/core-site.xml $HADOOP_HOME/etc/hadoop/

/opt/webmin/utilbin/simcp -g newnodes $HADOOP_HOME/etc/hadoop/hdfs-site.xml $HADOOP_HOME/etc/hadoop/
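If the companion simsh utility is available in the same directory (this is an assumption; it is not required for the procedure), you can confirm that the new machines received identical copies by comparing checksums:

/opt/webmin/utilbin/simsh -g newnodes md5sum $HADOOP_HOME/etc/hadoop/hdfs-site.xml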

Start the DataNodes on the Additional Machines

For each of the new machines, run a command that is similar to the following example:
ssh hostname $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
Note: Run the command as the user account that is used for HDFS.
Make sure that you specify the actual path to hadoop-daemon.sh. After you start the DataNode process on each new machine, open the http://namenode-machine:50070/dfshealth.jsp page to check the number of live nodes.
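For several new machines, you can wrap the start command in a simple loop before you check the live node count. Note that $HADOOP_HOME expands on the machine where you run the loop; because the installation path is identical on every machine, the resulting path is also valid on the remote machines. The host names in this sketch are hypothetical:

# Start a DataNode on each new machine (run as the HDFS user account)
for host in grid107.example.com grid108.example.com; do
    ssh "$host" "$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"
done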
Run $HADOOP_HOME/bin/hdfs dfsadmin -printTopology to confirm that the new machines are part of the cluster. The following listing shows a sample of the command output:
Rack: /default-rack
    192.168.8.148:50010 (grid103.example.com)
    192.168.8.153:50010 (grid104.example.com)
    192.168.8.217:50010 (grid106.example.com)
    192.168.8.230:50010 (grid105.example.com)
    192.168.9.158:50010 (grid099.example.com)
    192.168.9.159:50010 (grid100.example.com)
    192.168.9.160:50010 (grid101.example.com)

Copy a File to HDFS

If you put HDFS in safe mode at the beginning of this procedure, take the file system out of safe mode with a command that is similar to the following:
$HADOOP_HOME/bin/hdfs dfsadmin -safemode leave
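If you are not sure whether HDFS is still in safe mode, you can check its current state first:

$HADOOP_HOME/bin/hdfs dfsadmin -safemode get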
To confirm that the additional machines are used, you can copy a file to HDFS and then list the locations of the blocks. Use a command that is similar to the following:
$HADOOP_HOME/bin/hadoop fs -D dfs.blocksize=512 -put /etc/fstab /hps
Note: The very small block size in this example increases the number of blocks that are written, which makes it more likely that some of the blocks are placed on the new machines.
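To confirm that the block size override took effect, you can print the block size that HDFS recorded for the file (the %o format specifier of the stat command prints the block size in bytes, so this example should print 512):

$HADOOP_HOME/bin/hadoop fs -stat %o /hps/fstab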
You can list the block locations with a command that is similar to the following:
$HADOOP_HOME/bin/hdfs fsck /hps/fstab -files -locations -blocks
Review the output to check for the IP addresses of the new machines. The following listing shows sample output:
Connecting to namenode via http://0.0.0.0:50070
FSCK started by hdfs (auth:SIMPLE) from /192.168.9.156 for path /hps/fstab at Wed Jan 30 09:45:24 EST 2013
/hps/fstab 2093 bytes, 5 block(s): OK
0. BP-1250061202-192.168.9.156-1358965928729:blk_-2796832940080983787_1074 len=512 repl=2 [192.168.8.217:50010, 192.168.8.230:50010]
1. BP-1250061202-192.168.9.156-1358965928729:blk_-7759726019690621913_1074 len=512 repl=2 [192.168.8.230:50010, 192.168.8.153:50010]
2. BP-1250061202-192.168.9.156-1358965928729:blk_-6783529658608270535_1074 len=512 repl=2 [192.168.9.159:50010, 192.168.9.158:50010]
3. BP-1250061202-192.168.9.156-1358965928729:blk_1083456124028341178_1074 len=512 repl=2 [192.168.9.158:50010, 192.168.9.160:50010]
4. BP-1250061202-192.168.9.156-1358965928729:blk_-4083651737452524600_1074 len=45 repl=2 [192.168.8.230:50010, 192.168.8.153:50010]
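Rather than scanning the listing by eye, you can also filter the fsck output for the addresses of the new machines. The addresses in this grep pattern are examples; substitute the IP addresses of your new machines:

$HADOOP_HOME/bin/hdfs fsck /hps/fstab -files -locations -blocks | grep -E '192.168.8.217|192.168.8.230'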
Delete the sample file:
$HADOOP_HOME/bin/hadoop fs -rm /hps/fstab
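Note: If the HDFS trash feature (the fs.trash.interval property) is enabled on your cluster, the -rm command moves the file to the user's trash directory rather than deleting it immediately. Add the -skipTrash option to remove it at once:

$HADOOP_HOME/bin/hadoop fs -rm -skipTrash /hps/fstab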