is a valid SAS name that serves as a shortcut name to associate with the SASHDAT tables that are stored in the Hadoop Distributed File System (HDFS). The name must conform to the rules for SAS names. A libref cannot exceed eight characters.

SASHDAT

is the engine name for the SAS Data in HDFS engine.

Optional Arguments

COPIES=n

specifies the number of replications to make for the data set (beyond the original blocks). The default value is 2 when the INNAMEONLY option is specified and otherwise is 1. Replicated blocks are used to provide fault tolerance for HDFS. If a machine in the cluster becomes unavailable, then the blocks needed for the SASHDAT file can be retrieved from replications on other machines. If you specify COPIES=0, then the original blocks are distributed, but no replications are made and there is no fault tolerance for the data.
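As a minimal sketch, the following LIBNAME statement sets COPIES= for the libref. The host name, install location, and HDFS path are the placeholder values used in the examples later in this section, not required values.

```sas
/* Sketch: request two replications of each block for fault tolerance. */
/* Host, install location, and path are placeholder values.            */
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
    path="/dept" copies=2;
```

Specifying COPIES=0 instead would distribute the original blocks without any replications.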

HOST="grid-host"

specifies the grid host that has a running Hadoop NameNode. Enclose the host name in quotation marks. If you do not specify the HOST= option, the host name is determined from the GRIDHOST environment variable.

INNAMEONLY= YES | NO

specifies whether data that is added to HDFS is sent as a single block to the Hadoop NameNode for distribution. This option is appropriate for smaller data sets.

Alias

NODIST
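The following sketch shows one way that INNAMEONLY= might be used for a small table. The libref name small, the connection values, and the choice of Sashelp.Class as a small input table are illustrative assumptions.

```sas
/* Sketch: send a small table to the NameNode as a single block. */
/* The libref name and connection values are placeholders.       */
libname small sashdat host="grid001.example.com" install="/opt/TKGrid"
    path="/dept" innameonly=yes;

/* Sashelp.Class is used here only as a conveniently small input table. */
data small.class;
    set sashelp.class;
run;
```

Because INNAMEONLY= is specified, the default value for COPIES= is 2, as described for the COPIES= option.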

INSTALL="grid-install-location"

specifies the path to the TKGrid software on the grid host. If you do not specify this option, the path is determined from the GRIDINSTALLLOC environment variable.

NODEFAULTFORMAT= YES | NO

specifies whether a default format that is applied to a variable is reported by the engine.

If you do not specify a format for a variable when you add a table to HDFS, the engine adds a default format. The server applies BEST. for numeric variables and $. for character variables.

The engine displays this "forced" format in procedures that list variable attributes, such as the CONTENTS and DATASETS procedures. If you specify NODEFAULTFORMAT=YES, then the display of the "forced" format is suppressed.

Note: This setting does not control whether formats are applied to a variable.
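As a hedged illustration, the following statements assign a libref with NODEFAULTFORMAT=YES and then list variable attributes. The sketch assumes that a table named prdsale already exists in the /dept directory; the connection values are placeholders.

```sas
/* Sketch: suppress the display of engine-supplied default formats. */
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
    path="/dept" nodefaultformat=yes;

/* Assumes hdfs.prdsale exists. Variables that were added without a */
/* format are listed without the "forced" default format.           */
proc contents data=hdfs.prdsale;
run;
```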

PATH="HDFS-path"

specifies the fully qualified path to the HDFS directory to use for SASHDAT files. You do not need to specify this option in the LIBNAME statement because it can be specified as a data set option.
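A minimal sketch of supplying the path as a data set option instead of in the LIBNAME statement follows. The connection values and the /dept directory are placeholders.

```sas
/* Sketch: no PATH= in the LIBNAME statement. */
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid";

/* The HDFS directory is supplied per table with the PATH= data set option. */
data hdfs.prdsale(path="/dept");
    set sashelp.prdsale;
run;
```

This pattern can be convenient when one libref is used to write tables to several HDFS directories.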

VERBOSE= YES | NO

specifies whether the engine accepts and reports extra messages from TKGrid. For more information, see the VERBOSE= option for the SAS LASR Analytic Server engine.

Examples

Example 1: Submitting a LIBNAME Statement Using the Defaults

Program

The following example shows the code for connecting to a Hadoop NameNode with a LIBNAME statement:

option set=GRIDHOST="grid001.example.com";  /* 1 */
option set=GRIDINSTALLLOC="/opt/TKGrid";

libname hdfs sashdat;  /* 2 */

NOTE: Libref HDFS was successfully assigned as follows:
      Engine:        SASHDAT
      Physical Name: grid001.example.com

Program Description

1	The host name for the Hadoop NameNode is specified in the GRIDHOST environment variable.
2	The LIBNAME statement uses the host name from the GRIDHOST environment variable and the path to TKGrid from the GRIDINSTALLLOC environment variable. The PATH= and COPIES= options can be specified as data set options.

Example 2: Submitting a LIBNAME Statement Using the HOST=, INSTALL=, and PATH= Options

The following example shows the code for submitting a LIBNAME statement with the HOST=, INSTALL=, and PATH= options.

libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
    path="/user/sasdemo";

NOTE: Libref HDFS was successfully assigned as follows:
      Engine:        SASHDAT
      Physical Name: Directory '/user/sasdemo' of HDFS cluster on host 
                     grid001.example.com

Example 3: Adding Tables to HDFS

The following code sample demonstrates the LIBNAME statement and the REPLACE= and BLOCKSIZE= data set options. The LABEL= data set option is common to many engines.

libname arch "/data/archive";
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid" 
    path="/dept";

data hdfs.allyears(label="Sales records for previous years"
                   replace=yes blocksize=32m);
  set arch.sales2012
      arch.sales2011
      ...
  ;
run;

Example 4: Adding a Table to HDFS with Partitioning

The following code sample demonstrates the PARTITION= and ORDERBY= data set options. The rows are partitioned according to the unique combinations of the formatted values for the Year and Month variables. Within each partition, the rows are sorted by descending values of the Prodtype variable. For more information, see Data Partitioning and Ordering.

libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
    path="/dept";

data hdfs.prdsale(partition=(year month) orderby=(descending prodtype));
    set sashelp.prdsale;
run;

Example 5: Removing Tables from HDFS

You can remove tables from HDFS with the DATASETS procedure.

libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
    path="/dept";

proc datasets lib=hdfs;
    delete allyears;
run;

NOTE: Deleting HDFS.ALLYEARS (memtype=DATA).

Example 6: Creating a SASHDAT File from Another SASHDAT File

The following example shows copying a data set from HDFS, adding a calculated variable, and then writing the data to HDFS in the same library. The BLOCKSIZE= data set option is used to override the default 8-megabyte block size that is created by SAS High-Performance Analytics procedures. The COPIES=0 data set option is used to specify that no redundant blocks are created for the output SASHDAT file.

libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
    path="/dept";

proc hpds2 
    in = hdfs.allyears(where=(region=212))  /* 1 */
    out = hdfs.avgsales(blocksize=32m copies=0);  /* 2 */

    data DS2GTF.out;
        dcl double avgsales;
        method run();
            set DS2GTF.in;
            avgsales = avg(month1-month12);
        end;
    enddata;
run;

1	The WHERE clause is used to subset the data in the input SASHDAT file.
2	The BLOCKSIZE= and COPIES= options are used to override the default values.

Example 7: Working with CSV Files

The comma-separated value (CSV) file format is a popular format for files stored in HDFS. The SAS Data in HDFS engine can read these files in parallel. The engine does not write CSV files.

List the Variables in a CSV File

The following example shows how to access a CSV file in HDFS and use the CONTENTS procedure to list the variables in the file. For this example, the first line in the CSV file lists the variable names. The GETNAMES= data set option is used to read them from the first line in the file.

libname csvfiles sashdat host="grid001.example.com" install="/opt/TKGrid"
    path="/user/sasdemo/csv";

proc contents data=csvfiles.rep(filetype=csv getnames=yes);
run;

[Figure: List the Variables in a CSV File with the CONTENTS Procedure. Output of the CONTENTS procedure for a CSV file in HDFS.]

Convert a CSV File to SASHDAT

The engine is not designed to transfer data from HDFS to a SAS client. As a consequence, the contents of a CSV file can be accessed only by a SAS High-Performance Analytics procedure that runs on the same cluster that is used for HDFS. The SAS High-Performance Analytics procedures can read the data because the procedures are designed to read data in parallel from a co-located data provider.

The following code sample shows how to convert a CSV file to a SASHDAT file with the HPDS2 procedure.

option set=GRIDHOST="grid001.example.com";  /* 1 */
option set=GRIDINSTALLLOC="/opt/TKGrid";

libname csvfiles sashdat path="/user/sasdemo/csv";

proc hpds2 in=csvfiles.rep(filetype=csv getnames=yes)  /* 2 */
    out=csvfiles.rephdat(path="/user/sasdemo" copies=0 blocksize=32m);  /* 3 */

    data DS2GTF.out;
        method run();
            set DS2GTF.in;
        end;
    enddata;
run;

1	The values for the GRIDHOST and GRIDINSTALLLOC environment variables are read by the SAS Data in HDFS engine in the LIBNAME statement and by the HPDS2 procedure.
2	The FILETYPE=CSV data set option enables the engine to read the CSV file. The GETNAMES= data set option is used to read the variable names from the first line in the CSV file.
3	The PATH= data set option is used to store the output as `/user/sasdemo/rephdat.sashdat`. The COPIES=0 data set option is used to specify that no redundant blocks are created for the rephdat.sashdat file.