LIBNAME Statement Syntax
Associates a SAS libref with SASHDAT tables stored
in HDFS.
Valid in: Anywhere
Category: Data Access
Syntax
LIBNAME libref SASHDAT <options>;
Required Arguments
libref
is a valid SAS name that serves as a shortcut name to associate with
the SASHDAT tables that are stored in the Hadoop Distributed File
System (HDFS). The name must conform to the rules for SAS names. A
libref cannot exceed eight characters.
SASHDAT
is the engine name for the SAS Data in HDFS engine.
Optional Arguments
COPIES=n
specifies the number of replications to make for the data set (beyond
the original blocks). The default value is 2 when the INNAMEONLY
option is specified and otherwise is 1. Replicated blocks are used to
provide fault tolerance for HDFS. If a machine in the cluster becomes
unavailable, then the blocks needed for the SASHDAT file can be
retrieved from replications on other machines. If you specify
COPIES=0, then the original blocks are distributed, but no
replications are made and there is no fault tolerance for the data.
HOST="grid-host"
specifies the grid host that has a running Hadoop NameNode. Enclose
the host name in quotation marks. If you do not specify the HOST=
option, the host name is determined from the GRIDHOST environment
variable.
INNAMEONLY= YES | NO
specifies whether data that is added to HDFS is sent as a single
block to the Hadoop NameNode for distribution. This option is
appropriate for smaller data sets.
INSTALL="grid-install-location"
specifies the path to the TKGrid software on the grid host. If you do
not specify this option, the path is determined from the
GRIDINSTALLLOC environment variable.
NODEFAULTFORMAT= YES | NO
specifies whether a default format that is applied to a variable is
reported by the engine.
If you do not specify a format for a variable when you add a table to
HDFS, the engine adds a default format. The server applies BEST. for
numeric variables and $. for character variables.
The engine displays this "forced" format in procedures that list
variable attributes, such as the CONTENTS and DATASETS procedures. If
you specify NODEFAULTFORMAT=YES, then the display of the "forced"
format is suppressed.
Note: This setting does not control whether formats are applied to a
variable.
PATH="HDFS-path"
specifies the fully qualified path to the HDFS directory to use for
SASHDAT files. You do not need to specify this option in the LIBNAME
statement because it can be specified as a data set option.
VERBOSE= YES | NO
specifies whether the engine accepts and reports extra messages from
TKGrid. For more information, see the VERBOSE= option for the SAS
LASR Analytic Server engine.
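The optional arguments can be combined in a single LIBNAME statement. The following is a minimal sketch rather than a recommended configuration. It assumes the host name, TKGrid location, and HDFS directory that the examples in this section use, and it assumes that a table named allyears already exists in that directory. The option values are chosen only to illustrate the syntax.

```sas
/* Sketch only: host, install location, and path are the illustrative */
/* values used elsewhere in this section, not required settings.      */
libname hdfs sashdat
   host="grid001.example.com"   /* machine that runs the Hadoop NameNode */
   install="/opt/TKGrid"        /* path to TKGrid on the grid host       */
   path="/dept"                 /* HDFS directory for SASHDAT files      */
   copies=0                     /* no replications, no fault tolerance   */
   nodefaultformat=yes          /* suppress display of default formats   */
   verbose=yes;                 /* report extra messages from TKGrid     */

/* Assumes hdfs.allyears exists. With NODEFAULTFORMAT=YES, the default */
/* formats that the engine added are not shown in the variable list.   */
proc contents data=hdfs.allyears;
run;
```

Specifying COPIES=0 here trades fault tolerance for storage space, so it is better suited to data that can be re-created easily.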
Examples
Example 1: Submitting a LIBNAME Statement Using the Defaults
Program
The following example shows the code for connecting to a Hadoop
NameNode with a LIBNAME statement:
option set=GRIDHOST="grid001.example.com";   /* 1 */
option set=GRIDINSTALLLOC="/opt/TKGrid";
libname hdfs sashdat;                        /* 2 */
NOTE: Libref HDFS was successfully assigned as follows:
Engine: SASHDAT
Physical Name: grid001.example.com
Program Description
1   The host name for the Hadoop NameNode is specified in the GRIDHOST
    environment variable.
2   The LIBNAME statement uses the host name from the GRIDHOST
    environment variable and the path to TKGrid from the
    GRIDINSTALLLOC environment variable. The PATH= and COPIES=
    options can be specified as data set options.
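Because the LIBNAME statement in this example specifies neither PATH= nor COPIES=, those values can be supplied when a table is written. The following is a minimal sketch of that pattern. The table name prdsale_copy and the HDFS directory are hypothetical, and Sashelp.Prdsale is used only because it is a sample table that a later example also reads.

```sas
/* Assumes the GRIDHOST and GRIDINSTALLLOC environment variables are set */
/* and that libref HDFS was assigned with the defaults, as shown above.  */
data hdfs.prdsale_copy(path="/user/sasdemo"   /* hypothetical HDFS directory */
                       copies=1);             /* one replication             */
   set sashelp.prdsale;
run;
```

Supplying the path per table keeps one libref usable for several HDFS directories, at the cost of repeating the option on each table reference.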
Example 2: Submitting a LIBNAME Statement Using the HOST=, INSTALL=, and PATH= Options
The following example shows the code for submitting a LIBNAME
statement with the HOST=, INSTALL=, and PATH= options.
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
path="/user/sasdemo";
NOTE: Libref HDFS was successfully assigned as follows:
Engine: SASHDAT
Physical Name: Directory '/user/sasdemo' of HDFS cluster on host
grid001.example.com
Example 3: Adding Tables to HDFS
The following code sample demonstrates the LIBNAME statement and the
REPLACE= and BLOCKSIZE= data set options. The LABEL= data set option
is common to many engines.
libname arch "/data/archive";
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
path="/dept";
data hdfs.allyears(label="Sales records for previous years"
replace=yes blocksize=32m);
set arch.sales2012
arch.sales2011
...
;
run;
Example 4: Adding a Table to HDFS with Partitioning
The following code sample demonstrates the PARTITION= and ORDERBY=
data set options. The rows are partitioned according to the unique
combinations of the formatted values for the Year and Month
variables. Within each partition, the rows are sorted by descending
values of the Prodtype variable.
For more information, see Data Partitioning and Ordering.
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
path="/dept";
data hdfs.prdsale(partition=(year month) orderby=(descending prodtype));
set sashelp.prdsale;
run;
Example 5: Removing Tables from HDFS
You can remove tables from HDFS with the DATASETS procedure.
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
path="/dept";
proc datasets lib=hdfs;
delete allyears;
run;
NOTE: Deleting HDFS.ALLYEARS (memtype=DATA).
Example 6: Creating a SASHDAT File from Another SASHDAT File
The following example shows copying a data set from HDFS, adding a
calculated variable, and then writing the data to HDFS in the same
library.
The BLOCKSIZE= data set option is used to override the default
8-megabyte block size that is created by SAS High-Performance
Analytics procedures. The COPIES=0 data set option is used to specify
that no redundant blocks are created for the output SASHDAT file.
libname hdfs sashdat host="grid001.example.com" install="/opt/TKGrid"
path="/dept";
proc hpds2
in = hdfs.allyears(where=(region=212))         /* 1 */
out = hdfs.avgsales(blocksize=32m copies=0);   /* 2 */
data DS2GTF.out;
dcl double avgsales;
method run();
set DS2GTF.in;
avgsales = avg(month1-month12);
end;
enddata;
run;
1   The WHERE clause is used to subset the data in the input SASHDAT
    file.
2   The BLOCKSIZE= and COPIES= options are used to override the
    default values.
Example 7: Working with CSV Files
The comma-separated value (CSV) file format is a popular format for
files stored in HDFS. The SAS Data in HDFS engine can read these
files in parallel. The engine does not write CSV files.
List the Variables in a CSV File
The following example shows how to access a CSV file in HDFS and use
the CONTENTS procedure to list the variables in the file. For this
example, the first line in the CSV file lists the variable names.
The GETNAMES= data set option is used to read them from the first
line in the file.
libname csvfiles sashdat host="grid001.example.com" install="/opt/TKGrid"
path="/user/sasdemo/csv";
proc contents data=csvfiles.rep(filetype=csv getnames=yes);
run;
[Figure: List the Variables in a CSV File with the CONTENTS Procedure]
Convert a CSV File to SASHDAT
The engine is not designed to transfer data from HDFS to a SAS
client. As a consequence, the contents of a CSV file can be accessed
only by a SAS High-Performance Analytics procedure that runs on the
same cluster that is used for HDFS. The SAS High-Performance
Analytics procedures can read the data because the procedures are
designed to read data in parallel from a co-located data provider.
The following code sample shows how to convert a CSV file to a
SASHDAT file with the HPDS2 procedure.
option set=GRIDHOST="grid001.example.com";                           /* 1 */
option set=GRIDINSTALLLOC="/opt/TKGrid";
libname csvfiles sashdat path="/user/sasdemo/csv";
proc hpds2 in=csvfiles.rep(filetype=csv getnames=yes)                /* 2 */
out=csvfiles.rephdat(path="/user/sasdemo" copies=0 blocksize=32m);   /* 3 */
data DS2GTF.out;
method run();
set DS2GTF.in;
end;
enddata;
run;
1   The values for the GRIDHOST and GRIDINSTALLLOC environment
    variables are read by the SAS Data in HDFS engine in the LIBNAME
    statement and by the HPDS2 procedure.
2   The FILETYPE=CSV data set option enables the engine to read the
    CSV file. The GETNAMES= data set option is used to read the
    variable names from the first line in the CSV file.
3   The PATH= data set option is used to store the output as
    /user/sasdemo/rephdat.sashdat. The COPIES=0 data set option is
    used to specify that no redundant blocks are created for the
    rephdat.sashdat file.
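After the conversion, one way to confirm the result is to list the variables in the new SASHDAT file, as was done earlier for the CSV file. The following sketch assumes that the HPDS2 step above completed without error. It also assumes that the PATH= data set option is accepted on an input table reference in the same way that it is on output; the option is repeated here because the csvfiles libref points to /user/sasdemo/csv rather than to the output directory.

```sas
/* Assumes rephdat.sashdat was written to /user/sasdemo by the HPDS2 step. */
/* FILETYPE= and GETNAMES= are omitted because the table is now SASHDAT.   */
proc contents data=csvfiles.rephdat(path="/user/sasdemo");
run;
```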