FILENAME Statement, Hadoop Access Method

Enables you to access files on a Hadoop Distributed File System (HDFS) whose location is specified in a configuration file.
Valid in: Anywhere
Category: Data Access
Restriction: The Hadoop access method can access only Hadoop configurations on UNIX-based systems.

Syntax

FILENAME fileref HADOOP 'external-file' <hadoop-options>;

Required Arguments

fileref
is a valid fileref.
Tip:The association between a fileref and an external file lasts only for the duration of the SAS session or until you change it or discontinue it with another FILENAME statement.
HADOOP
specifies the access method that enables you to use Hadoop to read from or write to a file from any host machine that you can connect to on a Hadoop configuration.
'external-file'
specifies the physical name of the HDFS file that you want to read from or write to. The physical name is the name that is recognized by the operating environment.
Operating environment:For details about specifying the physical names of external files, see the SAS documentation for your operating environment.
Tip:Specify external-file when you assign a fileref to an external file. You can associate a fileref with a single file or with an aggregate file storage location.

Hadoop Options

hadoop-options can be any of the following values:

BUFFERLEN=bufferlen
specifies the maximum buffer length of the data that is passed to Hadoop for its I/O operations.
Default:503808
Restriction:The maximum buffer length is 1000000.
Tip:Specifying a buffer length that is larger than the default could result in performance improvements.
CFG="physical-pathname-of-hadoop-configuration-file" | fileref-that-references-a-hadoop-configuration-file
specifies the configuration file that contains the connection settings for a specific Hadoop cluster.
CONCAT
specifies that the HDFS directory name that is specified on the FILENAME HADOOP statement is considered a wildcard specification. The concatenation of all the files in the directory is treated as a single logical file and read as one file.
Restriction:This works for input only.
Tip:For best results, do not concatenate text and binary files.
DIR
enables you to access files in an HDFS directory.
Requirement:You must use valid directory syntax for the specified host.
Interaction:Specify the HDFS directory name in the external-file argument.
ENCODING='encoding-value'
specifies the encoding to use when SAS is reading from or writing to an external file. The value for ENCODING= indicates that the external file has a different encoding from the current session encoding.
Default:SAS assumes that an external file is in the same encoding as the session encoding.
Note:When you read data from an external file, SAS transcodes the data from the specified encoding to the session encoding. When you write data to an external file, SAS transcodes the data from the session encoding to the specified encoding.
See:“Encoding Values in SAS Language Elements” in the SAS National Language Support (NLS): Reference Guide
FILEEXT
specifies that a file extension is automatically appended to the filename when you use the DIR option.
Interaction:The autocall macro facility always passes the extension .SAS to the file access method as the extension to use when opening files in the autocall library. The DATA step always passes the extension .DATA. If you define a fileref for an autocall macro library and the files in that library have a file extension of .SAS, use the FILEEXT option. If the files in that library do not have an extension, do not use the FILEEXT option. For example, if you define a fileref for an input file in the DATA step and the file X has an extension of .DATA, you would use the FILEEXT option to read the file X.DATA. If you use the INFILE or FILE statement, enclose the member name and extension in quotation marks to preserve case.
Tip:The FILEEXT option is ignored if you specify a file extension on the FILE or INFILE statement.
LOWCASE_MEMNAME
enables autocall macro retrieval of lowercase directory or member names from HDFS systems.
Restriction:SAS autocall macro retrieval always searches for uppercase directory member names. Mixed-case directory or member names are not supported.
LRECL=logical-record-length
specifies the logical record length of the data.
Default:65536
Interaction:Alternatively, you can specify a global logical record length by using the LRECL= system option. For more information, see SAS System Options: Reference.
MOD
places the file in Update mode and appends updates to the bottom of the file.
PASS='password'
specifies the password to use with the user name that is specified in the USER option.
Requirement:The password is case sensitive and it must be enclosed in single or double quotation marks.
Tip:To use an encoded password, run the PWENCODE procedure to disguise the text string, and then specify the encoded password in the PASS= option. For more information, see the PWENCODE procedure in the Base SAS Procedures Guide.
PROMPT
specifies to prompt for the user login, the password, or both, if necessary.
Interaction:The USER= and PASS= options override the PROMPT option if all three options are specified. If you specify the PROMPT option and do not specify the USER= or PASS= option, you are prompted for a user ID and password.
RECFM=record-format
specifies the record format of the external file, where record-format is one of the following three values:
S
is stream-record format. Data is read in binary mode.
Tip:The amount of data that is read is controlled by the current LRECL value or the value of the NBYTE= variable in the INFILE statement. The NBYTE= option specifies a variable that is equal to the amount of data to be read. This amount must be less than or equal to LRECL. To avoid problems when you read large binary files like PDF or GIF, set NBYTE=1 to read one byte at a time.
See:The NBYTE= option in the INFILE statement in the SAS Statements: Reference
F
is fixed-record format. In this format, records have fixed lengths, and they are read in binary mode.
V
is variable-record format (the default). In this format, records have varying lengths, and they are read in text (stream) mode.
Tip:Any record larger than LRECL is truncated.
Default:V
USER='username'
specifies the user name that is used to log on to the Hadoop system.
Requirement: The user name is case sensitive and it must be enclosed in single or double quotation marks.
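
The tip for the PASS= option describes a two-step workflow: encode the password with the PWENCODE procedure, and then supply the encoded string to PASS=. The following is a minimal sketch of that workflow, not a definitive implementation. The HDFS path, the configuration file path, and the placeholder string encoded-password-from-log are assumptions for illustration only; the actual encoded value must be copied from your own SAS log.

```sas
/* Step 1: encode the password. PROC PWENCODE writes the   */
/* encoded string to the SAS log.                           */
proc pwencode in='xxxx';
run;

/* Step 2: copy the encoded string from the log into PASS=. */
/* 'encoded-password-from-log' is a placeholder, not a real */
/* encoded value. Paths and user name are hypothetical.     */
filename foo hadoop '/user/xxxx/acctdata.dat' cfg='/path/cfg.xml'
   user='xxxx' pass='encoded-password-from-log';
```

Running the encoding step in a separate session keeps the clear-text password out of the program that assigns the fileref.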

Details

An HDFS system has defined levels of permissions at both the directory and file level. The Hadoop access method honors those permissions. For example, if a file is available as read-only, you cannot modify it.
Operating Environment Information: Using the FILENAME statement requires information that is specific to your operating environment. The Hadoop access method is fully documented here. For more information about how to specify filenames, see the SAS documentation for your operating environment.

Examples

Example 1: Writing to a New Member of a Directory

This example writes the file shoes to the directory testing.
filename out hadoop '/user/testing/' cfg="/path/cfg.xml" user='xxxx'
   pass='xxxx' recfm=v lrecl=32167 dir;
  
data _null_;
   file out(shoes) ;
   put 'write data to shoes file';
run;
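
As a follow-on to this example, the MOD option described under Hadoop Options could be used to append to the shoes file instead of replacing it. The sketch below assumes that the file already exists at /user/testing/shoes, that the cluster permits appends, and that the same placeholder configuration path and credentials apply; the fileref name app is hypothetical.

```sas
/* Assign a fileref directly to the existing file and open  */
/* it in Update mode with MOD, so that output is appended   */
/* to the bottom of the file.                               */
filename app hadoop '/user/testing/shoes' cfg="/path/cfg.xml" user='xxxx'
   pass='xxxx' mod;

data _null_;
   file app;
   put 'append data to shoes file';
run;
```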

Example 2: Creating and Using a Configuration File

This example accesses the file acctdata.dat at site xxx.unx.sas.com. A DATA step first writes the configuration file, and the FILENAME statement then references that file through the cfg fileref.
filename cfg  'U:/test.cfg';

data _null_;
   file cfg;
   input;
   put _infile_;
   datalines4;
<configuration>
<property>
   <name>fs.default.name</name>
   <value>hdfs://xxx.unx.sas.com:8020</value>
</property>
<property>
   <name>mapred.job.tracker</name>
   <value>xxx.unx.sas.com:8021</value>
</property>
</configuration>
 
;;;;

filename foo hadoop '/user/xxxx/acctdata.dat' cfg=cfg user='xxxx'
    pass='xxxx' debug recfm=s lrecl=65536 bufferlen=65536;
  
data _null_;
   infile foo truncover;
   input a $1024.;
   put a;
run;
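
The FILENAME statement in this example supplies USER= and PASS= inline. Based on the description of the PROMPT option, a variation that omits both and specifies PROMPT should cause SAS to prompt for the user ID and password. The sketch below reuses the cfg fileref and HDFS path from this example and assumes an interactive SAS session in which a prompt can be displayed.

```sas
/* No USER= or PASS= here: PROMPT asks for the credentials  */
/* when the fileref is used. Assumes the cfg fileref from   */
/* the example above is still assigned.                     */
filename foo hadoop '/user/xxxx/acctdata.dat' cfg=cfg prompt
   recfm=s lrecl=65536;

data _null_;
   infile foo truncover;
   input a $1024.;
   put a;
run;
```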

Example 3: Buffering 1MB of Data during a File Read

This example uses the BUFFERLEN option to buffer 1MB of data at a time during the file read. Records of length 1024 are read from this buffer.
filename foo hadoop 'file1.dat' cfg='U:/hadoopcfg.xml'
   user='user' pass='apass' recfm=s
   lrecl=1024 bufferlen=1000000;

data _null_;
   infile foo truncover;
   input a $1024.;
   put a;
run;

Example 4: Using the CONCAT Option

This example uses the CONCAT option to read all members of DIRECTORY1 as if they are one file.
filename foo hadoop '/directory1/' cfg='U:/hadoopcfg.xml'
   user='user' pass='apass' recfm=s lrecl=1024 concat;

data _null_;
   infile foo truncover;
   input a $1024.;
   put a;
run;
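
None of the examples above uses the ENCODING= option. As a minimal sketch of the transcoding behavior described for that option, the following reads a text file that is assumed to be stored in UTF-8, so that SAS transcodes each record to the session encoding on input. The file name notes.txt, the fileref name utf, the configuration path, and the credentials are hypothetical placeholders.

```sas
/* ENCODING='utf-8' tells SAS that the external file is in  */
/* UTF-8; on input, data is transcoded from UTF-8 to the    */
/* session encoding.                                        */
filename utf hadoop '/user/xxxx/notes.txt' cfg='/path/cfg.xml'
   user='xxxx' pass='xxxx' encoding='utf-8';

data _null_;
   infile utf truncover;
   input line $256.;
   put line;
run;
```

If the file is already in the session encoding, ENCODING= can be omitted, because SAS assumes the session encoding by default.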