To run a scoring model
in Hadoop, follow these steps:
-
Create a scoring model
using SAS Enterprise Miner.
-
Export the scoring files
using the SAS Enterprise Miner Score Code Export node.
-
[Optional] Create a
metadata file for the input data file.
The metadata file has
the extension .sashdmd and must be stored in the HDFS. Use PROC HDMD
to generate the metadata file.
Note: You do not have to create
a metadata file for the input data file if the data file is created
with a Hadoop LIBNAME statement that contains the HDFS_DATADIR= and
HDFS_METADIR= options. In this instance, metadata files are automatically
generated.
Note: SAS/ACCESS requires Hadoop
data to be in Hadoop standard UTF-8 format. If you are using DBCS
encoding, you must take the character length value from the
engine-generated SASHDMD metadata file and multiply it by the number
of bytes per character to obtain the correct byte
length for the record.
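The byte-length adjustment described in the note is a single multiplication. The following DATA step is a minimal sketch of that arithmetic. The values 20 and 2 are assumed for illustration and do not come from an actual SASHDMD file, and the variable names are hypothetical.

```sas
/* Sketch only: both input values below are assumed for illustration. */
data _null_;
   char_length    = 20;  /* character length as read from the SASHDMD file (assumed) */
   bytes_per_char = 2;   /* bytes per character in the DBCS encoding (assumed) */
   byte_length    = char_length * bytes_per_char;
   put byte_length=;     /* writes byte_length=40 to the SAS log */
run;
```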
-
Connect to the HDFS
by setting the INDCONN macro variable:
%let indconn=hdfs_server=myhdfsserver hdfs_port=8020 user=myuserid;
-
Run the
%INDHD_PUBLISH_MODEL macro.
The
%INDHD_PUBLISH_MODEL macro uses some of the files that the SAS Enterprise
Miner Score Code Export node creates:
-
the scoring model program (score.sas
file)
-
the properties file (score.xml
file)
-
a format catalog (if the training
data includes SAS user-defined formats)
The %INDHD_PUBLISH_MODEL macro translates the score.sas file into a DS2 program
and, if needed, generates an XML file for the user-defined formats.
Then all model files (the SAS program, the DS2 program, the score.xml
file, and the XML file for user-defined formats) are copied to the
HDFS.
-
Connect to the MapReduce
JobTracker by setting the INDCONN macro variable:
%let indconn=hdfs_server=myhdfsserver hdfs_port=hdfsport
mapred_server=mapred-server-name mapred_port=mapred-port-number;
-
Run the
%INDHD_RUN_MODEL macro.
The %INDHD_PUBLISH_MODEL macro publishes the model to Hadoop, making the model
available to run against data that is stored in the HDFS.
The %INDHD_RUN_MODEL macro starts a MapReduce job that uses the files generated
by the %INDHD_PUBLISH_MODEL macro
to execute the DS2 program. The MapReduce job stores the DS2 program
output in the HDFS location that is specified either by the OUTPUTDATADIR=
argument or by the <outputDir> element in the HDMD file.
-
Submit an SQL query
against the output file.
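As one possible way to carry out this last step, the sketch below assigns a Hadoop libref and queries the scored output with PROC SQL. The libref hdp, the HDFS directory paths, and the table name scored_output are hypothetical placeholders. The server and user ID repeat the values from the HDFS connection example, and your site might require additional LIBNAME options.

```sas
/* Sketch only: the libref, paths, and table name are hypothetical placeholders. */
libname hdp hadoop server="myhdfsserver" user=myuserid
        hdfs_datadir="/user/myuserid/score_out"    /* assumed output data directory */
        hdfs_metadir="/user/myuserid/score_meta";  /* assumed metadata directory */

/* Query the scored output through the libref. */
proc sql;
   select *
   from hdp.scored_output;
quit;
```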