Running Scoring Models in Hadoop

To run a scoring model in Hadoop, follow these steps:
  1. Create a scoring model using SAS Enterprise Miner.
  2. Start SAS.
  3. [Optional] Create a metadata file for the input data file.
    The metadata file has the extension .sashdmd and must be stored in the HDFS. Use PROC HDMD to generate the metadata file.
    Note: You do not have to create a metadata file for the input data file if the data file is created with a Hadoop LIBNAME statement that contains the HDFS_DATADIR= and HDFS_METADIR= options. In this instance, metadata files are automatically generated.
    Note: SAS/ACCESS requires Hadoop data to be in Hadoop standard UTF-8 format. If you are using DBCS encoding, multiply the character length in the engine-generated SASHDMD metadata file by the number of bytes in a single character to obtain the correct byte length for the record. For example, a column stored as 20 characters in a 3-byte-per-character encoding requires a byte length of 60.
    For more information, see Creating a Metadata File for the Input Data File and PROC HDMD in SAS/ACCESS for Relational Databases: Reference.
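    For example, a Hadoop LIBNAME statement and PROC HDMD step similar to the following could be used to register a delimited input file. The server name, user ID, HDFS paths, file name, option names, and columns shown here are placeholders only; verify them against PROC HDMD in SAS/ACCESS for Relational Databases: Reference.

    /* Example only: server, user ID, HDFS paths, file name, and columns are placeholders. */
    libname gridlib hadoop server="myhdfsserver" user=myuserid
       hdfs_datadir="/user/myuserid/data"
       hdfs_metadir="/user/myuserid/meta";

    /* Describe the delimited input file; the .sashdmd metadata file is written to the HDFS_METADIR= location. */
    proc hdmd name=gridlib.scoreinput
       file_format=delimited
       sep=','
       data_file='scoreinput.txt';
       column cust_id  int;
       column purchase double;
       column region   varchar(12);
    run;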
  4. Connect to the HDFS using this command:
    %let indconn=hdfs_server=myhdfsserver hdfs_port=8020 user=myuserid;
    For more information, see INDCONN Macro Variable.
  5. Run the %INDHD_PUBLISH_MODEL macro.
    The %INDHD_PUBLISH_MODEL macro uses some of the files that the SAS Enterprise Miner Score Code Export node creates:
    • the scoring model program (score.sas file)
    • the properties file (score.xml file)
    • a format catalog (if the training data includes SAS user-defined formats)
    The %INDHD_PUBLISH_MODEL macro translates the score.sas file into a DS2 program and, if needed, generates an XML file for the user-defined formats. Then all model files (the SAS program, the DS2 program, the score.xml file, and the XML file for user-defined formats) are copied to the HDFS.
    For more information, see %INDHD_PUBLISH_MODEL Syntax.
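    A call to the macro might look like the following. The directory, file names, HDFS model directory, and model name are placeholders, and the argument names should be verified against %INDHD_PUBLISH_MODEL Syntax; if the training data includes user-defined formats, the format catalog is also supplied there.

    /* Hypothetical invocation: all values are placeholders. */
    %indhd_publish_model(
       dir=C:\mymodels\mymodel,          /* local directory that contains score.sas and score.xml */
       datastep=score.sas,               /* scoring model program from the Score Code Export node */
       xml=score.xml,                    /* properties file from the Score Code Export node */
       modeldir=/user/myuserid/models,   /* HDFS directory that receives the translated model files */
       modelname=mymodel);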
  6. Connect to the MapReduce JobTracker using this command:
    %let indconn=hdfs_server=myhdfsserver hdfs_port=hdfsport 
       mapred_server=mapred-server-name mapred_port=mapred-port-number;
    For more information, see INDCONN Macro Variable.
  7. Run the %INDHD_RUN_MODEL macro.
    The %INDHD_PUBLISH_MODEL macro published the model to Hadoop in step 5, making the model available to run against data that is stored in the HDFS.
    The %INDHD_RUN_MODEL macro starts a MapReduce job that uses the files generated by the %INDHD_PUBLISH_MODEL macro to execute the DS2 program. The MapReduce job stores the DS2 program output in the HDFS location that is specified by either the OUTPUTDATADIR= argument or by the <outputDir> element in the HDMD file.
    For more information, see %INDHD_RUN_MODEL Syntax.
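    A call to the macro might look like the following. The HDFS paths are placeholders, and only OUTPUTDATADIR= is named in this section; verify the remaining argument names against %INDHD_RUN_MODEL Syntax before running.

    /* Hypothetical invocation: HDFS paths and argument names other than OUTPUTDATADIR= should be checked against the syntax reference. */
    %indhd_run_model(
       inputdatadir=/user/myuserid/data/scoreinput.txt,       /* input data file in the HDFS */
       inputmetaname=/user/myuserid/meta/scoreinput.sashdmd,  /* metadata file from step 3 */
       scorepgm=/user/myuserid/models/mymodel/score.ds2,      /* DS2 program published in step 5 */
       outputdatadir=/user/myuserid/output);                  /* HDFS location for the scoring output */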
  8. Submit an SQL query against the output file.
    For more information, see Scoring Output.
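    For example, if a Hadoop libref points at the output location (through its HDFS_DATADIR= and HDFS_METADIR= options), a PROC SQL query such as the following returns the scored rows. The libref, table name, and score column name are placeholders.

    /* Example only: libref, table, and score column are placeholders. */
    proc sql outobs=20;
       select *
       from gridlib.scoreoutput
       where em_probability > 0.5;
    quit;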