Running Scoring Models in Hadoop

To run a scoring model in Hadoop, follow these steps:
  1. Create a traditional scoring model by using SAS Enterprise Miner, or an analytic store scoring model by using the SAS Factory Miner HPFOREST or HPSVM component.
  2. Start SAS.
  3. [Optional] Create a metadata file for the input data file.
    The metadata file has the extension .sashdmd and must be stored in the HDFS. Use PROC HDMD to generate the metadata file.
    Note: You do not have to create a metadata file for the input data file if the data file is created with a Hadoop LIBNAME statement that contains the HDFS_DATADIR= and HDFS_METADIR= options. In this instance, metadata files are generated automatically.
    Note: SAS/ACCESS requires Hadoop data to be in the Hadoop standard UTF-8 format. If you are using DBCS encoding, you must take the character length value in the engine-generated SASHDMD metadata file and multiply it by the number of bytes in a single character to produce the correct byte length for the record.
    For more information, see Creating a Metadata File for the Input Data File and PROC HDMD in SAS/ACCESS for Relational Databases: Reference.
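    For example, a metadata file for a comma-delimited input file might be generated as follows. This is a sketch only: the server name, user ID, directory paths, file name, and column definitions are placeholders for your own values.

        libname hdplib hadoop server=hdpsrv user=myuserid
           hdfs_datadir='/user/myuserid/data'
           hdfs_metadir='/user/myuserid/meta';

        proc hdmd name=hdplib.scorein      /* writes scorein.sashdmd to HDFS_METADIR= */
           format=delimited sep=','
           data_file='scorein.csv';        /* data file in HDFS_DATADIR=              */
           column id     int;
           column amount double;
           column region char(20);
        run;

    Because the LIBNAME statement specifies both HDFS_DATADIR= and HDFS_METADIR=, PROC HDMD can store the generated .sashdmd file alongside the metadata for other tables in that library.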
  4. Specify the Hadoop connection attributes.
    %let indconn= user=myuserid;
    For more information, see INDCONN Macro Variable.
  5. Run the %INDHD_PUBLISH_MODEL macro.
    With traditional model scoring, the %INDHD_PUBLISH_MODEL macro uses some of the files that are created by the SAS Enterprise Miner Score Code Export node: the scoring model program (score.sas file), the properties file (score.xml file), and (if the training data includes SAS user-defined formats) a format catalog. The macro performs the following tasks:
    • translates the scoring model into the sasscore_modelname.ds2 file that is used to run scoring inside the SAS Embedded Process.
    • takes the format catalog, if available, and produces the sasscore_modelname_ufmt.xml file. This file contains user-defined formats for the scoring model that is being published.
    • uses SAS/ACCESS Interface to Hadoop to copy the sasscore_modelname.ds2 and sasscore_modelname_ufmt.xml scoring files to the HDFS.
    With analytic store scoring, the %INDHD_PUBLISH_MODEL macro takes the files that are created by the SAS Factory Miner HPFOREST or HPSVM component and copies them to the HDFS. These files are the DS2 scoring model program (score.sas file), the analytic store file (score.sasast file), and (if the training data includes SAS user-defined formats) a format catalog.
    For more information, see %INDHD_PUBLISH_MODEL Macro Syntax.
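    For example, a traditional scoring model might be published as follows. The directories and model name shown are placeholders, and the argument list is abbreviated; consult %INDHD_PUBLISH_MODEL Macro Syntax for the complete and authoritative set of arguments.

        %indhd_publish_model(
           dir=C:\myscorefiles,             /* local directory containing score.sas and score.xml */
           datastep=score.sas,              /* scoring model program                              */
           xml=score.xml,                   /* properties file                                    */
           modeldir=/user/myuserid/models,  /* target HDFS directory for the scoring files        */
           modelname=salesmodel,
           action=replace);

    After this call, the sasscore_salesmodel.ds2 file (and, if a format catalog was supplied, the sasscore_salesmodel_ufmt.xml file) resides in the specified HDFS model directory.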
  6. Run the %INDHD_RUN_MODEL macro.
    The %INDHD_PUBLISH_MODEL macro publishes the model to Hadoop, making the model available to run against data that is stored in the HDFS.
    The %INDHD_RUN_MODEL macro starts a MapReduce job that uses the files generated by the %INDHD_PUBLISH_MODEL macro to execute the DS2 program. The MapReduce job stores the DS2 program output in the HDFS location that is specified by either the OUTPUTDATADIR= argument or the <outputDir> element in the HDMD file.
    For more information, see %INDHD_RUN_MODEL Syntax.
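    A run might look like the following sketch. The input, metadata, and output locations are placeholders, and the argument names shown are typical rather than definitive; see %INDHD_RUN_MODEL Syntax for the exact arguments.

        %indhd_run_model(
           inputdatadir=/user/myuserid/data,                 /* HDFS directory of the input data       */
           inputmetaname=/user/myuserid/meta/scorein.sashdmd,/* metadata file for the input data       */
           scorepgm=/user/myuserid/models/salesmodel/sasscore_salesmodel.ds2, /* published DS2 program */
           outputdatadir=/user/myuserid/output,              /* where the MapReduce job writes output  */
           forceoverwrite=true);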
  7. Submit an SQL query against the output file.
    For more information, see Scoring Output.
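    For example, assuming the scored output has been registered through a Hadoop libref, it can be queried with PROC SQL. The libref, server, output table, and score variable names below are placeholders; EM_CLASSIFICATION is shown as a typical SAS Enterprise Miner score output variable, but your model's output variables may differ.

        libname hdout hadoop server=hdpsrv user=myuserid
           hdfs_datadir='/user/myuserid/output'
           hdfs_metadir='/user/myuserid/meta';

        proc sql;
           select em_classification, count(*) as n
           from hdout.scoreout
           group by em_classification;
        quit;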