Using the SAS Embedded Process to Run Scoring Models

How to Run a Scoring Model with the SAS Embedded Process

The integration of the SAS Embedded Process and Greenplum allows scoring code to run directly using the SAS Embedded Process on Greenplum.
Note: The SAS Embedded Process might require a later release of Greenplum than function-based scoring. For more information, see the SAS Foundation system requirements documentation for your operating environment.
To run the scoring model using the SAS Embedded Process, follow these steps.
  1. Create a table to hold the scoring files.
    The %INDGP_CREATE_MODELTABLE macro creates a table that holds the scoring files for the model that is being published.
    For more information, see Creating a Model Table.
  2. Run the %INDGP_PUBLISH_MODEL to create the scoring files.
    The %INDGP_PUBLISH_MODEL macro uses some of the files that are created by the SAS Enterprise Miner Score Code Export node: the scoring model program (score.sas file), the properties file (score.xml file), and (if the training data includes SAS user-defined formats) a format catalog.
    The %INDGP_PUBLISH_MODEL macro performs the following tasks:
    • translates the scoring model into the sasscore_modelname.ds2 file that is used to run scoring inside the SAS Embedded Process.
    • takes the format catalog, if available, and produces the sasscore_modelname_ufmt.xml file. This file contains user-defined formats for the scoring model that is being published.
    • uses the SAS/ACCESS Interface to Greenplum to insert the sasscore_modelname.ds2 and sasscore_modelname_ufmt.xml scoring files into the model table that was created using the %INDGP_CREATE_MODELTABLE macro.
  3. Use the SAS_EP function in the FROM clause in any SQL expression to run the scoring model.
    For more information, see SAS_EP Function.

Creating a Model Table

Overview

When using the SAS Embedded Process to publish a scoring model in Greenplum, you must create a table to hold the sasscore_modelname.ds2 and sasscore_modelname_ufmt.xml scoring files. You must run the %INDGP_CREATE_MODELTABLE macro to create the table before you run the %INDGP_PUBLISH_MODEL macro.
The model table contains the following columns. The ModelName column is the table key. The table is referenced by the two-level name schema-name.model-table-name.
Column Name
Description
Specification
ModelName
contains the name of the model
VARCHAR(128) NOT NULL PRIMARY KEY
ModelDS2
contains the sasscore_modelname.ds2 file
BYTEA NOT NULL
ModelFormats
contains the sasscore_modelname_ufmt.xml file
BYTEA
ModelMetadata
Reserved by SAS for future use
BYTEA
ModelUUID1
contains the UUID of the source model
VARCHAR (36)
Notes1
contains additional information that describes the source model
VARCHAR (512)
1This column is for use by SAS Model Manager. If you have a model table that was created prior to SAS 9.4 and you want this column, you must run the %INDGP_CREATE_MODELTABLE macro to re-create your model table.

%INDGP_CREATE_MODELTABLE Run Process

To run the %INDGP_CREATE_MODELTABLE macro, complete the following steps:
  1. Start SAS and submit the following command in the Program Editor or Enhanced Editor:
    %let indconn = user=youruserid password=yourpwd
        dsn=yourdsn schema=yourschema;
    
    For more information, see the INDCONN Macro Variable.
  2. Run the %INDGP_CREATE_MODELTABLE macro.
    For more information, see %INDGP_CREATE_MODELTABLE Macro Syntax.

%INDGP_CREATE_MODELTABLE Macro Syntax

Arguments

DATABASE=database-name

specifies the name of a Greenplum database where the sasscore_modelname.ds2 and sasscore_modelname_ufmt.xml scoring files are held.

Default The database specified in the INDCONN macro variable or your current database

MODELTABLE=model-table-name

specifies the name of the table that holds the sasscore_modelname.ds2 and sasscore_modelname_ufmt.xml scoring files.

Default sas_model_table
Requirements The maximum table name length is 63 characters and it must be a valid Greenplum table name.
The table name that you specify for this macro must be the same table name that is used in the %INDGP_PUBLISH_MODEL macro.
See %INDGP_PUBLISH_MODEL Macro Syntax

ACTION = CREATE | REPLACE | DROP

specifies one of the following actions that the macro performs:

CREATE

creates a new table.

Tip If the table has been previously defined and you specify ACTION=CREATE, an error is issued.

REPLACE

overwrites the current table, if a table with the same name is already registered.

Tip If you specify ACTION = REPLACE, and the current table contains sasscore_modelname.ds2 and sasscore_modelname_ufmt.xml files, the files are deleted and an empty table is re-created.

DROP

causes all models in this table to be dropped.

Default CREATE

SAS_EP Function

Overview of the SAS_EP Function

The SAS_EP function is the interface for running the scoring model inside Greenplum with the SAS Embedded Process. The SAS_EP function uses the information that is stored in the model table. The SAS_EP function is a built-in Greenplum function.

Using the SAS_EP Function

You can use the SAS_EP function using explicit pass-through and PROC SQL or you can use other Greenplum query tools such as psql. Use the SAS_EP function in the FROM clause in any SQL expression to run the scoring model.
Tip
Look at the SampleSQL.txt file that is produced when the %INDGP_PUBLISH_MODEL macro is successfully run. This file can be found in the output directory (OUTDIR argument) that you specify in the %INDGP_PUBLISH_MODEL macro. The SampleSQL.txt file contains basic SQL code that can be used to run your score code inside Greenplum. Please note that you must modify the sample code before using it. Otherwise, the sample code returns an error.
Note: Before using the SAS_EP function with the SAS Embedded Process, you must create the model table with the %INDGP_CREATE_MODELTABLE macro. Then, you must publish the files to the model table with the %INDGP_PUBLISH_MODEL macro. For more information, see Creating a Model Table and Running the %INDGP_PUBLISH_MODEL Macro.
Here is an example using PROC SQL.
%let indconn = user=user1 password=open1 dsn=dsn6 schema=GPschema;

%indgp_publish_model
  (dir= C:\models,
   modelname= almush02,
   action=create,
   mechanism=ep,
   outdir=C:\test
   );

proc sql noerrorstop;
connect to greenplm (user=user1 password=open1 dsn=dsn6 schema=GPschema);
create table test.dbscore as select * from connection to greenplm 
   (select id,
    "EM_CLASSIFICATION" ,
    "EM_EVENTPROBABILITY" ,
     "EM_PROBABILITY" from public.SAS_EP(TABLE(select id 
     ,"capcolor"
     ,"capsurf"
     ,"odor"
     ,"ringnumb"
     ,"sporepc"
     ,"stalkcbr"
     ,"stalksbr"
    from model.almush02),
      'select modelds2, modelformats from model.sas_model_table
         where upper(modelname)=''ALMUSH02''
      ');
    ) ;
quit;

SAS_EP Function Syntax

The basic syntax of the SAS_EP table function is as follows:
<SAS_EP-schema.>SAS_EP(TABLE (SELECT
* | column <, ... column-n>
, FROM <input-table-schema.>input-table-name
<SCATTER BY column<, ... column-n> | SCATTER RANDOMLY>
<ORDER BY column <, ... column-n>>),
'SELECT MODELDS2<, MODELFORMATS> FROM <schema.>model-table-name
WHERE MODELNAME =' 'model-name' '
');
Arguments

SAS_EP-schema

specifies the name of the schema where the SAS_EP function was created.

Note The SAS_EP function is created in the database by the %INDGP_PUBLISH_COMPILEUDF_EP macro. For more information, see SAS In-Database Products: Administrator’s Guide.

column

specifies the name of the column or columns that are read from the input table and passed to the SAS_EP function.

input-table-schema

specifies the name of the schema where the input table exists.

input-table-name

specifies the input table that is used by the SAS_EP function.

SCATTER BY column<, ...column-n > | SCATTER RANDOMLY

specifies how the input data is distributed.

ORDER BY column<, ...column-n >

specifies how the input data is sorted within its distribution.

model-table-name

specifies the name of the model table where the sasscore_modelname.ds2 and sasscore_modelname_ufmt.xml scoring files were published with the %INDGP_CREATE_MODELTABLE macro.

Requirement The table name that you specify for this function must be the same table name that is used in the %INDGP_CREATE_MODELTABLE macro. For more information, see %INDGP_CREATE_MODELTABLE Macro Syntax.

model-name

specifies the name of the model.

Greenplum Scoring Files

When using the SAS Embedded Process, the %INDGP_PUBLISH_MODEL macro produces two scoring files for each model:
  • sasscore_modelname.ds2. This file contains code that is executed by the SAS_EP function.
  • sasscore_modelname_ufmt.xml. This file contains user-defined formats for the scoring model that is being published. This file is used by the SAS_EP function.
These files are published to the model table that you specify in the %INDGP_PUBLISH_MODEL macro. See Scoring File Examples for an example of each of these files.
A message that indicates whether the scoring files are successfully or not successfully created is printed to the SAS log.
Although you cannot view the scoring files directly, there are two ways to see the models whose files are created:
  • Run this query from psql:
    select modelname from <schema.>sas-model-table;
  • Run a PROC SQL query from SAS.
    proc sql;
     connect to greenplm (user=userid password=xxxx dsn=mydsn schema=myschema);
      select * from connection to greenplm
     (select modelname from schema.sas_model_table);
     disconnect from greenplm; 
    quit;

Starting and Stopping the SAS Embedded Process

The SAS Embedded Process starts when a query is submitted using the SAS_EP function. It continues to run until it is manually stopped or the database is shut down.
Manually starting and stopping the SAS Embedded Process has implications for all scoring model publishers, requires superuser permissions, and must be run from the Greenplum master node. It should not be done without consulting your database administrator. For more information, see Controlling the SAS Embedded Process in SAS In-Database Products: Administrator’s Guide.

SAS Embedded Process Troubleshooting Tips

If you have problems running scoring models with the SAS Embedded Process, these are the most likely areas where a problem could occur:
  • Greenplum Partner Connector (GPPC) version 1.2 must be installed. You can verify that GPPC is installed by running this command.
    ls $GPHOME/lib/*gppc*
  • When you use the SAS_EP function in an SQL query, the schema name is either SASLIB or a schema that was specified when the SAS_EP function was registered. SASLIB is the default schema name for the INDCONN macro variable when the %INDGP_PUBLISH_COMPILEUDF_EP macro is run to create the SAS_EP function. For more information, see Running the %INDGP_PUBLISH_COMPILEUDF_EP Macro in SAS In-Database Products: Administrator’s Guide.
  • When you refer to the model table in an SQL query, the schema name is either the user ID or a schema that was specified when the model table was created. The user ID is the default schema name for the INDCONN macro variable when the %INDGP_PUBLISH_MODELTABLE macro is run to create the model table. For more information, see Creating a Model Table.
  • When you use the SAS_EP function, you must specify the schema where the SAS_EP function was registered.
  • $GPHOME can be referenced by a symbolic link or the explicit path. When you update the Greenplum version, it is safer to always use the explicit path. Here is an example.
    /usr/local/greenplum-db -> /usr/local/greenplum-db-4.2.2.0