Using Scoring Functions to Run Scoring Models

How to Run a Scoring Model Using Scoring Functions

The %INDGP_PUBLISH_MODEL macro creates the files that are needed to build the scoring functions. The macro then publishes the scoring functions with those files to a specified database in Greenplum. Only the EM_ output variables are published as Greenplum scoring functions. For more information about the EM_ output variables, see Fixed Variable Names.
To run the scoring model using scoring functions, follow these steps.
  1. Run the %INDGP_PUBLISH_MODEL macro. The %INDGP_PUBLISH_MODEL macro uses some of the files that are created by the SAS Enterprise Miner Score Code Export node: the scoring model program (score.sas file), the properties file (score.xml file), and (if the training data includes SAS user-defined formats) a format catalog.
    The %INDGP_PUBLISH_MODEL macro performs the following tasks:
    • takes the score.sas and score.xml files and produces the set of .c and .h files. These .c and .h files are necessary to build separate scoring functions for each of a fixed set of quantities that can be computed by the scoring model code.
    • processes the format catalog, if a format catalog is available, and creates an .h file with C structures, which are also necessary to build the scoring functions.
    • produces a script of the Greenplum commands that are used to register the scoring functions in the Greenplum database.
    • transfers the .c and .h files to Greenplum.
    • calls the SAS_COMPILEUDF function to compile the source files into object files and links to the SAS formats library.
    • calls the SAS_COPYUDF function to copy the new object files to full-path-to-pkglibdir/SAS on the whole database array (master and all segments), where full-path-to-pkglibdir is the path that was defined during installation.
    • uses the SAS/ACCESS Interface to Greenplum to run the script to create the scoring functions with the object files.
    The scoring functions are registered in Greenplum with shared object files. These shared object files are loaded at run time. These functions are stored in a permanent location. The SAS object files and the SAS formats library are stored in the full-path-to-pkglibdir/SAS directory on all nodes, where full-path-to-pkglibdir is the path that was defined during installation.
    Greenplum caches the object files within a session.
    Note: You can publish scoring model files with the same model name in multiple databases and schemas. Because all model object files for the SAS scoring function are stored in the full-path-to-pkglibdir/SAS directory, the publishing macros use the database, schema, and model name as the object filename to avoid potential naming conflicts.
  2. Use the scoring functions in any SQL query.

Scoring Function Names

The names of the scoring functions that are built in Greenplum have the following format:
modelname_EM_outputvarname
modelname is the name that was specified in the MODELNAME argument of the %INDGP_PUBLISH_MODEL macro. modelname is always followed by _EM_ in the scoring function name. For more information about the MODELNAME argument, see %INDGP_PUBLISH_MODEL Macro Syntax.
outputvarname is derived from the names of the EM_ output variables in the score.xml file that is generated from the SAS Enterprise Miner Score Code Export node. For more information about the score.xml file, see Fixed Variable Names.
One scoring function is created for each EM_ output variable in the score.xml file. For example, if the scoring model DATA step program takes ten inputs and creates three new variables, then three scoring functions are defined. Each scoring function has the name of an output variable. For example, if you set MODELNAME=credit in the %INDGP_PUBLISH_MODEL macro and the EM_ output variables are “EM_PREDICTION”, “EM_PROBABILITY”, and “EM_DECISION”, then the name of the scoring functions that are created would be “credit_EM_PREDICTION”, “credit_EM_PROBABILITY”, and “credit_EM_DECISION”.
Note: A scoring function name cannot exceed 63 characters.
CAUTION:
When the scoring function is generated, the names are case insensitive.
Consequently, if you have model names “Model01” and “model01”, and you create two scoring functions, the second scoring function overwrites the first scoring function.

Viewing the Scoring Functions

The scoring functions are available to use in any SQL expression in the same way that Greenplum built-in functions are used. For an example, see Using Scoring Functions to Run a Scoring Model.
Tip
In Greenplum, character variables have a length of 32K. If you create an output table or data set to hold the scored rows, it is recommended that you create the table and define the variables. Here is an example.
proc sql noerrorstop;
connect to greenplm (<connection options>);
execute (create table scoretab (
   ID                    integer 
   , EM_SEGMENT            float
   , EM_EVENTPROBABILITY   float
   , EM_PROBABILITY        float
   , EM_CLASSIFICATION     varchar (32)
   )
   distributed by (id)
) by greenplm;
execute ( insert into scoretab 
select id,
function prefix_EM_SEGMENT (
   comma-delimited input column list
   ) as "EM_ SEGMENT",
function prefix_EM_EVENTPROBABILITY (
   comma-delimited input column list
   ) as "EM_EVENTPROBABILITY",
function prefix_EM_PROBABILITY (
   comma-delimited input column list
   ) as "EM_PROBABILITY"
cast(function prefix_EM_CLASSIFICATION (
   comma-delimited input column list
   ) as varchar(32)) as "EM_CLASSIFICATION",
from scoring_input_table  
   order by id
) by greenplm;
quit;
There are four ways to see the scoring functions that are created:
  • From Greenplum, start psql to connect to the database and submit an SQL statement. In this example, 'SCHEMA' is the actual schema value.
    psql -h hostname -d databasename -U userid
    select proname
       from pg_catalog.pg_proc f, pg_catalog.pg_namespace s
       where f.pronamespace=s.oid and upper(s.nspname)='SCHEMA';
  • From SAS, use SQL procedure code that produces output in the LST file. The following example assumes that the model name that you used to create the scoring functions is mymodel.
    proc sql noerrorstop;
      connect to greenplm (user=username pw=password dsn= dsnname);
      
    select *
       from connection to greenplm
          (select proname
           from pg_catalog.pg_proc f, pg_catalog.pg_namespace s
           where f.pronamespace=s.oid and upper(s.nspname)='SCHEMA');
       disconnect from greenplm;
    quit;
    
  • Look at the SampleSQL.txt file that is produced when the %INDGP_PUBLISH_MODEL macro is successfully run. This file can be found in the output directory (OUTDIR argument) that you specify in the macro.
    The SampleSQL.txt file contains basic SQL code that can be used to run your score code inside Greenplum. Please note that you must modify the sample code before using it. Otherwise, the sample code returns an error.
    For example, the SampleSQL.txt file refers to an ID column in allmush1_intab that is populated with a unique integer from 1 to n. n is the number of rows in the table.
    Note: The ID column uniquely identifies each row. You would replace the ID column with your own primary key column.
    Note: The function and table names must be fully qualified if the function and table are not in the same schema.
    The following example assumes that the model name that you used to create the scoring functions is allmush1.
    drop table allmush1_outtab;
    create table allmush1_outtab(
     id integer
    ,"EM_CLASSIFICATION" varchar(33)
    ,"EM_EVENTPROBABILITY" float
    ,"EM_PROBABILITY" float
    );
    insert into allmush1_outtab(
     id
    ,"EM_CLASSIFICATION"
    ,"EM_EVENTPROBABILITY"
    ,"EM_PROBABILITY"
    )
    select id,
     allmush1_em_classification("BRUISES"
    ,"CAPCOLOR"
    ,"GILLCOLO"
    ,"GILLSIZE"
    ,"HABITAT"
    ,"ODOR"
    ,"POPULAT"
    ,"RINGNUMB"
    ,"RINGTYPE"
    ,"SPOREPC"
    ,"STALKCBR"
    ,"STALKROO"
    ,"STALKSAR"
    ,"STALKSHA"
    ,"VEILCOLO")
      as "EM_CLASSIFICATION",
     allmush1_em_eventprobability("BRUISES"
    ,"CAPCOLOR"
    ,"GILLCOLO"
    ,"GILLSIZE"
    ,"HABITAT"
    ,"ODOR"
    ,"POPULAT"
    ,"RINGNUMB"
    ,"RINGTYPE"
    ,"SPOREPC"
    ,"STALKCBR"
    ,"STALKROO"
    ,"STALKSAR"
    ,"STALKSHA"
    ,"VEILCOLO")
      as "EM_EVENTPROBABILITY",
     allmush1_em_probability("BRUISES"
    ,"CAPCOLOR"
    ,"GILLCOLO"
    ,"GILLSIZE"
    ,"HABITAT"
    ,"ODOR"
    ,"POPULAT"
    ,"RINGNUMB"
    ,"RINGTYPE"
    ,"SPOREPC"
    ,"STALKCBR"
    ,"STALKROO"
    ,"STALKSAR"
    ,"STALKSHA"
    ,"VEILCOLO")
      as "EM_PROBABILITY"
    from allmush1_intab ;
    
  • You can look at the SAS log that is created when the %INDGP_PUBLISH_MODEL macro was run. A message that indicates whether a scoring function is successfully or not successfully executed is printed to the SAS log.

Using Scoring Functions to Run a Scoring Model

The scoring functions are available to use in any SQL expression in the same way that Greenplum built-in functions are used.
The following example code creates the scoring functions.
%let indconn = user=user1 password=open1 dsn=green6 schema=myschema;
%indgp_publish_model(dir=C:\SASIN\baseball1, modelname=baseball1, outdir=C:\test);
The %INDGP_PUBLISH_MODEL macro produces a text file of Greenplum CREATE FUNCTION commands as shown in the following example.
Note: This example file is shown for illustrative purposes. The text file that is created by the %INDGP_PUBLISH_MODEL macro cannot be viewed and is deleted after the macro is complete.
CREATE FUNCTION baseball1_EM_eventprobablility
(
"CR_ATBAT" float,
"CR_BB" float,
"CR_HITS" float,
"CR_HOME" float,
"CR_RBI" float,
"CR_RUNS" float,
"DIVISION" varchar(31),
"LEAGUE" varchar(31),
"NO_ASSTS" float,
"NO_ATBAT" float,
"NO_BB" float,
"NO_ERROR" float,
"NO_HITS" float,
"NO_HOME" float,
"NO_OUTS" float,
"NO_RBI" float,
"NO_RUNS" float,
"YR_MAJOR" float
)
RETURNS varchar(33)
AS '/usr/local/greenplum-db-3.3.4.0/lib/postgresql/SAS/
   sample_dbitest_homeeq_5.so', 'homeeq_5_em_classification'
After the scoring functions are installed, they can be invoked in Greenplum using SQL, as illustrated in the following example. Each output value is created as a separate function call in the select list.
select baseball1_EM_eventprobability
( 
"CR_ATBAT",
"CR_BB",
"CR_HITS",
"CR_HOME",
"CR_RBI",
"CR_RUNS",
"DIVISION",
"LEAGUE",
"NO_ASSTS",
"NO_ATBAT",
"NO_BB",
"NO_ERROR",
"NO_HITS",
"NO_HOME",
"NO_OUTS"
) as homeRunProb from MLBGP;