When the &EM_ACTION macro variable is set to TRAIN, the reg_train.source entry is executed. This extension node simply executes the REG procedure. The node has two data requirements:
- There must be a training data set imported by the node. If not, an exception is thrown indicating that the user must specify a training data set.
  Note: In this example, the exception string has been set to an encoding string that is recognized by the SAS Enterprise Miner client.
- There must be an interval target variable. If not, an exception is thrown indicating that the user must specify an interval target variable.
The %EM_GETNAME macro is called to initialize the &EM_USER_OUTEST and &EM_USER_EFFECTS macro variables. These macro variables resolve to the names of the data sets that store the parameter estimates.
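%macro train;
   /* Echo the generated SAS code to the log when debugging is requested */
   %if %sysfunc(index(&EM_DEBUG, SOURCE))>0 or
       %sysfunc(index(&EM_DEBUG, ALL))>0 %then %do;
      options mprint;
   %end;
   /* Requirement 1: a training data set (or view) must be imported */
   %if (^%sysfunc(exist(&EM_IMPORT_DATA)) and
        ^%sysfunc(exist(&EM_IMPORT_DATA, VIEW)))
       or "&EM_IMPORT_DATA" eq "" %then %do;
      %let EMEXCEPTIONSTRING = exception.server.IMPORT.NOTRAIN,1;
      %goto doenda;
   %end;
   /* Requirement 2: an interval target variable must be defined */
   %if (%EM_INTERVAL_TARGET eq ) %then %do;
      %let EMEXCEPTIONSTRING = exception.server.METADATA.USE1INTERVALTARGET;
      %goto doenda;
   %end;
   /* Initialize &EM_USER_OUTEST and &EM_USER_EFFECTS */
   %em_getname(key=OUTEST, TYPE=DATA);
   %em_getname(key=EFFECTS, type=DATA);
   /* Fit the model, then write the scoring code */
   %procreg;
   %makeScoreCode;
   /* Request assessment, fit statistics, and residuals */
   %em_model(TARGET=&targetvar,
             ASSESS=Y,
             DECSCORECODE=Y,
             FITSTATISTICS=Y,
             CLASSIFICATION=N,
             RESIDUALS=Y);
   /* Register a bar chart of the parameter estimates */
   %em_report(key=EFFECTS,
              viewtype=BAR,
              TIPTEXT=VARIABLE,
              X=VARIABLE,
              Freq=TVALUE,
              Autodisplay=Y,
              description=%nrbquote(Effects Plot),
              block=MODEL);
%doenda:
%mend train;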
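The two requirement checks in %train follow a guard pattern: test a condition, record the failure, and jump to a label at the end of the macro. The following is a minimal sketch of that pattern that can be run outside SAS Enterprise Miner. The %requireData macro and its log messages are hypothetical stand-ins for the node's EMEXCEPTIONSTRING handling, and sashelp.class is used only because it ships with SAS.

```sas
/* Hypothetical helper, not part of the extension node.              */
/* Log messages stand in for setting EMEXCEPTIONSTRING.              */
%macro requireData(ds);
   /* Test for an empty name first so that EXIST always gets an argument */
   %if %length(&ds) = 0 %then %do;
      %put WARNING: No training data set was specified.;
      %goto done;
   %end;
   /* Either a data set or a view qualifies as training data */
   %if not (%sysfunc(exist(&ds)) or %sysfunc(exist(&ds, VIEW))) %then %do;
      %put WARNING: &ds does not exist.;
      %goto done;
   %end;
   %put NOTE: &ds is available for training.;
%done:
%mend requireData;

%requireData(sashelp.class)
%requireData()
```

The first call writes a NOTE to the log. The second call, which passes no data set name, takes the first guard and writes a WARNING.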
In the %procreg macro, we fit a linear regression model using the REG procedure:
- The ODS OUTPUT statement creates the EFFECTS data set, which contains the parameter estimates.
- If the Method property (the &EM_PROPERTY_METHOD macro variable) is not NONE, its value is passed to the SELECTION= option of the MODEL statement.
- If the Details property is set to Yes (corresponds to the &EM_PROPERTY_DETAILS macro variable), then the DETAILS option of the MODEL statement is used. The listing that follows does not show this step.
- The model uses all interval inputs and rejected interval variables with the “Use” attribute set to “Yes”. Those variables are available through the %EM_INTERVAL_INPUT and %EM_INTERVAL_REJECTED macros.
- If a frequency variable is defined, the FREQ statement is used.
%macro procreg;
   %global targetVar;
   /* Use the first interval target as the dependent variable */
   %let targetVar = %scan(%EM_INTERVAL_TARGET, 1, );
   /* Capture the parameter estimates table in the EFFECTS data set */
   ods output parameterestimates= &EM_USER_EFFECTS;
   proc reg data=&EM_IMPORT_DATA OUTEST=&EM_USER_OUTEST;
      /* MODEL statement options must follow a slash */
      model &targetVar = %EM_INTERVAL_INPUT %EM_INTERVAL_REJECTED
      %if %upcase(&EM_PROPERTY_METHOD) ne NONE %then %do;
         / selection= &EM_PROPERTY_METHOD
      %end;
      ;
      %if %EM_FREQ ne %then %do;
         freq %EM_FREQ;
      %end;
   run;
   ods _all_ close;
   ods listing;
%mend procreg;
The EFFECTS data set has the following structure:
Model   Dependent  Variable   DF     Estimate     StdErr  tValue   Probt
MODEL1  amount     Intercept   1  -1130.54625  534.48857   -2.12  0.0347
MODEL1  amount     age         1     14.12780    5.53920    2.55  0.0109
MODEL1  amount     duration    1    136.22034    5.32411   25.59  <.0001
MODEL1  amount     employed    1   -108.10434   52.16738   -2.07  0.0385
MODEL1  amount     foreign     1    567.01572  323.58225    1.75  0.0800
MODEL1  amount     installp    1   -830.99671   54.44354  -15.26  <.0001
MODEL1  amount     job         1    570.83009  103.14025    5.53  <.0001
MODEL1  amount     property    1    263.71329   62.04117    4.25  <.0001
MODEL1  amount     savings     1     56.29680   38.38939    1.47  0.1428
MODEL1  amount     telephon    1    642.84575  135.33767    4.75  <.0001
You can easily generate
the scoring code using this data set.
The OUTEST data set contains the parameter estimates for the variables in the final model. It also identifies the variables that are excluded from the model: their estimates are missing, displayed as a period. It has the following structure:
_MODEL_  _TYPE_  _DEPVAR_  _RMSE_   Intercept  age
MODEL1   PARMS   amount    1892.16  -1130.55   14.1278
checking  coapp  depends
.         .      .
Note that the above output is a single row that has been separated onto multiple rows for display purposes only.
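To inspect both data sets outside SAS Enterprise Miner, a sketch along the following lines may help. It assumes that sashelp.class stands in for &EM_IMPORT_DATA and that work.effects and work.outest stand in for the &EM_USER_EFFECTS and &EM_USER_OUTEST data sets. The variables weight, height, and age come from sashelp.class, not from the example above.

```sas
/* Assumptions: sashelp.class replaces &EM_IMPORT_DATA;               */
/* work.effects and work.outest replace the &EM_USER_* data sets.     */
ods output ParameterEstimates=work.effects;
proc reg data=sashelp.class outest=work.outest;
   model weight = height age;
run;
quit;

/* EFFECTS: one row per model term */
proc print data=work.effects noobs;
run;

/* OUTEST: one row per model, with one column per regressor */
proc print data=work.outest noobs;
run;
```

The long layout of EFFECTS, with one row per term, is what makes it convenient for writing scoring code term by term.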
The %makeScoreCode macro
retrieves the name of the predicted variable using the decision metadata
data set. If only one target variable is defined, that data set corresponds
to the &EM_DEC_DECMETA macro variable. If multiple target variables
are defined, you can retrieve the decision metadata data set from
the &EM_TARGETDECINFO data set.
The %fillFile macro processes the EFFECTS data set, generates the scoring code, and saves it in the &EM_FILE_EMPUBLISHSCORECODE and &EM_FILE_EMFLOWSCORECODE files, which correspond to the Publish and Flow scoring code, respectively.
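%macro fillFile(type=, predVar=, file=);
   /* type= is accepted but not used in this example */
   filename tempf "&file";
   data _null_;
      file tempf;
      set &EM_USER_EFFECTS end=eof;
      /* First row: open the assignment and write the first term */
      if _N_=1 then do;
         put "&predVar = ";
         if Variable = 'Intercept' then
            put Estimate;
         else
            put Estimate '*' Variable;
      end;
      /* Remaining rows: add one term per parameter estimate */
      else do;
         put '+' Estimate '*' Variable;
      end;
      /* Last row: close the assignment statement */
      if eof then do;
         put ";";
      end;
   run;
   filename tempf;
%mend fillFile;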
%macro makeScoreCode;
   %let predvar=;
   /* With multiple targets, look up the decision metadata for this target */
   %if &em_dec_decmeta eq %then %do;
      %if %sysfunc(exist(&EM_TARGETDECINFO)) %then %do;
         data _null_;
            set &EM_TARGETDECINFO;
            where TARGET="&targetVar";
            call symput('em_dec_decmeta', DECMETA);
         run;
      %end;
   %end;
   /* Retrieve the name and label of the predicted variable */
   %if (&em_dec_decmeta ne ) and %sysfunc(exist(&em_dec_decmeta)) %then %do;
      data _null_;
         set &em_dec_decmeta;
         where _TYPE_ = 'PREDICTED';
         call symput('predVar', strip(VARIABLE));
         call symput('predLabel', strip(LABEL));
      run;
   %end;
   /* Without a predicted variable, no scoring code is written */
   %if &predVar eq %then %goto doendm;
   %fillFile(type=publish, predvar=&predVar, file=&EM_FILE_EMPUBLISHSCORECODE);
   %fillFile(type=flow, predvar=&predVar, file=&EM_FILE_EMFLOWSCORECODE);
%doendm:
%mend makeScoreCode;
The generated scoring
code has the following form:
P_amount =
-1130.54625
+14.12780 *age
+136.22034 *duration
+-108.10434 *employed
+567.01572 *foreign
+-830.99671 *installp
+570.83009 *job
+263.71329 *property
+56.29680 *savings
+642.84575 *telephon
;
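Scoring code of this form is a single assignment statement, so it can be applied by including it in a DATA step. The following sketch shows the round trip outside SAS Enterprise Miner and is not the node's own mechanism. It assumes sashelp.class as the training data, work.effects in place of &EM_USER_EFFECTS, a temporary fileref named score in place of the score code files, and a hypothetical predicted variable named P_weight.

```sas
/* Assumptions: sashelp.class, work.effects, the fileref score, and   */
/* P_weight are illustrative stand-ins, not names used by the node.   */
ods output ParameterEstimates=work.effects;
proc reg data=sashelp.class;
   model weight = height age;
run;
quit;

filename score temp;

/* Same logic as %fillFile: one term per row of the EFFECTS data set */
data _null_;
   file score;
   set work.effects end=eof;
   if _n_ = 1 then do;
      put "P_weight = ";
      if Variable = 'Intercept' then put Estimate;
      else put Estimate '*' Variable;
   end;
   else put '+' Estimate '*' Variable;
   if eof then put ';';
run;

/* Apply the generated assignment statement to each observation */
data work.scored;
   set sashelp.class;
   %include score;
run;

filename score clear;
```

Because PUT writes Estimate with the format attached to it in the EFFECTS data set, the scored values reflect the displayed precision of the estimates rather than their full stored precision.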
The %EM_MODEL macro
is used to generate additional scoring code and to produce assessment
reports.
%em_model(TARGET=&targetvar,
          ASSESS=Y,
          DECSCORECODE=Y,
          FITSTATISTICS=Y,
          CLASSIFICATION=N,
          RESIDUALS=Y);
- ASSESS=Y: generates assessment reports (Score Rankings and Score Distribution).
- DECSCORECODE=Y: appends score code that generates decision variables when a profit matrix is defined.
- FITSTATISTICS=Y: computes fit statistics associated with the model. They are computed for the training data set and, when applicable, for the validation and test data sets.
- CLASSIFICATION=N: does not generate the report and score code associated with the classification variables (I_).
- RESIDUALS=Y: appends the code that generates the residual variable (R_) to the flow score code and produces the residual report.
For example, the Flow
scoring code would now appear as follows:
P_amount =
-1130.54625
+14.12780 *age
+136.22034 *duration
+-108.10434 *employed
+567.01572 *foreign
+-830.99671 *installp
+570.83009 *job
+263.71329 *property
+56.29680 *savings
+642.84575 *telephon
;
*------------------------------------------------------------*;
*Computing Residual Vars: amount;
*------------------------------------------------------------*;
Label R_amount = 'Residual: amount';
R_amount = amount - P_amount;
The %EM_REPORT macro
generates a graph of the parameter estimates:
%em_report(key=EFFECTS,
           viewtype=BAR,
           TIPTEXT=VARIABLE,
           X=VARIABLE,
           Freq=TVALUE,
           Autodisplay=Y,
           description=%nrbquote(Effects Plot),
           block=MODEL);
- Key=EFFECTS: identifies the data set used to produce the chart.
- Viewtype=BAR: generates a bar chart.
- TIPTEXT=VARIABLE: uses the variable named VARIABLE to identify a bar when you click it.
- X=VARIABLE: draws one bar for each value of the variable named VARIABLE.
- FREQ=TVALUE: uses the variable TVALUE to control the height of each bar.
- AutoDisplay=Y: displays the report whenever the Results viewer of the node is opened.
- Description=%nrbquote(Effects Plot): specifies the text shown in the title bar of the report.
- Block=MODEL: places the report under the “Model” menu item.
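Outside SAS Enterprise Miner, a roughly comparable chart can be drawn with the SGPLOT procedure. This is an analog for checking the data, not what %EM_REPORT does internally. The sketch assumes sashelp.class as the training data and work.effects in place of the EFFECTS data set.

```sas
/* Assumptions: sashelp.class and work.effects are stand-ins;         */
/* PROC SGPLOT is used here in place of the %EM_REPORT viewer.        */
ods output ParameterEstimates=work.effects;
proc reg data=sashelp.class;
   model weight = height age;
run;
quit;

title "Effects Plot";
proc sgplot data=work.effects;
   /* One bar per model term, with bar height given by the t value */
   vbar Variable / response=tValue datalabel;
run;
title;
```

Here the VBAR category variable plays the role of X=VARIABLE, and RESPONSE=tValue plays the role of FREQ=TVALUE.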