Contents: Scoring methods and examples
Four ways to score (compute predicted values for) new observations using a previously fitted model are discussed below. Note that several conditions can make it impossible to score a new observation, resulting in a missing predicted value. These conditions are described in this note.
Beginning with SAS/STAT® 9.22 in SAS 9.2 TS2M3, many procedures provide a STORE statement to save the fitted model. You can then use the SCORE statement in PROC PLM to score a data set using the saved model. This is illustrated in the example titled "Scoring with PROC PLM" in the Examples section of the PLM documentation and at the end of Example 1 below. For more on the STORE statement, see "STORE statement" in the Shared Concepts and Topics chapter of the SAS/STAT User's Guide.
Some procedures include features that make scoring new observations easier:
For ordinary regression models fit using PROC REG, you can use PROC SCORE to compute predicted values for new observations. See the example titled "Regression Parameter Estimates" in the SCORE documentation. It is not necessary to refit the model. However, PROC SCORE does not directly provide scoring for other types of models such as logistic or other generalized linear models. It also does not provide standard error estimates or confidence limits.
For a logistic or probit model, the scoring process is greatly simplified in PROC LOGISTIC. Its SCORE statement enables you to score a data set of new observations. The FITSTAT and OUTROC= options in the SCORE statement enable you to evaluate the model applied to the new data set. The FITSTAT option provides fit statistics such as the area under the ROC curve (AUC) and R-square (beginning in SAS 9.3). The OUTROC= option produces a data set for plotting the ROC curve. An ROC plot and analysis for validation data can be obtained as described in this note. As with ordinary regression models, refitting the model is not necessary if the model is saved using the OUTMODEL= option and then retrieved during scoring by the INMODEL= option. The example titled "Scoring Data Sets with the SCORE Statement" in the LOGISTIC documentation illustrates the use of the SCORE statement with a nominal logistic model.
A SCORE statement is also available in several other modeling procedures such as GLMSELECT, GAM, LOESS, TPSPLINE, and ADAPTIVEREG. See the procedure documentation for discussion and examples.
PROC DISCRIM provides a TESTDATA= option that enables you to specify a data set to be scored, and a TESTOUT= option that includes posterior probabilities and predicted classifications. See the example titled "Linear Discriminant Analysis of Remote-Sensing Data on Crops" in the DISCRIM documentation.
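The PROC REG and PROC SCORE combination mentioned above can be sketched as follows. This is a minimal illustration, not the documented example: it assumes the SASHELP.CLASS sample data set is available, and the data set names EST, NEWOBS, and SCORED and the new predictor values are made up for the demonstration.

```sas
/* Fit the regression once and save the parameter estimates.
   SASHELP.CLASS is assumed to be available in your SAS session. */
proc reg data=sashelp.class outest=est noprint;
   model Weight = Height Age;
run;
quit;

/* New observations to score (made-up values); no response is needed */
data newobs;
   input Height Age;
   datalines;
60 13
65 15
;

/* TYPE=PARMS reads the regression coefficients from the OUTEST= data set.
   PREDICT requests predicted values instead of residuals. */
proc score data=newobs score=est type=parms predict out=scored;
   var Height Age;
run;

proc print data=scored noobs;
run;
```

The model is fit only once. Any later data set that contains Height and Age can be scored with another PROC SCORE step that reads the same EST data set.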
Beginning with SAS/STAT 12.1 in SAS 9.3 TS1M2, the CODE statement is available in several modeling procedures. The CODE statement generates SAS code that can be used in a DATA step to score a data set. See "CODE statement" in the Shared Concepts and Topics chapter of the SAS/STAT User's Guide for an example of using the CODE statement.
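As a rough sketch of the CODE statement workflow, the following steps fit a model, write the scoring code to a file, and then include that file in a DATA step. The example assumes SAS/STAT 12.1 or later and the SASHELP.CLASS sample data set. The file name score_weight.sas, the data set names, and the new predictor values are hypothetical.

```sas
/* Fit a model and write DATA step scoring code to a file.
   The file name is hypothetical; point it at a writable location. */
proc glm data=sashelp.class;
   class Sex;
   model Weight = Sex Height;
   code file='score_weight.sas';
run;
quit;

/* New observations (made-up values) with the same predictors */
data newobs;
   input Sex $ Height;
   datalines;
F 60
M 65
;

/* Apply the generated code; it adds the predicted value to each row */
data scored;
   set newobs;
   %include 'score_weight.sas';
run;

proc print data=scored noobs;
run;
```

Because the generated file is ordinary DATA step code, it can be reused in later sessions without access to the training data or to the modeling procedure.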
You can get predicted values for one or more settings of your model predictors by adding observations to the input data that you use to fit (train) the model. The predictors in these new observations should be set to the values for which you want predicted values. For the added observations, either the response variable should be set to missing, or if the new observations have observed values then a WEIGHT variable should be created with value 1 for the training observations and value 0 (or missing) for the new observations.
With these new observations appended to your training data set, the fitted model should be identical to the model fit using only the training data. This is because any observation that has a missing response value or zero (or missing) weight is ignored when fitting the model. (The exception to this is when the model includes spline effects defined in the EFFECT statement. See the Extrapolation section of this note for details.) The procedure can compute predicted values for such observations as long as they have nonmissing values for all of the model predictors and have values for CLASS predictors that existed in the training data set. This is further explained and illustrated in this note. In many procedures, you can request predicted values by specifying the P= option in the OUTPUT statement, but some procedures use other syntax. See the procedure's documentation.
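The missing-response variant of this approach might look like the following sketch. It assumes the SASHELP.CLASS sample data set is available. The new predictor values and the names NEWOBS, BOTH, ToScore, and PREDS are made up for illustration. Example 1 below shows the WEIGHT variable variant.

```sas
/* New observations (made-up values) with the response set to missing */
data newobs;
   input Height Age;
   Weight = .;
   datalines;
60 13
65 15
;

/* Append the new observations to the training data and flag them */
data both;
   set sashelp.class(keep=Height Age Weight) newobs(in=InNew);
   ToScore = InNew;
run;

/* Rows with a missing response are ignored when the model is fit
   but still receive predicted values in the OUTPUT data set */
proc reg data=both;
   model Weight = Height Age;
   output out=preds(where=(ToScore=1)) p=Pred;
run;
quit;

proc print data=preds noobs;
run;
```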
Example 1: Logistic Model Validation Using PROC GENMOD
Model validation often involves getting predictions for a potentially large number of observations that were held out from the original data. That is, the original data set is split into a data set to train the model and a data set to validate the model. Validation is done by comparing the values predicted under the model to the observed values in the validation data set (often called a hold-out data set). One way this can be done is by concatenating the training and validation data sets and using the combined data set as the input data set to the modeling procedure. It is often convenient for the output data set to contain only the validation observations, excluding the observations used to train the model. To do this, add a variable to the combined data set that indicates which observations are the training data set and which observations are the validation data set. You can use this indicator variable in a WHERE= data set option in the OUTPUT statement to select only the validation observations for output.
The following DATA step creates a SAS data set named REMISS that contains the training data for a logistic model to be fit by PROC GENMOD.
data remiss;
   input remiss cell smear infil li blast temp;
   datalines;
1 .8 .83 .66 1.9 1.1 .996
1 .9 .36 .32 1.4 .74 .992
0 .8 .88 .7 .8 .176 .982
0 1 .87 .87 .7 1.053 .986
1 .9 .75 .68 1.3 .519 .98
0 1 .65 .65 .6 .519 .982
1 .95 .97 .92 1 1.23 .992
0 .95 .87 .83 1.9 1.354 1.02
0 1 .45 .45 .8 .322 .999
0 .95 .36 .34 .5 0 1.038
0 .85 .39 .33 .7 .279 .988
0 .7 .76 .53 1.2 .146 .982
0 .8 .46 .37 .4 .38 1.006
0 .2 .39 .08 .8 .114 .99
0 1 .9 .9 1.1 1.037 .99
1 1 .84 .84 1.9 2.064 1.02
0 .65 .42 .27 .5 .114 1.014
0 1 .75 .75 1 1.322 1.004
0 .5 .44 .22 .6 .114 .99
1 1 .63 .63 1.1 1.072 .986
0 1 .33 .33 .4 .176 1.01
0 .9 .93 .84 .6 1.591 1.02
1 1 .58 .58 1 .531 1.002
0 .95 .32 .3 1.6 .886 .988
1 1 .6 .6 1.7 .964 .99
1 1 .69 .69 .9 .398 .986
0 1 .73 .73 .7 .398 .986
;
This DATA step creates a validation data set, NEW. For purposes of illustration, the first eight observations of the training data set are used.
data new;
   input remiss cell smear infil li blast temp;
   cards;
1 .8 .83 .66 1.9 1.1 .996
1 .9 .36 .32 1.4 .74 .992
0 .8 .88 .7 .8 .176 .982
0 1 .87 .87 .7 1.053 .986
1 .9 .75 .68 1.3 .519 .98
0 1 .65 .65 .6 .519 .982
1 .95 .97 .92 1 1.23 .992
0 .95 .87 .83 1.9 1.354 1.02
;
The following DATA step concatenates the training and validation data sets into a single data set, BOTH, for input to PROC GENMOD. The IN= option in the SET statement creates a temporary variable, InNew, which equals 1 when the observation comes from the validation data set (NEW) and equals 0 when it comes from the training data set (REMISS). The logical negation of this variable, W, is created for use as a weight variable. W equals 1 for the training observations and 0 for the validation observations.
data both;
   set remiss new (in=InNew);
   w=not(InNew);
run;

proc print noobs;
   var remiss w smear blast;
run;
The combined data set is shown below.
These statements fit the model using the combined data set, BOTH. The training indicator variable, W, is used in the WEIGHT statement. The results are identical to a GENMOD analysis on just the training data set because observations in the validation data set have zero weight and are ignored in the model fitting process. The OUTPUT statement produces a data set, PREDS, of predicted values. The WHERE= data set option after the OUT= data set name causes only the observations from the validation data set (those with W=0) to be written to the data set. The L= and U= options request that 95% confidence limits be computed and output in addition to the predicted values requested by the P= option.
proc genmod data=both descending;
   weight w;
   model remiss = smear blast / dist=binomial;
   output out=preds(where=(w=0)) p=pred l=lower u=upper;
run;

proc print data=preds noobs;
   var remiss pred lower upper smear blast;
run;
For each data set that you want to score, you would need to use this same process that involves refitting the model to the training data set. This can be avoided by using the STORE statement in PROC GENMOD and the SCORE statement in PROC PLM. The following GENMOD step fits the model and the STORE statement saves the model. To score each new data set, only a PLM step is required. Two data sets (NEW and NEW2) are scored in the following example. The ILINK option in the SCORE statement uses the inverse of the link function (logit, in this case) to obtain estimates on the mean (probability) scale.
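/* Fit the model once and save it in an item store */
proc genmod data=remiss descending;
   model remiss = smear blast / dist=binomial;
   store out=logmod;
run;

/* Score the first new data set from the stored model */
proc plm source=logmod;
   score data=new out=preds pred=pred lclm=lower uclm=upper / ilink;
run;

/* Score a second data set. NEW2 is not created in this note; it stands
   for any other data set that contains SMEAR and BLAST. A separate OUT=
   name keeps the first set of scores from being overwritten. */
proc plm source=logmod;
   score data=new2 out=preds2 pred=pred lclm=lower uclm=upper / ilink;
run;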
Some important issues must be remembered in order to correctly and accurately compute predicted values:
Note that if the value of a CLASS variable in an observation to be scored does not appear in the training data set, then that observation cannot be scored. This is because, unlike a continuous predictor, there is no parameter corresponding to that value in the trained model as explained in this note and in Example 3 below.
Predicted values can be obtained by computing them directly from the parameter estimates, as in Examples 2 and 3 below. However, the computations for the standard errors of the predicted values are generally more complex and are not carried out by this approach. As a result, confidence limits for the predicted values are also not available.
Example 2: A Poisson Model with Offset
The following Poisson model is based on the data in the "Getting Started" section of the GENMOD documentation. Note that the model includes a continuous variable (age), a CLASS variable (car), and their interaction. The CLASS variable uses GENMOD's default coding method (PARAM=GLM). The model also includes an offset variable (ln). The following statements create the training data set and fit the desired model. The XVARS and P options in the MODEL statement display the predictor values and the predicted counts (Pred) for the observations in the training data set, shown below. In this example, the specified model happens to be a saturated model, so the predicted values equal the actual values. But this has no influence on the manner of scoring.
data insure;
   input n c car $ age;
   ln = log(n);
   datalines;
500 42 small 1
1200 37 medium 1
100 1 large 1
400 101 small 2
500 73 medium 2
300 14 large 2
;

proc genmod data=insure;
   class car;
   model c = car age car*age / dist=poisson link=log offset=ln xvars p;
   ods output parameterestimates=pe;
run;
Notice that the parameter estimates table was saved via the ODS OUTPUT statement. The variable containing the parameter estimates (Estimate) is displayed to high precision by using a FORMAT statement in the following PROC PRINT step. The 12.10 format displays the estimates in a field 12 characters wide with 10 decimal places. These more precise values are used in the scoring computations below to more closely match what GENMOD does internally with full-precision values.
proc print data=pe;
   format estimate 12.10;
   var parameter level: estimate;
run;
The following step does the scoring. In this example, the training data set is scored, so it is specified in the SET statement. A SELECT group should appear for each predictor in the CLASS statement to create the appropriately coded design variables. Since the PARAM= option was not specified in the CLASS statement, the default GLM coding is used. If a different coding method is requested via the PARAM= option in the CLASS statement, the coding of the design variables (named carlarge, carmedium, and carsmall in this example) would change. This is discussed further below. The parameter estimates from the preceding PROC PRINT step are used in the computation of the linear predictor, x'β. By definition, the parameter associated with an offset variable equals 1. x'β is computed by multiplying each parameter estimate by its predictor (or design) variable and summing the products. Finally, the inverse link function is applied to get a predicted mean. Since this Poisson model uses the log link, the inverse link function is exponentiation, which can be done with the EXP function in SAS. For ordinary regression models, such as those fit by PROC REG or PROC GLM, the link is the identity link and x'β is itself the predicted mean.
data scores;
   set insure;
   select (car);
      when ("large")  do; carlarge=1; carmedium=0; carsmall=0; end;
      when ("medium") do; carlarge=0; carmedium=1; carsmall=0; end;
      when ("small")  do; carlarge=0; carmedium=0; carsmall=1; end;
      otherwise;
   end;
   xbeta=-3.577532930 +
         -2.568082297*carlarge + -1.456636259*carmedium + 0*carsmall +
         1.1005944499*age +
         0.4398505911*age*carlarge + 0.4544158160*age*carmedium + 0*age*carsmall +
         1*ln
         ;
   mu_hat=exp(xbeta);
run;
Notice that the computed scores (mu_hat) match the predicted values computed by the P option (Pred) in PROC GENMOD.
proc print noobs;
   var car age c xbeta mu_hat;
run;
Had effects coding (PARAM=EFFECT) been specified, the following SELECT group would properly code the design variables for use in scoring:
select (car);
   when ("large")  do; carlarge=1;  carmedium=0;  end;
   when ("medium") do; carlarge=0;  carmedium=1;  end;
   when ("small")  do; carlarge=-1; carmedium=-1; end;
   otherwise;
end;
For reference coding (PARAM=REF), this SELECT group would be used:
select (car);
   when ("large")  do; carlarge=1; carmedium=0; end;
   when ("medium") do; carlarge=0; carmedium=1; end;
   when ("small")  do; carlarge=0; carmedium=0; end;
   otherwise;
end;
For more on the various types of CLASS variable coding, see "CLASS Variable Parameterization" in the Details section of the LOGISTIC procedure documentation.
Example 3: A Probit Model with Categorical Predictors
The following uses data from the example titled "Logistic Modeling with Categorical Predictors" in the LOGISTIC procedure documentation. PROC GENMOD is used to fit a probit model to the data to model the probability of no pain. Effects coding is used for the categorical predictor Treatment (A, B, or P) and reference coding is used for Sex (F or M) with males (M) as the reference category.
data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68 1 No   B M 74 16 No  P F 67 30 No
P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76 9 Yes
A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64 1 Yes
A F 64 17 No  P M 74 4 No   A F 72 25 No
P M 70 1 Yes  B M 66 19 No  B M 59 29 No
A F 64 30 No  A M 70 28 No  A M 69 1 No
B F 78 1 No   P M 83 1 Yes  B F 69 42 No
B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No
B M 70 1 No   B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77 1 Yes  B F 69 24 No
P M 66 4 Yes  P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No
P F 72 27 No  P F 70 13 Yes A M 75 6 Yes
B F 65 7 No   P F 68 27 Yes P M 68 11 Yes
P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67 1 Yes  A M 67 10 No  P F 72 11 Yes
A F 74 1 No   B M 80 21 Yes A F 69 3 No
;

proc genmod data=Neuralgia;
   class Treatment (param=effect) Sex (param=ref ref="M");
   model Pain = Treatment Sex Treatment*Sex Age Duration / dist=binomial link=probit;
   output out=preds p=PrNoPain;
   ods output parameterestimates=parms;
run;
Below are the scores (predicted probabilities of no pain) for the first six observations, which the scoring step below reproduces.
proc print data=preds(obs=6) noobs;
run;
These statements display the parameter estimates of the probit model with more precision for use in scoring.
proc print data=parms noobs;
   format estimate 12.10;
   var parameter level: estimate;
run;
To illustrate scoring, the first six observations of the training data set are used as a validation data set. The scores for these observations should equal the predicted values computed by the GENMOD procedure above. Two additional observations are included: one with an invalid Treatment code (X) and one with a missing value for Sex. The first of these cannot be scored because there is no parameter for Treatment X in the model. In order to score this point, the training data would need to include some subjects who were given Treatment X. The second cannot be scored because all predictors in the model must have nonmissing values in order to make a valid computation. See this usage note for more discussion.
data valid;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68 1 No   B M 74 16 No  P F 67 30 No
P M 66 26 Yes B F 67 28 No  B F 77 16 No
X F 50 10 .   B . 32 15 .
;

proc print noobs;
run;
In the scoring step below, a SELECT group is included for each of the two categorical predictors, Treatment and Sex, using coding that matches the coding used when training the model: effects coding for Treatment and reference coding for Sex. The "Class Level Information" table produced by PROC GENMOD shows you how the design variables are coded. x'β is computed using the high-precision parameter estimates displayed above. Since the inverse of the probit link function is the cumulative probability from the standard normal distribution, you can use the PROBNORM function in SAS. Had the logit link been used to produce a logistic model, you would use the inverse logit function, 1/(1+exp(-x'β)), which can also be computed using the LOGISTIC function: logistic(xbeta).
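The inverse link functions mentioned in this note can be compared side by side in a short DATA step. This is only an illustrative sketch: the value assigned to xbeta is arbitrary, the variable names are made up, and the LOGISTIC function is assumed to be available in your release as described above.

```sas
data _null_;
   xbeta = 0.5;                           /* arbitrary linear predictor value */

   /* Probit link: standard normal cumulative probability, two equivalent calls */
   p_probit  = probnorm(xbeta);
   p_probit2 = cdf('NORMAL', xbeta);

   /* Logit link: inverse logit written out and via the LOGISTIC function */
   p_logit   = 1 / (1 + exp(-xbeta));
   p_logit2  = logistic(xbeta);

   /* Log link (for example, the Poisson model in Example 2) */
   mu_log    = exp(xbeta);

   /* Write the results to the SAS log */
   put p_probit= p_probit2= p_logit= p_logit2= mu_log=;
run;
```

Within each pair, the two expressions should give the same value, so either form can be used in a scoring step.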
data scores;
   set valid;
   select (Treatment);
      when ("A") do; TrtA=1;  TrtB=0;  end;
      when ("B") do; TrtA=0;  TrtB=1;  end;
      when ("P") do; TrtA=-1; TrtB=-1; end;
      otherwise;
   end;
   select (Sex);
      when ("F") SexF=1;
      when ("M") SexF=0;
      otherwise;
   end;
   xbeta=9.9221347360 +
         0.5139396491*TrtA + 0.7200040408*TrtB +
         0.9404279566*SexF +
         -.1621550070*TrtA*SexF + 0.1580056737*TrtB*SexF +
         -.1439733988*age +
         0.0006380643*duration
         ;
   PrNoPain=probnorm(xbeta);
run;
Notice that the predicted probabilities for the first six observations match those computed by PROC GENMOD above, and the predicted probabilities for the last two observations are missing as expected.
proc print noobs;
   var treatment sex age duration pain xbeta PrNoPain;
run;
Note that PROC LOGISTIC can fit a probit model and also provides effects and reference coding. Since it has built-in scoring capability via its SCORE statement, you can fit the model and score the validation data all in a single step. Any slight differences are due to minor differences in starting values and iteration methods used by GENMOD and LOGISTIC.
proc logistic data=Neuralgia;
   class Treatment (param=effect) Sex (param=ref ref="M");
   model Pain = Treatment Sex Treatment*Sex Age Duration / link=probit;
   score data=valid out=validscore;
run;

proc print data=validscore noobs;
   var treatment sex age duration pain P_No;
run;
Example 4: Scoring a model containing spline effects
See this example that discusses the types of spline transformations available in the EFFECT statement and illustrates reproducing the spline basis functions and scoring data.
Product Family: SAS System
Product: SAS/STAT
System:
z/OS
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows XP Professional
Windows Millennium Edition (Me)
Windows Vista
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
Type: Usage Note
Topic:
Analytics ==> Regression
Analytics ==> Longitudinal Analysis
Analytics ==> Mixed Models
SAS Reference ==> Procedures ==> COUNTREG
SAS Reference ==> Procedures ==> GAM
SAS Reference ==> Procedures ==> GENMOD
SAS Reference ==> Procedures ==> LIFEREG
SAS Reference ==> Procedures ==> LOESS
SAS Reference ==> Procedures ==> LOGISTIC
SAS Reference ==> Procedures ==> PROBIT
SAS Reference ==> Procedures ==> REG
SAS Reference ==> Procedures ==> TPSPLINE
SAS Reference ==> Procedures ==> GLIMMIX
SAS Reference ==> Procedures ==> GLM
SAS Reference ==> Procedures ==> HPMIXED
SAS Reference ==> Procedures ==> MIXED
SAS Reference ==> Procedures ==> NLIN
SAS Reference ==> Procedures ==> NLMIXED
SAS Reference ==> Procedures ==> PLS
SAS Reference ==> Procedures ==> RSREG
Analytics ==> Categorical Data Analysis
SAS Reference ==> Procedures ==> GLMSELECT
SAS Reference ==> Procedures ==> SURVEYLOGISTIC
SAS Reference ==> Procedures ==> SURVEYREG
SAS Reference ==> Procedures ==> DISCRIM
Analytics ==> Survival Analysis
SAS Reference ==> Procedures ==> PHREG
SAS Reference ==> Procedures ==> SURVEYPHREG
SAS Reference ==> Procedures ==> ORTHOREG
SAS Reference ==> Procedures ==> PLM
SAS Reference ==> Procedures ==> HPLOGISTIC
SAS Reference ==> Procedures ==> HPREG
SAS Reference ==> Procedures ==> HPGENSELECT
SAS Reference ==> Procedures ==> HPPRINCOMP
SAS Reference ==> Procedures ==> HPSPLIT
Date Modified: 2016-08-26 10:11:38
Date Created: 2008-09-15 16:21:59