Four ways to score (compute predicted values for) new observations using a previously-fitted model are discussed below.
Some procedures include features that make scoring new observations easier:
For ordinary regression models fit using PROC REG, you can use PROC SCORE to compute predicted values for new observations. See the example titled "Regression Parameter Estimates" in the SCORE procedure documentation. You do not need to copy or set the response to missing. Nor is refitting of the model necessary. PROC SCORE does not directly provide scoring for other types of models, but see the third scoring method below.
For a logistic or probit model, the scoring process is greatly simplified in PROC LOGISTIC. Its SCORE statement allows you to score a separate data set of observations. As with ordinary regression models, you do not need to copy or set the response to missing. Refitting of the model is not necessary if the model is saved using the OUTMODEL= option and then retrieved during scoring by the INMODEL= option. The LOGISTIC procedure documentation example titled "Scoring Data Sets with the SCORE Statement" illustrates the use of the SCORE statement with a nominal logistic model which can be fit in either PROC LOGISTIC or PROC CATMOD. Scoring new observations with PROC LOGISTIC is further discussed in this usage note. See this usage note for more on scoring with PROC PROBIT.
A SCORE statement is also available in the GLMSELECT, GAM, LOESS, and TPSPLINE modeling procedures. See the procedure documentation for discussion and examples. Additional restrictions for the computation of predicted values apply when scoring observations with PROC GAM.
PROC DISCRIM provides a TESTDATA= option, which allows you to specify a data set to be scored, and a TESTOUT= option which includes posterior probabilities and predicted classifications. See the example titled "Linear Discriminant Analysis of Remote-Sensing Data on Crops" in the DISCRIM procedure documentation.
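Two of the built-in features listed above can be sketched as follows. This is a minimal illustration, not taken from the procedure documentation: it assumes the SASHELP.CLASS and SASHELP.HEART sample data sets are available, and the data set names NewKids, NewPatients, RegEst, HeartModel, RegScored, and LogitScored are hypothetical.

```sas
/* Hypothetical new observations for the regression model */
data NewKids;
   input Height Age;
   datalines;
60 13
65 15
;

/* Ordinary regression: save the estimates with OUTEST=, */
/* then apply them to the new data with PROC SCORE       */
proc reg data=sashelp.class outest=RegEst noprint;
   model Weight = Height Age;
run;
quit;

proc score data=NewKids score=RegEst out=RegScored type=parms;
   var Height Age;   /* predictors only, so the score is the predicted value */
run;

/* Logistic model: save the fitted model with OUTMODEL= */
proc logistic data=sashelp.heart outmodel=HeartModel;
   model Status(event='Dead') = AgeAtStart Cholesterol;
run;

/* Hypothetical new observations for the logistic model */
data NewPatients;
   input AgeAtStart Cholesterol;
   datalines;
45 220
60 280
;

/* Later, score without refitting by reading the saved model with INMODEL= */
proc logistic inmodel=HeartModel;
   score data=NewPatients out=LogitScored;
run;
```

In neither case does the new data set need a response variable. With TYPE=PARMS, PROC SCORE should write the predicted values to a variable named after the model label (MODEL1 unless the MODEL statement is labeled), and the SCORE statement in PROC LOGISTIC adds predicted probabilities for each response level to its OUT= data set.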
Beginning with SAS/STAT 9.22 in SAS 9.2 TS2M3, many procedures provide a STORE statement to save the fitted model. You can then use the SCORE statement in PROC PLM to score a data set using the saved model. This is illustrated in the example titled "Scoring with PROC PLM" in the Examples section of the PROC PLM documentation. For more on the STORE statement, see "STORE statement" in the Shared Concepts and Topics chapter of the SAS/STAT User's Guide.
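A minimal sketch of the STORE and PROC PLM combination follows. It assumes the SASHELP.HEART sample data set is available; the item store name HeartFit, the data set names NewPatients and PlmScored, and the output variable names are hypothetical. The RESTORE= option is written as documented in recent releases, so check the PROC PLM documentation for your release if the step fails.

```sas
/* Fit the model once and save it in an item store */
proc genmod data=sashelp.heart descending;
   model Status = AgeAtStart Cholesterol / dist=binomial;
   store work.HeartFit;
run;

/* Hypothetical new observations to score */
data NewPatients;
   input AgeAtStart Cholesterol;
   datalines;
45 220
60 280
;

/* Score the new data from the saved model; no refitting is done. */
/* ILINK requests predictions on the probability scale rather     */
/* than on the scale of the linear predictor.                     */
proc plm restore=work.HeartFit;
   score data=NewPatients out=PlmScored
         predicted=Pred lclm=Lower uclm=Upper / ilink;
run;
```

Because the item store holds everything needed for scoring, the PROC PLM step can be run in a later session, provided the item store is saved in a permanent library rather than WORK.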
You can get predicted values for one or more settings of your predictor variables by adding observations to the input data that you used to fit (train) the model. The predictors in these new observations should be set to the values for which you want predicted values, but the response variable should be set to missing. If you have response values for these predictor settings and want to compare them to the model predictions, create a new variable that is a copy of the response variable before setting the response to missing.
With these new observations appended to your training data set, refit the model exactly as before. You should notice that the results are identical to the results you had before adding the new observations. This is because any observation that has a missing response value is ignored when fitting the model. However, the procedure can compute predicted values for such observations as long as it has nonmissing values for all of the model predictors (values of variables in the WEIGHT or FREQ statement, if used, must also be nonmissing and positive). In many procedures, you can request predicted values by specifying the P= option in the OUTPUT statement, but some procedures may use other syntax. See the procedure's documentation.
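For a single predictor setting, the augmentation approach can be sketched as follows. This is an illustration only, assuming the SASHELP.CLASS sample data set is available; the data set names NewKid, Augmented, and RegPreds are hypothetical.

```sas
/* One new predictor setting; Weight (the response) is not assigned, */
/* so it is missing after concatenation                              */
data NewKid;
   Height = 60;
   Age    = 13;
run;

/* Append the new observation to the training data */
data Augmented;
   set sashelp.class NewKid;
run;

/* Refit exactly as before. The new observation is ignored in the fit */
/* but still receives a predicted value in the OUTPUT data set.       */
proc reg data=Augmented;
   model Weight = Height Age;
   output out=RegPreds p=Pred;
run;
quit;

/* Show only the appended observation */
proc print data=RegPreds(where=(Weight=.)) noobs;
   var Height Age Pred;
run;
```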
Example 1: Logistic Model Validation Using PROC GENMOD

Model validation often involves getting predictions for a potentially large number of observations which were held out from the original data. That is, the original data set is split into a data set to train the model and a data set to validate the model. Validation is done by comparing the values predicted under the model to the observed values in the validation data set (often called a hold-out data set). As indicated above, this is done by concatenating the training and validation data sets and using the combined data set as the input data set to the modeling procedure. A copy of the response variable holds the observed responses for comparison to the values predicted by the model.

It is often convenient for the output data set to contain only the new, validation observations and to exclude the observations used to train the model. To do this, add a variable to the combined data set that indicates which observations are the training data set and which observations are the validation data set. You can use this indicator variable in a WHERE= data set option in the OUTPUT statement to select only the validation observations for output.
The following DATA step creates a SAS data set named REMISS that contains the training data for a logistic model to be fit by PROC GENMOD.
data remiss;
input remiss cell smear infil li blast temp;
datalines;
1 .8 .83 .66 1.9 1.1 .996
1 .9 .36 .32 1.4 .74 .992
0 .8 .88 .7 .8 .176 .982
0 1 .87 .87 .7 1.053 .986
1 .9 .75 .68 1.3 .519 .98
0 1 .65 .65 .6 .519 .982
1 .95 .97 .92 1 1.23 .992
0 .95 .87 .83 1.9 1.354 1.02
0 1 .45 .45 .8 .322 .999
0 .95 .36 .34 .5 0 1.038
0 .85 .39 .33 .7 .279 .988
0 .7 .76 .53 1.2 .146 .982
0 .8 .46 .37 .4 .38 1.006
0 .2 .39 .08 .8 .114 .99
0 1 .9 .9 1.1 1.037 .99
1 1 .84 .84 1.9 2.064 1.02
0 .65 .42 .27 .5 .114 1.014
0 1 .75 .75 1 1.322 1.004
0 .5 .44 .22 .6 .114 .99
1 1 .63 .63 1.1 1.072 .986
0 1 .33 .33 .4 .176 1.01
0 .9 .93 .84 .6 1.591 1.02
1 1 .58 .58 1 .531 1.002
0 .95 .32 .3 1.6 .886 .988
1 1 .6 .6 1.7 .964 .99
1 1 .69 .69 .9 .398 .986
0 1 .73 .73 .7 .398 .986
;
These statements fit the logistic model. The parameter estimates of the fitted model are shown.
proc genmod data=remiss descending;
model remiss = smear blast / dist=binomial;
run;
This DATA step creates a validation data set, NEW. For purposes of illustration, the first eight observations of the training data set are used.
data new;
input remiss cell smear infil li blast temp;
cards;
1 .8 .83 .66 1.9 1.1 .996
1 .9 .36 .32 1.4 .74 .992
0 .8 .88 .7 .8 .176 .982
0 1 .87 .87 .7 1.053 .986
1 .9 .75 .68 1.3 .519 .98
0 1 .65 .65 .6 .519 .982
1 .95 .97 .92 1 1.23 .992
0 .95 .87 .83 1.9 1.354 1.02
;
The following DATA step concatenates the training and validation data sets into a single data set, BOTH, for input to PROC GENMOD. The IN= option in the SET statement creates a temporary variable, InNew, which equals 1 when the observation comes from the validation data set (NEW) and equals 0 when it comes from the training data set (REMISS). A permanent copy of this variable is created and named VALID. A copy of the response variable, RemissOBS, is created so that predicted values can be compared with the observed values. The response variable, REMISS, is then set to missing for all observations in the validation data set.
data both;
set remiss new (in=InNew);
valid=InNew;
RemissOBS=remiss;
if InNew then call missing(remiss);
run;
proc print noobs;
var valid RemissOBS remiss smear blast;
run;
These statements refit the model by running the same analysis code, but this time on the combined data set, BOTH. Notice that the results are identical because observations in the validation data set all have missing response values and so they are ignored in the model fitting process. The OUTPUT statement is added to produce a data set, PREDS, of predicted values. The WHERE clause after the OUT= data set name causes only those observations from the validation data set to be written to the data set. The L= and U= options request that 95% confidence limits be computed and output in addition to the predicted values requested by the P= option.
proc genmod data=both descending;
model remiss = smear blast / dist=binomial;
output out=preds(where=(valid=1)) p=pred l=lower u=upper;
run;
A predicted REMISS level, RemissPred, is obtained by declaring the observation a predicted event if the predicted event probability is greater than or equal to 0.5. This is equivalent to a rule that classifies the observation according to the response level with the largest predicted probability.
data preds;
set preds;
RemissPred=(pred >= 0.5);
run;
The scored validation data set, including the predicted response category, is shown below.
proc print noobs;
var RemissObs RemissPred pred lower upper smear blast;
run;
A common way of summarizing the predictive ability of a categorical model is a cross-classification of the predicted and observed responses, often called a confusion matrix. This is easily produced by PROC FREQ. The NOCOL, NOROW, and NOPERCENT options are optional; they simply omit the column, row, and overall percentages from the table. The OUT= option creates a data set containing the cell counts of the table in a variable named COUNT. Notice that three of the four observed nonevents (RemissOBS=0) were correctly classified, but all four of the observed events (RemissOBS=1) were incorrectly classified.
proc freq data=preds;
table RemissOBS * RemissPred / nocol norow nopercent out=CellCounts;
run;
The correct classification rate can be computed by dividing the total count on the main diagonal of the table by the total count in the table. This is accomplished by creating a variable that indicates whether the cell is on the diagonal and then computing its mean using COUNT as the frequency variable. The correct classification rate for the validation data set is 3/8 = 0.375.
data CellCounts;
set CellCounts;
Diag=(RemissOBS=RemissPred);
run;
proc means mean;
var Diag;
freq count;
run;
The fourth method is to compute the scores directly from the parameter estimates in a DATA step. It is useful when neither built-in scoring features nor the STORE statement is available, or when the augmentation method would be too costly. Note that augmentation can be inefficient when the training data set or model is very large because the augmentation method requires retraining the model each time you want to score observations. Computing the scores directly also shows the computations needed to score new observations should you need to do so on a system that does not have SAS installed.
Two important things must be remembered:
Note that if the value of a CLASS variable in an observation to be scored does not appear in the training data set, then that observation cannot be scored. This is because, unlike a continuous predictor, there is no parameter corresponding to that value in the trained model.
While predicted values can be obtained by this method, the standard errors of the predicted values are generally more complex to compute and are not produced by it. As a result, confidence limits for the predicted values are also unavailable.
Example 2: A Poisson Model with Offset

The following Poisson model is based on the data in the "Getting Started" section of the GENMOD documentation. Note that the model includes a continuous variable (age), a CLASS variable (car), and their interaction. The CLASS variable uses GENMOD's default coding method (PARAM=GLM). The model also includes an offset variable (ln). The following statements create the training data set and fit the desired model. The XVARS and P options in the MODEL statement display the predictor values and the predicted counts (Pred) for the observations in the training data set, shown below. In this example, the specified model happens to be a saturated model, so the predicted values equal the actual values. But this has no influence on the manner of scoring.
data insure;
input n c car $ age;
ln = log(n);
datalines;
500 42 small 1
1200 37 medium 1
100 1 large 1
400 101 small 2
500 73 medium 2
300 14 large 2
;
proc genmod data=insure;
class car;
model c = car age car*age/ dist=poisson link=log offset=ln xvars p;
ods output parameterestimates=pe;
run;
Notice that the parameter estimates table was saved via the ODS OUTPUT statement. The variable containing the parameter estimates (estimate) is displayed to high precision by using a FORMAT statement in the following PROC PRINT step. The 12.10 format displays the estimates in a field 12 characters wide with 10 decimal places. These more precise values are used in the scoring computations below to more closely match what GENMOD does internally with full-precision values.
proc print data=pe;
format estimate 12.10;
var parameter level: estimate;
run;
The following step does the scoring. In this example, the training data set is scored, so it is specified in the SET statement. A SELECT group should appear for each predictor in the CLASS statement to create the appropriately coded design variables. Since the PARAM= option was not specified in the CLASS statement, the default GLM coding is used. If a different coding method is requested via the PARAM= option in the CLASS statement, the coding of the design variables (named carlarge, carmedium, and carsmall in this example) would change. This is discussed further below.

The parameter estimates from the preceding PROC PRINT step are used in the computation of the linear predictor, x'β, which is computed by multiplying the parameter estimates by the corresponding predictor (or design) variables and adding the products. By definition, the parameter associated with an offset variable equals 1. Finally, the inverse link function is applied to get a predicted mean. Since this Poisson model uses the log link, the inverse link function is exponentiation, which can be done with the EXP function in SAS. For ordinary regression models, such as those fit by PROC REG or PROC GLM, the link is the identity link and x'β is the predicted mean.
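data scores;
   set insure;
   /* GLM coding: one 0/1 design variable for each level of car */
   select (car);
      when ("large") do;
         carlarge=1; carmedium=0; carsmall=0; end;
      when ("medium") do;
         carlarge=0; carmedium=1; carsmall=0; end;
      when ("small") do;
         carlarge=0; carmedium=0; carsmall=1; end;
      otherwise;   /* level not in the model: design variables stay missing */
   end;
   /* linear predictor: estimates times design variables, plus the offset */
   xbeta=-3.577532930 +
         -2.568082297*carlarge +
         -1.456636259*carmedium +
         0*carsmall +
         1.1005944499*age +
         0.4398505911*age*carlarge +
         0.4544158160*age*carmedium +
         0*age*carsmall +
         1*ln
         ;
   /* inverse of the log link gives the predicted count */
   mu_hat=exp(xbeta);
run;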
Notice that the computed scores (mu_hat) match the predicted values computed by the P option (Pred) in PROC GENMOD.
proc print noobs;
var car age c xbeta mu_hat;
run;
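As a check on the arithmetic, the score for the first observation in INSURE (car=small, age=1, n=500) can be worked by hand. Only the intercept, the age estimate, and the offset contribute, because both carsmall terms have zero estimates. The rounded value ln(500) ≈ 6.2146081 is computed here and is not taken from the original output.

```latex
\begin{aligned}
x'\beta &= -3.577532930 + 0\,(1) + 1.1005944499\,(1) + 0\,(1)(1) + 1\cdot\ln(500) \\
        &\approx -3.5775329 + 1.1005944 + 6.2146081 = 3.7376696, \\
\hat{\mu} &= \exp(3.7376696) \approx 42 .
\end{aligned}
```

This agrees with the observed count, c=42, as expected for a saturated model.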
Had effects coding (PARAM=EFFECT) been specified, the following SELECT group would properly code the design variables for use in scoring:
select (car);
when ("large") do;
carlarge=1; carmedium=0; end;
when ("medium") do;
carlarge=0; carmedium=1; end;
when ("small") do;
carlarge=-1; carmedium=-1; end;
otherwise;
end;
And for reference coding (PARAM=REF), the SELECT group would be:
select (car);
when ("large") do;
carlarge=1; carmedium=0; end;
when ("medium") do;
carlarge=0; carmedium=1; end;
when ("small") do;
carlarge=0; carmedium=0; end;
otherwise;
end;
For more on the various types of CLASS variable coding, see "CLASS Variable Parameterization" in the Details section of the LOGISTIC procedure documentation.
Example 3: A Probit Model

The following uses data from the example titled "Logistic Modeling with Categorical Predictors" in the LOGISTIC procedure documentation. PROC GENMOD is used to fit a probit model to the data to model the probability of no pain. Effects coding is used for the categorical predictor Treatment (A, B, or P), and reference coding is used for Sex (F or M) with males (M) as the reference category.
data Neuralgia;
input Treatment $ Sex $ Age Duration Pain $ @@;
datalines;
P F 68 1 No B M 74 16 No P F 67 30 No
P M 66 26 Yes B F 67 28 No B F 77 16 No
A F 71 12 No B F 72 50 No B F 76 9 Yes
A M 71 17 Yes A F 63 27 No A F 69 18 Yes
B F 66 12 No A M 62 42 No P F 64 1 Yes
A F 64 17 No P M 74 4 No A F 72 25 No
P M 70 1 Yes B M 66 19 No B M 59 29 No
A F 64 30 No A M 70 28 No A M 69 1 No
B F 78 1 No P M 83 1 Yes B F 69 42 No
B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No A F 69 12 No B F 65 14 No
B M 70 1 No B M 67 23 No A M 76 25 Yes
P M 78 12 Yes B M 77 1 Yes B F 69 24 No
P M 66 4 Yes P F 65 29 No P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No
P F 72 27 No P F 70 13 Yes A M 75 6 Yes
B F 65 7 No P F 68 27 Yes P M 68 11 Yes
P M 67 17 Yes B M 70 22 No A M 65 15 No
P F 67 1 Yes A M 67 10 No P F 72 11 Yes
A F 74 1 No B M 80 21 Yes A F 69 3 No
;
proc genmod data=Neuralgia;
class Treatment (param=effect) Sex (param=ref ref="M");
model Pain = Treatment Sex Treatment*Sex Age Duration / dist=binomial link=probit;
output out=preds p=PrNoPain;
ods output parameterestimates=parms;
run;
Below are the scores (predicted probabilities of no pain) for the first six observations, which the scoring step later in this example will reproduce.
proc print data=preds(obs=6) noobs;
run;
These statements display the parameter estimates of the probit model with more precision for use in scoring.
proc print data=parms noobs;
format estimate 12.10;
var parameter level: estimate;
run;
To illustrate scoring, the first six observations of the training data set are used as a validation data set. The scores for these observations should equal the predicted values computed by the GENMOD procedure above. Two additional observations are included: one with an invalid Treatment code (X) and one with a missing value for Sex. The first of these cannot be scored because there is no parameter for Treatment X in the model. In order to score this point, the training data would need to include some subjects who were given Treatment X. The second cannot be scored because values for all predictors in the model must be nonmissing in order to make a valid computation. See this usage note for more discussion.
data valid;
input Treatment $ Sex $ Age Duration Pain $ @@;
datalines;
P F 68 1 No B M 74 16 No P F 67 30 No
P M 66 26 Yes B F 67 28 No B F 77 16 No
X F 50 10 . B . 32 15 .
;
proc print noobs;
run;
In the scoring step below, a SELECT group is included for each of the two categorical predictors, Treatment and Sex, using coding that matches the coding used when training the model: effects coding for Treatment and reference coding for Sex. The "Class Level Information" table (above) produced by PROC GENMOD shows you how the design variables are coded. x'β is computed using the high-precision parameter estimates displayed above. Since the inverse of the probit link function is the standard normal cumulative distribution function, you can use the PROBNORM function in SAS. Had the logit link been used to produce a logistic model, you would use the inverse logit function, 1/(1+exp(-x'β)), which can also be computed using the LOGISTIC function: logistic(xbeta).
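data scores;
   set valid;
   /* effects coding for Treatment; P is the last level and is coded -1 */
   select (Treatment);
      when ("A") do;
         TrtA=1; TrtB=0; end;
      when ("B") do;
         TrtA=0; TrtB=1; end;
      when ("P") do;
         TrtA=-1; TrtB=-1; end;
      otherwise;   /* level not in the model: design variables stay missing */
   end;
   /* reference coding for Sex with M as the reference level */
   select (Sex);
      when ("F") SexF=1;
      when ("M") SexF=0;
      otherwise;   /* missing or unknown Sex: SexF stays missing */
   end;
   /* linear predictor from the high-precision estimates */
   xbeta=9.9221347360 +
         0.5139396491*TrtA +
         0.7200040408*TrtB +
         0.9404279566*SexF +
         -.1621550070*TrtA*SexF +
         0.1580056737*TrtB*SexF +
         -.1439733988*age +
         0.0006380643*duration
         ;
   /* inverse of the probit link gives the predicted probability */
   PrNoPain=probnorm(xbeta);
run;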
Notice that the predicted probabilities for the first six observations match those computed by PROC GENMOD above, and the predicted probabilities for the last two observations are missing as expected.
proc print noobs;
var treatment sex age duration pain xbeta PrNoPain;
run;
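For reference, the inverse link functions applied to the linear predictor in this note can be written side by side. The notation is a summary sketch: Φ denotes the standard normal cumulative distribution function (PROBNORM in SAS), and for the Poisson model x'β is taken to include the offset term, as in Example 2.

```latex
\begin{aligned}
\text{identity link (ordinary regression):} \quad & \hat{\mu} = x'\beta \\
\text{log link (Poisson):} \quad & \hat{\mu} = \exp(x'\beta) \\
\text{probit link:} \quad & \hat{p} = \Phi(x'\beta) \\
\text{logit link:} \quad & \hat{p} = \frac{1}{1 + \exp(-x'\beta)}
\end{aligned}
```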
Note that PROC LOGISTIC can fit a probit model and also provides effects and reference coding. Since it has built-in scoring capability via its SCORE statement, you can fit the model and score the validation data all in a single step. Any slight differences between these predicted probabilities and those computed above are due to minor differences in the starting values and iteration methods used by GENMOD and LOGISTIC.
proc logistic data=Neuralgia;
class Treatment (param=effect) Sex (param=ref ref="M");
model Pain = Treatment Sex Treatment*Sex Age Duration / link=probit;
score data=valid out=validscore;
run;
proc print data=validscore noobs;
var treatment sex age duration pain P_No;
run;
| Type: | Usage Note |
| Priority: | |
| Topic: | Analytics ==> Regression Analytics ==> Longitudinal Analysis Analytics ==> Mixed Models SAS Reference ==> Procedures ==> COUNTREG SAS Reference ==> Procedures ==> GAM SAS Reference ==> Procedures ==> GENMOD SAS Reference ==> Procedures ==> LIFEREG SAS Reference ==> Procedures ==> LOESS SAS Reference ==> Procedures ==> LOGISTIC SAS Reference ==> Procedures ==> PROBIT SAS Reference ==> Procedures ==> REG SAS Reference ==> Procedures ==> TPSPLINE SAS Reference ==> Procedures ==> GLIMMIX SAS Reference ==> Procedures ==> GLM SAS Reference ==> Procedures ==> HPMIXED SAS Reference ==> Procedures ==> MIXED SAS Reference ==> Procedures ==> NLIN SAS Reference ==> Procedures ==> NLMIXED SAS Reference ==> Procedures ==> PLS SAS Reference ==> Procedures ==> RSREG Analytics ==> Categorical Data Analysis SAS Reference ==> Procedures ==> GLMSELECT SAS Reference ==> Procedures ==> SURVEYLOGISTIC SAS Reference ==> Procedures ==> SURVEYREG SAS Reference ==> Procedures ==> DISCRIM Analytics ==> Survival Analysis SAS Reference ==> Procedures ==> PHREG SAS Reference ==> Procedures ==> SURVEYPHREG SAS Reference ==> Procedures ==> ORTHOREG |
| Date Modified: | 2009-01-29 14:21:48 |
| Date Created: | 2008-09-15 16:21:59 |



