PROC LOGISTIC: Scoring Data Sets :: SAS/STAT(R) 9.2 User's Guide, Second Edition

The LOGISTIC Procedure

Scoring Data Sets

Scoring a data set, which is especially important for predictive modeling, means applying a previously fitted model to a new data set in order to compute the conditional, or posterior, probabilities of each response category given the values of the explanatory variables in each observation.

The SCORE statement enables you to score new data sets and output the scored values and, optionally, the corresponding confidence limits into a SAS data set. If the response variable is included in the new data set, then you can request fit statistics for the data, which is especially useful for test or validation data. If the response is binary, you can also create a SAS data set containing the receiver operating characteristic (ROC) curve. You can specify multiple SCORE statements in the same invocation of PROC LOGISTIC.

By default, the posterior probabilities are based on implicit prior probabilities that are proportional to the frequencies of the response categories in the training data (the data used to fit the model). Specify explicit prior probabilities when the sample proportions of the response categories in the training data differ substantially from those in the operational data to be scored. For example, to detect a rare category, it is common practice to use a training set in which the rare category is overrepresented; without prior probabilities that reflect the true incidence rate, the predicted posterior probabilities for the rare category will be too high. Specifying the correct priors adjusts the posterior probabilities appropriately.

The model fit to the DATA= data set in the PROC LOGISTIC statement is the default model used for the scoring. Alternatively, you can save a model fit in one run of PROC LOGISTIC and use it to score new data in a subsequent run. The OUTMODEL= option in the PROC LOGISTIC statement saves the model information in a SAS data set. When you specify this data set in the INMODEL= option of a new PROC LOGISTIC run, the procedure scores the DATA= data set named in the SCORE statement without refitting the model.

Posterior Probabilities and Confidence Limits

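Let $F$ be the inverse link function. That is,

$F(t) = \begin{cases} \dfrac{1}{1+\exp(-t)} & \text{logistic} \\ \Phi(t) & \text{normal} \\ 1-\exp(-\exp(t)) & \text{complementary log-log} \end{cases}$

The first derivative of $F$ is given by

$F'(t) = \begin{cases} \dfrac{\exp(-t)}{(1+\exp(-t))^{2}} & \text{logistic} \\ \phi(t) & \text{normal} \\ \exp(t)\exp(-\exp(t)) & \text{complementary log-log} \end{cases}$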
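As an illustration only, the inverse link and its first derivative can be sketched in Python for the logit, probit, and complementary log-log links. This is not SAS code; the function names `inverse_link` and `inverse_link_derivative` and the link labels are hypothetical, and only the Python standard library is assumed.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def inverse_link(t, link="logit"):
    """Inverse link F(t): maps a linear predictor t to a probability."""
    if link == "logit":
        return 1.0 / (1.0 + math.exp(-t))
    if link == "probit":
        return _STD_NORMAL.cdf(t)
    if link == "cloglog":
        return 1.0 - math.exp(-math.exp(t))
    raise ValueError(f"unknown link: {link!r}")


def inverse_link_derivative(t, link="logit"):
    """First derivative F'(t) of the inverse link."""
    if link == "logit":
        f = inverse_link(t, "logit")
        return f * (1.0 - f)  # equals exp(-t) / (1 + exp(-t))**2
    if link == "probit":
        return _STD_NORMAL.pdf(t)
    if link == "cloglog":
        return math.exp(t - math.exp(t))  # exp(t) * exp(-exp(t))
    raise ValueError(f"unknown link: {link!r}")
```

The derivative is what later enters the delta-method variance, so a quick finite-difference comparison against `inverse_link` is a reasonable sanity check for any link you add.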

Suppose there are $k+1$ response categories. Let $Y$ be the response variable with levels $1, \ldots, k+1$. Let $\mathbf{x} = (x_0, x_1, \ldots, x_s)'$ be a $(s+1)$-vector of covariates, with $x_0 \equiv 1$. Let $\boldsymbol{\beta}$ be the vector of intercept and slope regression parameters.

Posterior probabilities are given by

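$p_n(Y=i|\mathbf{x}) = \dfrac{p_o(Y=i|\mathbf{x})\,\dfrac{p_n(Y=i)}{p_o(Y=i)}}{\sum_{j} p_o(Y=j|\mathbf{x})\,\dfrac{p_n(Y=j)}{p_o(Y=j)}}, \quad i=1,\ldots,k+1$

where the old posterior probabilities ( $p_o(Y=i|\mathbf{x}),\ i=1,\ldots,k+1$ ) are the conditional probabilities of the response categories given $\mathbf{x}$, and the old priors ( $p_o(Y=i),\ i=1,\ldots,k+1$ ) are the sample proportions of response categories of the training data. The new priors are denoted $p_n(Y=i)$. To simplify notation, absorb the old priors into the new priors; that is,

$p(Y=i) = \dfrac{p_n(Y=i)}{p_o(Y=i)}, \quad i=1,\ldots,k+1$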
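The prior adjustment amounts to reweighting each old posterior by the ratio of new prior to old prior and renormalizing. A minimal Python sketch of that arithmetic follows; the function name `adjust_posteriors` and all numbers are hypothetical and serve only to illustrate the calculation, not to reproduce PROC LOGISTIC output.

```python
def adjust_posteriors(old_posteriors, old_priors, new_priors):
    """Re-weight old posteriors by new/old prior ratios and renormalize."""
    if not (len(old_posteriors) == len(old_priors) == len(new_priors)):
        raise ValueError("all inputs need one entry per response category")
    ratios = [pn / po for pn, po in zip(new_priors, old_priors)]
    weighted = [p * r for p, r in zip(old_posteriors, ratios)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical numbers: the event makes up 50% of the training data
# but only 1% of the population to be scored.
old_posteriors = [0.8, 0.2]  # model output for one observation
adjusted = adjust_posteriors(old_posteriors, [0.5, 0.5], [0.01, 0.99])
print([round(p, 4) for p in adjusted])  # [0.0388, 0.9612]
```

In this made-up case the event probability drops from 0.8 to about 0.039, which shows how strongly an oversampled rare category inflates the unadjusted posterior.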

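The posterior probabilities are functions of $\boldsymbol{\beta}$, and their estimates are obtained by replacing $\boldsymbol{\beta}$ with its MLE $\hat{\boldsymbol{\beta}}$. The variances of the estimated posterior probabilities are given by the delta method as follows:

$\mathrm{Var}\left(\hat{p}_n(Y=i|\mathbf{x})\right) = \left[\dfrac{\partial p_n(Y=i|\mathbf{x})}{\partial \boldsymbol{\beta}}\right]' \mathrm{Var}(\hat{\boldsymbol{\beta}}) \left[\dfrac{\partial p_n(Y=i|\mathbf{x})}{\partial \boldsymbol{\beta}}\right]$

where

$\dfrac{\partial p_n(Y=i|\mathbf{x})}{\partial \boldsymbol{\beta}} = \dfrac{\dfrac{\partial p_o(Y=i|\mathbf{x})}{\partial \boldsymbol{\beta}}\,p(Y=i)}{\sum_{j} p_o(Y=j|\mathbf{x})\,p(Y=j)} - \dfrac{p_o(Y=i|\mathbf{x})\,p(Y=i)\,\sum_{j} \dfrac{\partial p_o(Y=j|\mathbf{x})}{\partial \boldsymbol{\beta}}\,p(Y=j)}{\left[\sum_{j} p_o(Y=j|\mathbf{x})\,p(Y=j)\right]^{2}}$

and the old posterior probabilities $p_o(Y=i|\mathbf{x})$ are described in the following sections.

A $100(1-\alpha)\%$ confidence interval for $p_n(Y=i|\mathbf{x})$ is

$\hat{p}_n(Y=i|\mathbf{x}) \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}\left(\hat{p}_n(Y=i|\mathbf{x})\right)}$

where $z_{\tau}$ is the $100\tau$th percentile of the standard normal distribution (for example, $z_{0.975} \approx 1.96$).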
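The delta-method variance and the resulting interval can be sketched numerically. The Python below is an illustration under stated assumptions, not the procedure's implementation: the helper names are hypothetical, the covariance matrix is made up, and the gradient function applies the quotient rule to the prior-weighted posterior (old posterior times prior ratio, divided by the sum over categories).

```python
import math
from statistics import NormalDist


def quadratic_form(g, cov):
    """Return g' * cov * g for a vector g and a square matrix cov."""
    n = len(g)
    return sum(g[r] * cov[r][c] * g[c] for r in range(n) for c in range(n))


def new_posterior_gradient(i, old_post, old_grads, ratios):
    """Gradient of the prior-adjusted posterior for category i.

    old_post[j]  : old posterior for category j
    old_grads[j] : gradient of that old posterior w.r.t. the parameters
    ratios[j]    : new prior divided by old prior for category j
    """
    denom = sum(p * r for p, r in zip(old_post, ratios))
    dim = len(old_grads[0])
    weighted_grad_sum = [
        sum(old_grads[j][d] * ratios[j] for j in range(len(ratios)))
        for d in range(dim)
    ]
    return [
        old_grads[i][d] * ratios[i] / denom
        - old_post[i] * ratios[i] * weighted_grad_sum[d] / denom ** 2
        for d in range(dim)
    ]


def delta_method_interval(estimate, gradient, cov, alpha=0.05):
    """Wald-type limits: estimate +/- z * sqrt(gradient' cov gradient)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(quadratic_form(gradient, cov))
    return estimate - z * se, estimate + z * se


# Hypothetical binary logit fit: x = (1, 2) and eta = 0, so the old
# posterior of the event is 0.5, F'(eta) = 0.25, and the two category
# gradients are +/- F'(eta) * x.
cov = [[0.04, -0.01], [-0.01, 0.01]]  # made-up Var(beta_hat)
old_post = [0.5, 0.5]
old_grads = [[0.25, 0.5], [-0.25, -0.5]]
ratios = [1.0, 1.0]  # prior ratios of 1: no prior adjustment
grad = new_posterior_gradient(0, old_post, old_grads, ratios)
lower, upper = delta_method_interval(old_post[0], grad, cov)
```

Because the interval is symmetric on the probability scale, nothing in this sketch keeps the limits inside the unit interval when the estimate is close to 0 or 1.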

Binary and Cumulative Response Models

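Let $\alpha_1, \ldots, \alpha_k$ be the intercept parameters and let $\boldsymbol{\beta}$ be the vector of slope parameters. Denote $\boldsymbol{\theta} = (\alpha_1, \ldots, \alpha_k, \boldsymbol{\beta}')'$. Let

$\eta_i = \eta_i(\boldsymbol{\theta}) = \alpha_i + \mathbf{x}'\boldsymbol{\beta}, \quad i=1,\ldots,k$

where $\mathbf{x}'\boldsymbol{\beta}$ here involves only the $s$ slope covariates $x_1, \ldots, x_s$. Estimates of $\eta_1, \ldots, \eta_k$ are obtained by substituting the maximum likelihood estimate $\hat{\boldsymbol{\theta}}$ for $\boldsymbol{\theta}$.

The predicted probabilities of the responses are

$\hat{p}_o(Y=i|\mathbf{x}) = \widehat{\Pr}(Y=i) = \begin{cases} F(\hat{\eta}_1) & i=1 \\ F(\hat{\eta}_i) - F(\hat{\eta}_{i-1}) & i=2,\ldots,k \\ 1 - F(\hat{\eta}_k) & i=k+1 \end{cases}$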
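For a cumulative model, the category probabilities are successive differences of the cumulative probabilities, with the last cumulative probability equal to 1. The following Python sketch shows that bookkeeping; the function name `cumulative_model_probabilities`, the logistic default, and the intercept values are assumptions made for illustration.

```python
import math


def logistic(t):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-t))


def cumulative_model_probabilities(alphas, xb, inv_link=logistic):
    """Category probabilities for a cumulative model with k intercepts.

    alphas : increasing intercepts alpha_1 <= ... <= alpha_k
    xb     : the slope part x'beta of the linear predictor
    Returns k + 1 probabilities for Y = 1, ..., k + 1.
    """
    cumulative = [inv_link(a + xb) for a in alphas] + [1.0]  # Pr(Y <= i)
    probs = [cumulative[0]]
    for i in range(1, len(cumulative)):
        probs.append(cumulative[i] - cumulative[i - 1])
    return probs


# Hypothetical three-level response with intercepts -1 and 1, x'beta = 0.
probs = cumulative_model_probabilities([-1.0, 1.0], xb=0.0)
```

With a single intercept the function reduces to the binary case and returns the event probability together with its complement.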

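For $i=1,\ldots,k$, let $\boldsymbol{\delta}_i(\mathbf{x})$ be a $(k+s)$ column vector with $i$th entry equal to 1, $(k+j)$th entry equal to $x_j$ ( $j=1,\ldots,s$ ), and all other entries 0. The derivatives of $p_o(Y=i|\mathbf{x})$ with respect to $\boldsymbol{\theta}$ are

$\dfrac{\partial p_o(Y=i|\mathbf{x})}{\partial \boldsymbol{\theta}} = \begin{cases} F'(\alpha_1 + \mathbf{x}'\boldsymbol{\beta})\,\boldsymbol{\delta}_1(\mathbf{x}) & i=1 \\ F'(\alpha_i + \mathbf{x}'\boldsymbol{\beta})\,\boldsymbol{\delta}_i(\mathbf{x}) - F'(\alpha_{i-1} + \mathbf{x}'\boldsymbol{\beta})\,\boldsymbol{\delta}_{i-1}(\mathbf{x}) & i=2,\ldots,k \\ -F'(\alpha_k + \mathbf{x}'\boldsymbol{\beta})\,\boldsymbol{\delta}_k(\mathbf{x}) & i=k+1 \end{cases}$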

The cumulative posterior probabilities are

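$p_n(Y \le i|\mathbf{x}) = \dfrac{\sum_{j=1}^{i} p_o(Y=j|\mathbf{x})\,p(Y=j)}{\sum_{j=1}^{k+1} p_o(Y=j|\mathbf{x})\,p(Y=j)} = \sum_{j=1}^{i} p_n(Y=j|\mathbf{x}), \quad i=1,\ldots,k+1$

Their derivatives are

$\dfrac{\partial p_n(Y \le i|\mathbf{x})}{\partial \boldsymbol{\theta}} = \sum_{j=1}^{i} \dfrac{\partial p_n(Y=j|\mathbf{x})}{\partial \boldsymbol{\theta}}, \quad i=1,\ldots,k+1$

In the delta-method equation for the variance, replace $\dfrac{\partial p_n(Y=\cdot\,|\mathbf{x})}{\partial \boldsymbol{\beta}}$ with $\dfrac{\partial p_n(Y \le \cdot\,|\mathbf{x})}{\partial \boldsymbol{\theta}}$.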

Finally, for the cumulative response model, use

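$\begin{aligned} p_o(Y \le i|\mathbf{x}) &= F(\eta_i), & i&=1,\ldots,k \\ p_o(Y \le k+1|\mathbf{x}) &= 1 \\ \dfrac{\partial p_o(Y \le i|\mathbf{x})}{\partial \boldsymbol{\theta}} &= F'(\alpha_i + \mathbf{x}'\boldsymbol{\beta})\,\boldsymbol{\delta}_i(\mathbf{x}), & i&=1,\ldots,k \\ \dfrac{\partial p_o(Y \le k+1|\mathbf{x})}{\partial \boldsymbol{\theta}} &= \mathbf{0} \end{aligned}$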

Generalized Logit Model

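Consider the last response level ( $Y=k+1$ ) as the reference. Let $\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_k$ be the (intercept and slope) parameter vectors for the first $k$ logits, respectively. Denote $\boldsymbol{\theta} = (\boldsymbol{\beta}_1', \ldots, \boldsymbol{\beta}_k')'$. Let $\boldsymbol{\eta} = (\eta_1, \ldots, \eta_k)'$ with

$\eta_i = \eta_i(\boldsymbol{\theta}) = \mathbf{x}'\boldsymbol{\beta}_i, \quad i=1,\ldots,k$

Estimates of $\eta_i$ are obtained by substituting the maximum likelihood estimate $\hat{\boldsymbol{\theta}}$ for $\boldsymbol{\theta}$.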

The predicted probabilities are

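$\begin{aligned} \hat{p}_o(Y=k+1|\mathbf{x}) \equiv \widehat{\Pr}(Y=k+1|\mathbf{x}) &= \dfrac{1}{1+\sum_{l=1}^{k}\exp(\hat{\eta}_l)} \\ \hat{p}_o(Y=i|\mathbf{x}) \equiv \widehat{\Pr}(Y=i|\mathbf{x}) &= \hat{p}_o(Y=k+1|\mathbf{x})\,\exp(\hat{\eta}_i), \quad i=1,\ldots,k \end{aligned}$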
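With the last level as reference, the generalized logit probabilities take a softmax-like form in which the reference level has a linear predictor of zero. A minimal Python sketch follows; the function name `generalized_logit_probabilities` and the parameter values are hypothetical, and the shift by the largest linear predictor is a numerical-stability choice of this sketch rather than something stated in the text.

```python
import math


def generalized_logit_probabilities(x, betas):
    """Probabilities for Y = 1..k+1, with level k+1 as the reference.

    x     : covariate vector whose first entry is 1 (the intercept term)
    betas : k parameter vectors, one per non-reference logit
    """
    etas = [sum(xj * bj for xj, bj in zip(x, beta)) for beta in betas]
    shift = max(0.0, max(etas))  # subtract the largest eta to avoid overflow
    exps = [math.exp(eta - shift) for eta in etas]
    ref = math.exp(-shift)  # the reference level has eta = 0
    denom = ref + sum(exps)
    return [e / denom for e in exps] + [ref / denom]


# Hypothetical parameters chosen so that both linear predictors are zero,
# which makes all three levels equally likely.
probs = generalized_logit_probabilities([1.0, 2.0], [[0.5, -0.25], [-1.0, 0.5]])
```

The returned list orders the probabilities as levels 1 through k, followed by the reference level.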

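The derivatives of $p_o(Y=i|\mathbf{x})$ with respect to $\boldsymbol{\theta}$ are

$\begin{aligned} \dfrac{\partial p_o(Y=i|\mathbf{x})}{\partial \boldsymbol{\theta}} &= \dfrac{\partial \boldsymbol{\eta}}{\partial \boldsymbol{\theta}}\,\dfrac{\partial p_o(Y=i|\mathbf{x})}{\partial \boldsymbol{\eta}} \\ &= (\mathbf{I}_k \otimes \mathbf{x})\left(\dfrac{\partial p_o(Y=i|\mathbf{x})}{\partial \eta_1}, \ldots, \dfrac{\partial p_o(Y=i|\mathbf{x})}{\partial \eta_k}\right)' \end{aligned}$

where

$\dfrac{\partial p_o(Y=i|\mathbf{x})}{\partial \eta_j} = \begin{cases} p_o(Y=i|\mathbf{x})\left(1 - p_o(Y=i|\mathbf{x})\right) & j=i \\ -p_o(Y=i|\mathbf{x})\,p_o(Y=j|\mathbf{x}) & \text{otherwise} \end{cases}$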
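The chain rule above can be sketched in Python: the derivative with respect to each linear predictor is scaled by the covariate vector, block by block, which is the effect of the Kronecker product with the identity matrix. All function names are hypothetical, indices are 0-based, and the gradient is assumed to be ordered as the stacked parameter vectors of the first k logits.

```python
import math


def glogit_probs(etas):
    """Probabilities for levels 1..k+1 given the k linear predictors."""
    denom = 1.0 + sum(math.exp(e) for e in etas)
    return [math.exp(e) / denom for e in etas] + [1.0 / denom]


def dprob_deta(probs, i, j):
    """Derivative of the probability of level i w.r.t. eta_j (0-based)."""
    if i == j:
        return probs[i] * (1.0 - probs[i])
    return -probs[i] * probs[j]


def gradient_wrt_theta(probs, i, x):
    """Gradient of the probability of level i w.r.t. the stacked parameters.

    Block j of the result is x times the derivative w.r.t. eta_j, which is
    what multiplying by (I_k kron x) produces.
    """
    k = len(probs) - 1  # number of non-reference logits
    return [dprob_deta(probs, i, j) * xj for j in range(k) for xj in x]
```

Passing the index of the reference level (the last entry of `probs`) works as well, because every derivative then falls into the "otherwise" branch.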

Special Case of Binary Response Model with No Priors

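Let $\boldsymbol{\beta}$ be the vector of regression parameters. Let

$\eta = \eta(\boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}$

The variance of $\hat{\eta}$ is given by

$\mathrm{Var}(\hat{\eta}) = \mathbf{x}'\,\mathrm{Var}(\hat{\boldsymbol{\beta}})\,\mathbf{x}$

A $100(1-\alpha)$ percent confidence interval for $\eta$ is

$\hat{\eta} \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\eta})}$

Estimates of $p_o(Y=1|\mathbf{x})$ and confidence intervals for $p_o(Y=1|\mathbf{x})$ are obtained by back-transforming $\hat{\eta}$ and the confidence intervals for $\eta$, respectively. That is,

$\hat{p}_o(Y=1|\mathbf{x}) = F(\hat{\eta})$

and the confidence intervals are

$F\left(\hat{\eta} \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\eta})}\right)$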
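For the binary case without priors, the interval is built on the linear-predictor scale and then mapped through the inverse link. The Python sketch below assumes a logit link; the function name `back_transformed_interval`, the parameter estimates, and the covariance matrix are made up for illustration.

```python
import math
from statistics import NormalDist


def back_transformed_interval(x, beta_hat, cov, alpha=0.05):
    """Event probability and back-transformed Wald limits (logit link).

    x        : covariate vector including the leading 1 for the intercept
    beta_hat : estimated regression parameters
    cov      : estimated covariance matrix of beta_hat (list of rows)
    """
    eta = sum(xj * bj for xj, bj in zip(x, beta_hat))
    n = len(x)
    var_eta = sum(x[r] * cov[r][c] * x[c] for r in range(n) for c in range(n))
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * math.sqrt(var_eta)

    def inv_logit(t):
        return 1.0 / (1.0 + math.exp(-t))

    return inv_logit(eta), inv_logit(eta - half_width), inv_logit(eta + half_width)


# Hypothetical fit: eta_hat = -1 + 0.5 * 2 = 0 and Var(eta_hat) = 0.04.
p_hat, lower, upper = back_transformed_interval(
    x=[1.0, 2.0],
    beta_hat=[-1.0, 0.5],
    cov=[[0.04, -0.01], [-0.01, 0.01]],
)
```

Because the inverse link is monotone and maps into the unit interval, limits built this way always lie strictly between 0 and 1, unlike limits formed directly on the probability scale.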
