The HPLOGISTIC Procedure

MODEL Statement

  • MODEL response <(response-options)> = <effects> </ model-options>;

  • MODEL events / trials <(response-options)> = <effects> </ model-options>;

The MODEL statement defines the statistical model in terms of a response variable (the target) or an events/trials specification, model effects constructed from variables in the input data set, and options. An intercept is included in the model by default. You can remove the intercept with the NOINT option.

You can specify a single response variable that contains your binary, ordinal, or nominal response values. When you have binomial data, you can specify the events/trials form of the response, where one variable contains the number of positive responses (or events) and another variable contains the number of trials. The values of both events and (trials − events) must be nonnegative, and the value of trials must be positive.
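As a minimal sketch of the events/trials form, the following statements assume a hypothetical data set MyBinomialData in which nEvents holds the number of events and nTrials holds the number of trials for each row. All data set and variable names here are illustrative assumptions.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc hplogistic data=MyBinomialData;
   class A;
   /* nEvents = number of positive responses, nTrials = number of trials */
   model nEvents/nTrials = A x1 x2;
run;
```

Rows in which nEvents exceeds nTrials would violate the requirement that (trials − events) be nonnegative.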

For information about constructing the model effects, see the section Specification and Parameterization of Model Effects of Chapter 4: Shared Statistical Concepts.

There are two sets of options in the MODEL statement. The response-options determine how the HPLOGISTIC procedure models probabilities for binary data. The model-options control other aspects of model formation and inference. Table 9.3 summarizes these options.

Table 9.3: MODEL Statement Options

Option          Description

Response Variable Options
DESCENDING      Reverses the response categories
EVENT=          Specifies the event category
ORDER=          Specifies the sort order
REF=            Specifies the reference category

Model Options
ALPHA=          Specifies the confidence level for confidence limits
ASSOCIATION     Requests association statistics
CL              Requests confidence limits
CTABLE          Requests classification statistics
CUTPOINT=       Specifies a cutpoint for binary classification
DDFM=           Specifies the degrees-of-freedom method
INCLUDE=        Includes effects in all models for model selection
LACKFIT         Requests the Hosmer and Lemeshow goodness-of-fit test
LINK=           Specifies the link function
NOCHECK         Suppresses checking for infinite parameters
NOINT           Suppresses the intercept
OFFSET=         Specifies the offset variable
PRIOR=          Specifies prior probabilities
RSQUARE         Requests a generalized coefficient of determination
START=          Includes effects in the initial model for model selection


Response Variable Options

Response variable options determine how the HPLOGISTIC procedure models probabilities for binary and multinomial data.

You can specify the following response-options by enclosing them in parentheses after the response or trials variable.

DESCENDING
DESC

reverses the order of the response categories. If both the DESCENDING and ORDER= options are specified, PROC HPLOGISTIC orders the response categories according to the ORDER= option and then reverses that order.

EVENT=’category’ | FIRST | LAST

specifies the event category for the binary response model. PROC HPLOGISTIC models the probability of the event category. The EVENT= option has no effect when there are more than two response categories.

You can specify the value (formatted, if a format is applied) of the event category in quotes, or you can specify one of the following:

FIRST

designates the first ordered category as the event. This is the default.

LAST

designates the last ordered category as the event.

For example, the following statements specify that observations with formatted value '1' represent events in the data. The probability modeled by the HPLOGISTIC procedure is thus the probability that the variable def takes on the (formatted) value '1'.

proc hplogistic data=MyData;
   class A B C;
   model def(event='1') = A B C x1 x2 x3;
run;

ORDER=DATA | FORMATTED | INTERNAL
ORDER=FREQ | FREQDATA | FREQFORMATTED | FREQINTERNAL

specifies the sort order for the levels of the response variable. When ORDER=FORMATTED (the default) is in effect for a numeric variable that has no explicit format (that is, a variable for which there is no corresponding FORMAT statement in the current PROC HPLOGISTIC run or in the DATA step that created the data set), the levels are ordered by their internal (numeric) value. The following table shows the interpretation of the ORDER= option:

ORDER=          Levels Sorted By

DATA            Order of appearance in the input data set
FORMATTED       External formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ            Descending frequency count (levels with the most observations come first in the order)
FREQDATA        Order of descending frequency count; by order of appearance in the input data set when counts are tied
FREQFORMATTED   Order of descending frequency count; by formatted value (as above) when counts are tied
FREQINTERNAL    Order of descending frequency count; by unformatted value when counts are tied
INTERNAL        Unformatted value

By default, ORDER=FORMATTED. For the FORMATTED and INTERNAL orders, the sort order is machine-dependent.

For more information about sort order, see the chapter on the SORT procedure in the Base SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts.

REF=’category’ | FIRST | LAST

specifies the reference category for the generalized logit model and the binary response model. For the generalized logit model, each logit contrasts a nonreference category with the reference category. For the binary response model, specifying one response category as the reference is the same as specifying the other response category as the event category. You can specify the value (formatted if a format is applied) of the reference category in quotes, or you can specify one of the following:

FIRST

designates the first ordered category as the reference.

LAST

designates the last ordered category as the reference. This is the default.
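The response options can be combined inside one set of parentheses. The following sketch assumes a hypothetical nominal response rating in a data set MyData; it orders the categories by unformatted value and uses the first ordered category, rather than the default last one, as the reference for a generalized logit model (LINK=GLOGIT is described under Model Options).

```sas
/* Hypothetical data set, response, and predictors */
proc hplogistic data=MyData;
   class A;
   /* ORDER= fixes the category order; REF=FIRST picks the first ordered
      category as the reference for each generalized logit */
   model rating(order=internal ref=first) = A x1 / link=glogit;
run;
```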

Model Options

ALPHA=number

requests that confidence intervals for each of the parameters be constructed with confidence level 1–number. The value of number must be between 0 and 1; the default is 0.05.

ASSOCIATION

displays measures of association between predicted probabilities and observed responses for binary or binomial response models. These measures assess the predictive ability of the model. The displayed statistics are the concordance index C (the area under the ROC curve, AUC), Somers’ D statistic (Gini’s coefficient), the Goodman-Kruskal gamma statistic, and Kendall’s tau-a statistic. For more information, see the section Association Statistics.

CL

requests that confidence limits be constructed for each of the parameter estimates. The confidence level is 0.95 by default; this can be changed with the ALPHA= option.
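For illustration, the following sketch requests 90% confidence limits by combining CL with ALPHA=0.1, so that the confidence level is 1 − 0.1 = 0.90. The data set and variable names are hypothetical.

```sas
/* Hypothetical data set and variables */
proc hplogistic data=MyData;
   /* CL requests confidence limits; ALPHA=0.1 sets the level to 0.90 */
   model def(event='1') = x1 x2 / cl alpha=0.1;
run;
```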

CTABLE<=SAS-data-set>
OUTROC<=SAS-data-set>

displays a table for binary or binomial response models that contains the frequencies of observations that are correctly and incorrectly classified as events and nonevents, the sensitivity, the 1–specificity, the positive and negative predictive values, and the correct classification rate. For more information, see the section Classification Table and ROC Curves.

Classification is carried out by initially binning the predicted probabilities as discussed in the section The Hosmer-Lemeshow Goodness-of-Fit Test. The PRIOR= option does not change the reported predicted probabilities.

Because the number of cutpoints can be very large, you can store the table in an output data set. If you specify a PARTITION statement, then the statistics are computed by their roles, and a Role variable indicates to which partition the computations belong.
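A brief sketch of storing the classification table in a data set, as suggested above. The output data set name MyCtable and the input names are assumptions chosen for illustration.

```sas
/* Hypothetical data set and variables */
proc hplogistic data=MyData;
   /* CTABLE=MyCtable writes the classification statistics to a data set;
      ASSOCIATION additionally requests the association measures */
   model def(event='1') = x1 x2 / ctable=MyCtable association;
run;
```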

CUTPOINT=value

specifies a value between 0 and 1 used for classifying observations when you have a binary or binomial response variable. If the predicted probability of an observation equals or exceeds the cutpoint, the observation is classified as an event; otherwise it is classified as a nonevent. This option affects computation of the misclassification rate and the true positive and true negative fractions in the "Partition Fit Statistics" table. By default, CUTPOINT=0.5.

DDFM=RESIDUAL | NONE

specifies how degrees of freedom for statistical inference are determined in the "Parameter Estimates" table.

The HPLOGISTIC procedure always displays the statistical tests and confidence intervals in the "Parameter Estimates" tables in terms of a t test and a two-sided probability from a t distribution. With the DDFM= option, you can control the degrees of freedom of this t distribution and thereby switch between small-sample inference and large-sample inference based on the normal or chi-square distribution.

The default is DDFM=NONE, which leads to z-based statistical tests and confidence intervals. The HPLOGISTIC procedure then displays the degrees of freedom in the DF column as Infty, the p-values are identical to those from a Wald chi-square test, and the square of the t value equals the Wald chi-square statistic.

If you specify DDFM=RESIDUAL, the degrees of freedom are finite and determined by the number of usable frequencies (observations) minus the number of nonredundant model parameters. This leads to t-based statistical tests and confidence intervals. If the number of frequencies is large relative to the number of parameters, the inferences from the two degrees-of-freedom methods are almost identical.
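The following sketch switches from the default z-based inference to t-based inference. The data set and variable names are hypothetical.

```sas
/* Hypothetical data set and variables */
proc hplogistic data=MyData;
   /* DDFM=RESIDUAL: finite degrees of freedom, computed as the number of
      usable frequencies minus the number of nonredundant parameters */
   model def(event='1') = x1 x2 / ddfm=residual cl;
run;
```

Omitting DDFM=RESIDUAL (or specifying DDFM=NONE) would return to the large-sample inference described above.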

INCLUDE=n
INCLUDE=single-effect
INCLUDE=(effects)

forces effects to be included in all models. If you specify INCLUDE=n, then the first n effects that are listed in the MODEL statement are included in all models. If you specify INCLUDE=single-effect or if you specify a list of effects within parentheses, then the specified effects are forced into all models. The effects that you specify in the INCLUDE= option must be explanatory effects that are specified in the MODEL statement before the slash (/).
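As a sketch of the INCLUDE=n form, the following statements assume a forward selection run in which the first two effects listed in the MODEL statement, A and x1, are forced into every model. The data set and variable names are hypothetical.

```sas
/* Hypothetical data set and variables */
proc hplogistic data=MyData;
   class A;
   /* INCLUDE=2 forces the first two listed effects (A and x1) into all models */
   model def(event='1') = A x1 x2 x3 / include=2;
   selection method=forward;
run;
```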

LACKFIT<(DFREDUCE=r NGROUPS=G)>

performs the Hosmer and Lemeshow goodness-of-fit test (Hosmer and Lemeshow, 2000) for binary response models.

The subjects are divided into at most G groups of roughly the same size, based on the percentiles of the estimated probabilities. You can specify G as any integer greater than or equal to 5; by default, G=10. Let the actual number of groups created be g. The discrepancies between the observed and expected number of observations in these g groups are summarized by the Pearson chi-square statistic, which is then compared to a chi-square distribution with $g - r$ degrees of freedom. You can specify a nonnegative integer r that satisfies $g - r \ge 1$; by default, r=2.

A small p-value suggests that the fitted model is not an adequate model. For more information, see the section The Hosmer-Lemeshow Goodness-of-Fit Test.
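A minimal sketch of the LACKFIT suboptions, with hypothetical data set and variable names. Here at most 8 groups are requested and the default reduction r=2 is written out explicitly.

```sas
/* Hypothetical data set and variables */
proc hplogistic data=MyData;
   /* At most 8 groups; reference distribution has g - 2 degrees of freedom,
      where g is the number of groups actually created */
   model def(event='1') = x1 x2 / lackfit(dfreduce=2 ngroups=8);
run;
```

If all 8 groups are actually formed, the Pearson statistic is compared to a chi-square distribution with 8 − 2 = 6 degrees of freedom; if fewer groups are created, the degrees of freedom shrink accordingly.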

LINK=keyword

specifies the link function for the model. The keywords and the associated link functions are shown in Table 9.4.

Table 9.4: Built-in Link Functions of the HPLOGISTIC Procedure

 

LINK=               Link Function            $g(\mu ) =\eta = $

CLOGLOG | CLL       Complementary log-log    $\log (-\log (1-\mu ))$
GLOGIT | GENLOGIT   Generalized logit
LOGIT               Logit                    $\log (\mu /(1-\mu ))$
LOGLOG              Log-log                  $-\log (-\log (\mu ))$
PROBIT              Probit                   $\Phi ^{-1}(\mu )$


For the probit and cumulative probit links, $\Phi ^{-1}(\cdot )$ denotes the quantile function of the standard normal distribution.

If the response variable has more than two categories, the HPLOGISTIC procedure fits a model with a cumulative link function based on the specified link. However, if you specify LINK=GLOGIT, the procedure assumes a generalized logit model for nominal (unordered) data, regardless of the number of response categories.
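For reference, solving each binary link function in Table 9.4 for $\mu $ gives the corresponding inverse link, which maps the linear predictor $\eta $ back to a probability. These expressions follow algebraically from the table entries and are given here only as a reading aid:

\[  \begin{array}{ll} \mbox{LOGIT:} &  \mu = \frac{1}{1+\exp (-\eta )} \\ \mbox{PROBIT:} &  \mu = \Phi (\eta ) \\ \mbox{CLOGLOG:} &  \mu = 1-\exp (-\exp (\eta )) \\ \mbox{LOGLOG:} &  \mu = \exp (-\exp (-\eta )) \end{array}  \]

For example, under LINK=LOGIT a linear predictor of $\eta = 0$ corresponds to $\mu = 0.5$.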

NOCHECK

disables the checking process that determines whether maximum likelihood estimates of the regression parameters exist. For more information, see the section Existence of Maximum Likelihood Estimates.

NOINT

requests that no intercept be included in the model. An intercept is included by default. The NOINT option is not available in multinomial models.

OFFSET=variable

specifies a variable to be used as an offset to the linear predictor. An offset plays the role of an effect whose coefficient is known to be 1. The offset variable cannot appear in the CLASS statement or elsewhere in the MODEL statement. Observations with missing values for the offset variable are excluded from the analysis.
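As a sketch, the following statements assume a hypothetical numeric variable off in the data set MyData. With this specification the linear predictor takes the form $\eta = \mathbf{x}'\bbeta + \mathit{off}$, where the coefficient of off is fixed at 1 rather than estimated.

```sas
/* Hypothetical data set and variables; off must not appear in the CLASS
   statement or elsewhere in the MODEL statement */
proc hplogistic data=MyData;
   model def(event='1') = x1 x2 / offset=off;
run;
```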

PRIOR=SAS-data-set
PRIOR=number
PEVENT=number
PRIOR=ALLDATA

specifies prior probabilities (prevalences) that are used for computing posterior predicted probabilities. When you know what percentage of the population has a rare event and you oversample that rare event, specifying the prior probabilities as the prevalence of events in your population enables you to produce posterior probabilities that reflect the population, not the data.

You can specify your priors in a SAS data set in which a _PRIOR_ column contains the prior probabilities. For events/trials MODEL statement syntax, this data set should also include an _OUTCOME_ variable that contains the values EVENT and NONEVENT; for single-trial syntax, this data set should include the response variable that contains the unformatted response categories. Each row of the data set contains a unique response variable level and its prior. For binary and binomial response models, you can instead specify the probability of an event as number. If you also specify a PARTITION statement, you can specify PRIOR=ALLDATA to compute the prevalences as the observed proportions of the response levels gathered across all the roles.
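The following sketch shows one way the prior data set could be laid out for single-trial syntax. It assumes a hypothetical binary response def with unformatted values 0 and 1 and an assumed population event prevalence of 10%; the data set names MyPriors and MyData are illustrative.

```sas
/* One row per response level: the unformatted level and its prior.
   Names and prevalence values are assumptions for illustration. */
data MyPriors;
   input def _PRIOR_;
   datalines;
0 0.9
1 0.1
;

proc hplogistic data=MyData;
   model def(event='1') = x1 x2 / prior=MyPriors;
run;
```

For a binary response such as this one, the same prevalence could instead be given as a number, as described above.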

If your response Y takes values $i=1,\ldots ,k$ that have observed empirical probabilities $\mathrm{OldPrior}_ i=\frac{n_ i}{n}$, you specify priors $\mathrm{Prior}_ i$, and your model predicted probabilities are $\hat{p}_ i$, then the posterior predicted probabilities $\mathrm{Post}_ i$ are computed as

\[  \mathrm{Post}_ i = \frac{\hat{p}_ i \frac{\mathrm{Prior}_ i}{\mathrm{OldPrior}_ i}}{\sum _{j=1}^ k\hat{p}_ j \frac{\mathrm{Prior}_ j}{\mathrm{OldPrior}_ j}}  \]

The POST= option in the OUTPUT statement writes the posterior to the output data set. If your priors are identical to the empirical probabilities, then the posteriors are identical to the model-predicted probabilities.
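As a worked illustration with made-up numbers, suppose a binary response in which events were oversampled so that $\mathrm{OldPrior}_{\mathrm{event}} = \mathrm{OldPrior}_{\mathrm{nonevent}} = 0.5$, the specified priors are $\mathrm{Prior}_{\mathrm{event}} = 0.1$ and $\mathrm{Prior}_{\mathrm{nonevent}} = 0.9$, and the model predicts $\hat{p}_{\mathrm{event}} = 0.6$ for some observation. The prior ratios are $0.1/0.5 = 0.2$ and $0.9/0.5 = 1.8$, so

\[  \mathrm{Post}_{\mathrm{event}} = \frac{0.6 \times 0.2}{0.6 \times 0.2 + 0.4 \times 1.8} = \frac{0.12}{0.84} \approx 0.143  \]

and $\mathrm{Post}_{\mathrm{nonevent}} = 0.72/0.84 \approx 0.857$. The posterior probability of an event is pulled well below the model-predicted 0.6 because events are much rarer in the population than in the sample.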

The priors do not affect the model-fitting process, but instead modify the following statistics in the "Partition Fit Statistics" table: false positive fraction, false negative fraction, false response fraction (for multinomial response models), and misclassification rate. The "Classification" table statistics PPV, NPV, and Percent Correct are also adjusted as described in the section Classification Table and ROC Curves.

RSQUARE
R2

requests a generalized coefficient of determination (R square, $R^2$) and a scaled version thereof for the fitted model. The results are added to the "Fit Statistics" table. For more information about the computation of these measures, see the section Generalized Coefficient of Determination.

START=n
START=single-effect
START=(effects)

begins the selection process from the designated initial model for the FORWARD and STEPWISE selection methods. If you specify START=n, then the starting model includes the first n effects that are listed in the MODEL statement. If you specify START=single-effect or if you specify a list of effects within parentheses, then the starting model includes those specified effects. The effects that you specify in the START= option must be explanatory effects that are specified in the MODEL statement before the slash (/). The START= option is not available when you specify METHOD=BACKWARD in the SELECTION statement.
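A sketch of the parenthesized-list form of START=, assuming a stepwise selection run that begins from a model containing A and x1. The data set and variable names are hypothetical.

```sas
/* Hypothetical data set and variables */
proc hplogistic data=MyData;
   class A;
   /* Selection starts from the model that contains A and x1 */
   model def(event='1') = A x1 x2 x3 / start=(A x1);
   selection method=stepwise;
run;
```

Unlike INCLUDE=, the START= option only sets the initial model for the selection process.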