The time and memory requirements in PROC LOGISTIC are largely a function of the number of parameters in each model and the number of models being fit. Large amounts of time and/or memory can be required by PROC LOGISTIC if there are many model parameters when model selection is not used, or if there are too many candidate models (perhaps also with too many parameters in the larger candidates) when model selection is used. Model selection methods are invoked by the SELECTION= option in the MODEL statement. Keep in mind that PROC LOGISTIC uses an iterative, maximum likelihood algorithm to fit each model, and the number of iterations needed cannot be known in advance. See *Computational Resources* in the Details section of the PROC LOGISTIC documentation for details on time and memory requirements.

Beginning in SAS® 9.4, the HPLOGISTIC and HPGENSELECT procedures can also be used to fit logistic models. These procedures are multithreaded, which can allow for faster performance, particularly when run in distributed mode on an appliance that consists of a cluster of nodes. Faster performance might also be obtained on a single machine with several processors. For more information, see "Shared Concepts and Topics" and the HPLOGISTIC or HPGENSELECT chapters in *SAS/STAT User's Guide: High-Performance Procedures*.
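
As a minimal sketch of the multithreaded alternative, the following fits a binary logistic model with PROC HPLOGISTIC on a single machine. The data set `mydata`, the response `y`, the predictors `c1`, `x1`, and `x2`, and the thread count are hypothetical placeholders, not values from this note.

```sas
/* Hypothetical data set and variable names; NTHREADS= value is illustrative */
proc hplogistic data=mydata;
   class c1;
   model y(event='1') = c1 x1 x2;
   performance nthreads=4 details;   /* DETAILS requests a timing table */
run;
```

Comparing the timing table across a few NTHREADS= values is one way to judge whether additional threads help for a given problem.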

Discussed below are some common causes of excessive time and/or memory use and possible remedies.

**Too many response levels**- Check to make sure that the response variable has only two or only a few (usually 10 or fewer) distinct values, or that events/trials syntax is used in the MODEL statement. To check, run PROC FREQ on the response variable. For example:
proc freq data=mydata nlevels; table y / noprint; run;

The response variable should never be continuous. Attempting to model a continuous response can easily require excessive time and memory because it will add many intercept parameters to the model. Other modeling procedures (REG, GLM, GLIMMIX, GENMOD, etc.) should be considered if the response is continuous.
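
When the data are already grouped, the events/trials syntax mentioned above keeps the response binomial no matter how many distinct counts occur. A minimal sketch, assuming a hypothetical data set `grouped` with one row per covariate pattern, an event count `r`, a trial count `n`, and a predictor `dose`:

```sas
/* Hypothetical names: r = number of events, n = number of trials */
proc logistic data=grouped;
   model r/n = dose;
run;
```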

**Too many parameters**- The time and memory required can be large if there are many variables or effects in the model. Even if there are relatively few variables specified, the presence of CLASS variables (that is, variables listed in the CLASS statement) or the specification of many interaction or nested effects can result in a large number of model parameters. This can happen with a deceptively short MODEL statement.
For example, even if there is only one variable in the model, the number of parameters will be large if the number of levels of the variable is large and the variable is a CLASS variable.
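
One way to gauge this before fitting is to count the levels of each intended CLASS variable. The sketch below reuses the NLEVELS option shown earlier; `mydata`, `a`, `b`, and `c` are hypothetical names.

```sas
/* Prints only the Number of Variable Levels table for a, b, and c */
proc freq data=mydata nlevels;
   tables a b c / noprint;
run;
```

Under a full-rank coding such as the default PARAM=EFFECT, a CLASS variable with k levels contributes k-1 parameters to its main effect, and an interaction of two CLASS variables contributes the product of their individual counts.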

The following model involves only 10 variables, but the use of the vertical bars includes all possible interactions up to and including the 10-way interaction, resulting in a model with over 1,000 parameters. The same MODEL statement could result in an even larger number of parameters if some of the variables were CLASS variables with multiple levels.

proc logistic; model y = a|b|c|d|e|f|g|h|i|j; run;

A model including only main effects and two-way interactions is much smaller and frequently all that is needed. Use the @2 modifier to request this model.

proc logistic; model y = a|b|c|d|e|f|g|h|i|j@2; run;
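
As a rough check on the two models above, assume that all 10 variables are continuous, so that each effect contributes one parameter. The number of effects is then:

$$
\underbrace{\sum_{k=1}^{10}\binom{10}{k}}_{\text{all interactions}} = 2^{10}-1 = 1023,
\qquad
\underbrace{\binom{10}{1}+\binom{10}{2}}_{\text{with @2}} = 10+45 = 55.
$$

Adding the intercept gives 1,024 and 56 parameters, respectively.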

When there are many candidate variables or effects, consider using the SELECTION= option to screen variables and to build a model. See comments on the use of SELECTION= later in this note.

**Large input data set**- If the input data set is too large to be held in memory, PROC LOGISTIC must spend additional time reading it from disk. As the data set size increases, execution time can jump sharply once the data no longer fit in memory. With very large data sets, such as those with a very large number of observations, there might not be enough memory available. The MULTIPASS option in the PROC LOGISTIC statement can be used to reread the data as needed at the cost of increased execution time. Also consider performing model selection via the SELECTION= option on a random subset of the data. You can then validate the model on the remaining data. See comments on the use of SELECTION= later in this note.
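
The subset-then-validate approach can be sketched as follows. The data set `mydata`, the variables `y` and `x1`-`x20`, the 10% sampling rate, and the seed are all hypothetical choices for illustration.

```sas
/* Flag a 10% simple random sample; OUTALL keeps every row and adds Selected */
proc surveyselect data=mydata out=split method=srs samprate=0.10
                  seed=20180605 outall;
run;

/* Select a model on the sample, then score the held-out rows */
proc logistic data=split(where=(Selected=1));
   model y(event='1') = x1-x20 / selection=stepwise;
   score data=split(where=(Selected=0)) out=holdout fitstat;
run;

/* If memory rather than time is the constraint, MULTIPASS rereads the data */
proc logistic data=mydata multipass;
   model y(event='1') = x1-x20;
run;
```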

**Using full-rank parameterization**- When the CLASS statement is specified, full-rank parameterizations of CLASS variables, such as the default PARAM=EFFECT or PARAM=REF, are less efficient than PARAM=GLM, as described in this note.
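
A minimal sketch of requesting the GLM parameterization, with hypothetical data set and variable names:

```sas
/* PARAM=GLM applies to all variables listed in this CLASS statement */
proc logistic data=mydata;
   class c1 c2 / param=glm;
   model y(event='1') = c1 c2 x1;
run;
```

The coding of the CLASS variables differs from the default, so the individual parameter estimates should be interpreted according to the GLM parameterization.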

**Initial parameters too far from final parameters**- When the initial parameters are far from the final parameters, the procedure might need many iterations to reach the solution. This generally isn't a problem when the model has few parameters and the data set is not large, but otherwise the time needed can become large. You might be able to reduce the number of iterations, and therefore the time to fit the model, with the following strategy. First, fit the desired model using a relatively small, random subset of the data. Make sure that all levels of all CLASS variables appear in the subset. Use the OUTEST= option to save the parameter estimates. A small subset that can be held in memory should require little time and give good starting values. Then run PROC LOGISTIC on the full data set with the INEST= option to use the saved estimates as starting values. Starting closer to the solution should mean that fewer iterations are necessary. Note that it is not possible to know in advance the number of iterations that will be needed to find the solution for any given data set and model.
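
This strategy can be sketched as follows. The data set `mydata`, the variables, the 5% sampling rate, and the seed are hypothetical; the PROC FREQ steps are one possible way to confirm that the subset contains every CLASS level.

```sas
/* 1. Draw a small random subset (rate and seed are illustrative) */
proc surveyselect data=mydata out=sub method=srs samprate=0.05 seed=2718;
run;

/* 2. Compare level counts of the CLASS variables in the subset and full data */
proc freq data=sub nlevels;
   tables c1 c2 / noprint;
run;
proc freq data=mydata nlevels;
   tables c1 c2 / noprint;
run;

/* 3. Fit on the subset and save the estimates */
proc logistic data=sub outest=start;
   class c1 c2;
   model y(event='1') = c1 c2 x1 x2;
run;

/* 4. Refit the same model on the full data, starting from the saved estimates */
proc logistic data=mydata inest=start;
   class c1 c2;
   model y(event='1') = c1 c2 x1 x2;
run;
```

The CLASS and MODEL statements should be identical in steps 3 and 4 so that the saved estimates line up with the parameters of the full-data model.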

**Using SELECTION= with many variables**- Using the SELECTION= option can also require a great deal of time. SELECTION=BACKWARD, while most efficient when used with the FAST option, is very often not feasible in this situation because it starts with the largest, most difficult model to fit. Use FORWARD or STEPWISE instead.

The time that is needed depends on the number of candidate effects and the entry and removal criteria (SLE= and SLS=). As you add candidate effects, increase the SLE= setting, or decrease the SLS= setting, you tend to increase the number of steps in the selection process, requiring more time. And as the models get larger in subsequent steps, the memory that is needed to fit each model increases. The time that is needed also depends on the data itself. Two data sets with the same number of variables and observations, using the same entry and removal criteria, can require vastly different numbers of steps in the selection process. For example, one data set might have only one significant variable, requiring one step, while another data set might have many significant variables that are correlated, requiring many entry and removal steps.

If there is a large number of variables (say, over 30), use the SELECTION=FORWARD, STOP=1, and DETAILS options together. The Analysis of Effects Eligible for Entry table enables you to efficiently eliminate variables that have no association with the response. You can then select a model from those that remain. Also, try to remove any variable that is highly correlated with any other variable, because it will not contribute anything new to the model. When there are fewer than about 60 candidate predictors, SELECTION=SCORE is more efficient than SELECTION=STEPWISE, because no model fitting is done. Only the global model score statistic is given for each possible model.

When you are using the SELECTION= option, several additional options can be useful in limiting the number of models considered and reducing the time and memory used. See the descriptions of the BEST=, INCLUDE=, SEQUENTIAL, START=, STOP=, and STOPRES options in the MODEL statement. An alternative method that might be faster is the LASSO method available in PROC HPGENSELECT beginning in SAS 9.4 TS1M3.
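
The screening and best-subsets suggestions above can be sketched as follows. The data set, the 40 candidate variables, and the shortlist in the second step are hypothetical; in practice the shortlist would come from inspecting the screening table.

```sas
/* Screening: DETAILS prints the Analysis of Effects Eligible for Entry table,
   and STOP=1 ends selection once one effect has entered */
proc logistic data=mydata;
   model y(event='1') = x1-x40 / selection=forward stop=1 details;
run;

/* Best subsets by score statistic on the retained candidates;
   BEST=, START=, and STOP= limit the number and size of models listed */
proc logistic data=mydata;
   model y(event='1') = x1 x4 x7 x9 x12
         / selection=score best=3 start=1 stop=4;
run;
```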

**Using the EXACT statement**- Exact logistic regression should be used only with relatively small data sets. The Monte Carlo method (specify the EXACTOPTIONS METHOD=NETWORKMC; statement) can be used for small to moderate-size problems. Note that these methods usually produce more conservative results, so they are mostly needed when sample sizes are small and the *p*-values from the usual (asymptotic) tests are less than 0.10. If the usual *p*-values are larger than 0.15, the exact results are likely to be about the same. If insufficient memory is available, you can try specifying the EXACTOPTIONS ONDISK; statement to use disk space rather than memory at the cost of additional execution time. Exact logistic regression is an extremely memory- and computation-intensive method and can take a great deal of time and memory. It is not possible to know in advance how much time or memory a given problem will take. Specify the EXACTOPTIONS STATUSTIME=*x*; statement in order to have a status line printed to the SAS log every *x* seconds. See *Computational Resources for Exact Logistic Regression* in the Details section of the PROC LOGISTIC documentation for additional discussion of time and memory requirements and suggestions for minimizing them.

**Using the STRATA statement (without EXACT)**- The method needed to fit the conditional model is similar to the method used for exact analysis and is more computationally intensive than for the unconditional model. So, increased execution time can be expected. If the OUTPUT statement is also specified to request additional statistics (such as predicted probabilities or regression diagnostics), the algorithm required for computing those statistics is itself very intensive and can greatly increase the time needed with conditional models. The time needed grows with the data set size and the sizes of the strata. Omitting the OUTPUT statement avoids this additional time. The CHECK=ALL option in the STRATA statement can also add a large amount of time. To avoid this additional time, use the default CHECK=COVARIATES or CHECK=NONE. The conditional model can also be fit in PROC PHREG and might require less time. See the example in the PROC PHREG documentation.
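
The following sketch illustrates the options discussed in the two sections above. All data set and variable names (`small`, `matched`, `id`, `case`, `x1`, `x2`) are hypothetical, and the PHREG step assumes the commonly used device of a dummy time variable with controls treated as censored; consult the PROC PHREG documentation example for the authoritative version.

```sas
/* Exact test and estimate for x1 using Monte Carlo sampling, with a status
   line written to the log every 60 seconds */
proc logistic data=small;
   model y(event='1') = x1 x2;
   exact x1 / estimate=both;
   exactoptions method=networkmc statustime=60;
run;

/* Conditional logistic model in PROC LOGISTIC, with no OUTPUT statement */
proc logistic data=matched;
   strata id / check=covariates;
   model case(event='1') = x1 x2;
run;

/* The same conditional model in PROC PHREG: cases (case=1) get the earlier
   dummy time, and controls (case=0) are treated as censored */
data matched2;
   set matched;
   time = 2 - case;
run;

proc phreg data=matched2;
   strata id;
   model time*case(0) = x1 x2 / ties=discrete;
run;
```

If memory rather than time is the limiting factor for the exact analysis, adding ONDISK to the EXACTOPTIONS statement is the option described above.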

| Product Family | Product | System | SAS Release: Reported | SAS Release: Fixed |
| --- | --- | --- | --- | --- |
| SAS System | SAS/STAT | All | n/a | |

Type: Usage Note

Priority: low

Topic: SAS Reference ==> Procedures ==> LOGISTIC; Analytics ==> Categorical Data Analysis

Date Modified: 2018-06-05 17:56:54

Date Created: 2002-12-16 10:56:38