Usage Note 22607: Preventing excessive time or memory use by PROC LOGISTIC
The time and memory requirements in PROC LOGISTIC are largely a function of the number of parameters in each model and the number of models being fit. Large amounts of time and/or memory can be required by PROC LOGISTIC if there are many model parameters when model selection is not used, or if there are too many candidate models (perhaps also with too many parameters in the larger candidates) when model selection is used. Model selection methods are invoked by the SELECTION= option in the MODEL statement. Remember that PROC LOGISTIC uses an iterative, maximum likelihood algorithm to fit each model, and the number of iterations needed cannot be known in advance. See Computational Resources in the Details section of the PROC LOGISTIC documentation (SAS Note 22930) for details about time and memory requirements.
Beginning in SAS® 9.4, the HPLOGISTIC and HPGENSELECT procedures can also be used to fit logistic models. Logistic models can also be fit in the LOGSELECT and GENSELECT procedures in SAS® Viya®. These procedures are multithreaded, which can allow for faster performance as the number of observations becomes large, particularly when run in a distributed mode on an appliance that consists of a cluster of nodes. Faster performance might also be obtained on a single machine with several processors. For more information, see "Shared Concepts and Topics" and the HPLOGISTIC or HPGENSELECT chapters in SAS/STAT User's Guide: High-Performance Procedures.
Discussed below are some common causes of excessive time and/or memory use and possible remedies.
- Too many response levels
- Check to make sure that the response variable has only two or only a few (usually 10 or fewer) distinct values, or that events/trials syntax is used in the MODEL statement. To check, run PROC FREQ on the response variable. Here is an example:
proc freq data=mydata nlevels;
table y / noprint;
run;
The response variable should never be continuous. Attempting to model a continuous response can easily require excessive time and memory because it adds many intercept parameters to the model. Other modeling procedures (REG, GLM, GLIMMIX, GENMOD, etc.) should be considered if the response is continuous.
- Too many parameters
- The time and memory required can be large if there are many variables or effects in the model. Even if relatively few variables are specified, the presence of CLASS variables (that is, variables listed in the CLASS statement) or the specification of many interaction or nested effects can result in a large number of model parameters. This issue can occur with a deceptively short MODEL statement.
For example, even if there is only one variable in the model, the number of parameters is large if the number of levels of the variable is large and the variable is a CLASS variable. To check the number of levels in the CLASS variables, run these PROC FREQ statements that list all of the CLASS variables in the TABLE statement:
proc freq data=mydata nlevels;
table list-all-CLASS-variables / noprint;
run;
The following model involves only 10 variables, but the use of the vertical bars includes all possible interactions up to and including the 10-way interaction, resulting in a model with over 1,000 parameters. The same MODEL statement could result in an even larger number of parameters if some of the variables were CLASS variables with multiple levels:
proc logistic;
model y = a|b|c|d|e|f|g|h|i|j;
run;
A model that includes only main effects and two-way interactions is much smaller and frequently all that is needed. Use the @2 modifier to request this model:
proc logistic;
model y = a|b|c|d|e|f|g|h|i|j@2;
run;
When there are many candidate variables or effects, consider using the SELECTION= option to screen variables and to build a model. See comments on the use of SELECTION= later in this note.
- Large input data set
- If the input data set is too large to be held in memory, PROC LOGISTIC must spend additional time reading it from disk. As the data set size increases, there can be a large jump in execution time after exceeding the size at which the data can be held in memory. With very large data sets, such as with a very large number of observations, insufficient memory might be available. The MULTIPASS option in the PROC LOGISTIC statement can be used to reread the data as needed at the cost of increased execution time. Consider model selection via the SELECTION= option using a random subset of the data (SAS Note 22978). You can then validate the model (SAS Note 22597) on the remaining data. See comments on the use of SELECTION= later in this note. Consider using one of the multithreaded procedures mentioned above.
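One way to carry out the subset-then-validate approach described above is to draw a simple random sample with PROC SURVEYSELECT and run model selection on the sample only. The sketch below assumes a data set named mydata with a binary response y and candidate predictors x1-x20; these names, the 10% sampling rate, and the seed are illustrative:

```sas
/* Draw a simple random sample of roughly 10% of the observations.
   Data set and variable names here are illustrative. */
proc surveyselect data=mydata out=subset method=srs samprate=0.10 seed=12345;
run;

/* Run model selection on the smaller subset only */
proc logistic data=subset;
   model y(event='1') = x1-x20 / selection=stepwise sle=0.05 sls=0.05;
run;
```

The selected model can then be refit and validated on the observations not included in the sample.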
- Using full-rank parameterization
- When the CLASS statement is specified, full-rank parameterizations of CLASS variables such as the default PARAM=EFFECT or PARAM=REF are less efficient than PARAM=GLM as described in SAS Note 33354.
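For example, requesting the GLM parameterization instead of the default effect coding is a one-option change in the CLASS statement. The data set and variable names below are illustrative:

```sas
/* PARAM=GLM applies the less-than-full-rank GLM parameterization to all
   listed CLASS variables, which can be more efficient than the default
   PARAM=EFFECT. Names are illustrative. */
proc logistic data=mydata;
   class a b c / param=glm;
   model y(event='1') = a b c x1 x2;
run;
```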
- Initial parameters too far from final parameters
- When the initial parameters are far from the final parameters, the procedure might need many iterations to reach the solution. Although this generally isn't a problem when the model has a small number of parameters and the data set is not large, the time needed can become large when this is not true. You might be able to reduce the number of iterations needed, and therefore the time to fit the model, by using the following strategy: Fit the desired model using a relatively small, random subset of the data (SAS Note 22978). Make sure that all levels of all CLASS variables appear in the subset. Use the OUTEST= option to save the parameter estimates. Using a small, random subset of the data that can be held in memory should require little time and allow you to get good starting values. Now you can run PROC LOGISTIC on your full data set with the INEST= option to use the saved estimates as starting values. By doing this, you are hopefully starting close to the solution so that fewer iterations are necessary. Note that it is not possible to know in advance the number of iterations that are needed to find the solution for any given data set and model.
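The two-step strategy above can be sketched as follows. The data set, variable names, sampling rate, and seed are illustrative, and the model in both steps must be identical for the saved estimates to be usable as starting values:

```sas
/* Step 1: fit the model on a small random subset and save the
   parameter estimates with OUTEST=. Names are illustrative. */
proc surveyselect data=mydata out=subset method=srs samprate=0.05 seed=98765;
run;

proc logistic data=subset outest=est;
   class a b / param=glm;
   model y(event='1') = a b x1 x2;
run;

/* Step 2: fit the same model on the full data, using the saved
   estimates as starting values via INEST= */
proc logistic data=mydata inest=est;
   class a b / param=glm;
   model y(event='1') = a b x1 x2;
run;
```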
- Model selection using SELECTION= with many variables
- Using the SELECTION= option can also require a great deal of time. SELECTION=BACKWARD, while most efficient when used with the FAST option, is often not a possibility in this situation because it starts with the largest, most difficult model to fit. Use FORWARD or STEPWISE instead. The time that is needed depends on the number of candidate effects and the entry and removal criteria (SLE= and SLS=). As you add candidate effects, increase the SLE= setting, or decrease the SLS= setting, you tend to increase the number of steps in the selection process, which requires more time. And as the models get larger in subsequent steps, the memory that is needed to fit each model increases.
The time that is needed also depends on the data itself. Two data sets with the same number of variables and observations, using the same entry and removal criteria, can require vastly different numbers of steps in the selection process. For example, one data set might have only one significant variable, requiring one step, while another data set might have many significant variables that are correlated, requiring many entry and removal steps.
If there are a large number of variables (such as over 30), use the SELECTION=FORWARD, STOP=1, and DETAILS options together. The Analysis of Effects Eligible for Entry table enables you to efficiently eliminate variables that have no association with the response. You can then select a model from those that remain. Also, try to remove any variable that is highly correlated with any other variable, because it does not contribute anything new to the model. When there are fewer than about 60 candidate predictors, SELECTION=SCORE is more efficient than SELECTION=STEPWISE, because no model fitting is done. Only the global model score statistic is given for each possible model.
When you are using the SELECTION= option, several additional options can be useful in limiting the number of models considered and reducing the time and memory used. See the descriptions of the BEST=, INCLUDE=, SEQUENTIAL, START=, STOP=, and STOPRES options in the MODEL statement. An alternative method that might be faster is the LASSO method available in PROC HPGENSELECT beginning in SAS® 9.4M3 (TS1M3). Another alternative is the method based on adaptive splines that is available in PROC ADAPTIVEREG.
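The screening step described above, which stops after a single forward-selection step and prints the score test for every candidate effect, can be sketched as follows. The data set and variable names are illustrative:

```sas
/* One forward step (STOP=1) with DETAILS prints the "Analysis of
   Effects Eligible for Entry" table, showing a score test for each
   candidate effect. Names are illustrative. */
proc logistic data=mydata;
   model y(event='1') = x1-x50 / selection=forward stop=1 details;
run;
```

Effects with large p-values in that table can be dropped before running a full stepwise selection on the reduced candidate set.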
- Using the EXACT statement
- Exact logistic regression should be used only with relatively small data sets. The Monte Carlo method (specify the EXACTOPTIONS METHOD=NETWORKMC; statement) can be used for small to moderate-size problems. Note that these methods usually produce more conservative results, so they are mostly needed when sample sizes are small and the p-values from the usual (asymptotic) tests are less than 0.10. If the usual p-values are larger than 0.15, the exact results are likely to be about the same. If insufficient memory is available, you can try specifying the EXACTOPTIONS ONDISK; statement to use disk space rather than memory at the cost of additional execution time.
Exact logistic regression is an extremely memory- and computation-intensive method, and it is not possible to know in advance how much time or memory a given problem will require. Specify the EXACTOPTIONS STATUSTIME=x; statement in order to have a status line printed to the SAS log every x seconds. See Computational Resources for Exact Logistic Regression in the Details section of the PROC LOGISTIC documentation (SAS Note 22930) for additional discussion of time and memory requirements and suggestions for minimizing them.
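Putting the options above together, a Monte Carlo exact analysis with periodic status reporting might look like the following sketch. The data set and variable names are illustrative:

```sas
/* METHOD=NETWORKMC requests the Monte Carlo method; STATUSTIME=60
   writes a progress line to the log every 60 seconds.
   Names are illustrative. */
proc logistic data=mydata exactoptions(method=networkmc statustime=60);
   class trt / param=ref;
   model y(event='1') = trt x1;
   exact trt / estimate=both;
run;
```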
- Using the STRATA statement (without EXACT)
- The method needed to fit the conditional model is similar to the method used for exact analysis and is more computationally intensive than for the unconditional model, so increased execution time can be expected. If the OUTPUT statement is also specified to request additional statistics (such as predicted probabilities or regression diagnostics), the algorithm required for computing those statistics is itself intensive and can greatly increase the time needed with conditional models. The time needed grows with the data set size and the sizes of the strata. Omitting the OUTPUT statement avoids this additional time. The CHECK=ALL option in the STRATA statement can also add a large amount of time; to avoid it, use the default CHECK=COVARIATES or CHECK=NONE. The conditional model can also be fit in PROC PHREG and might require less time. See the example in the PROC PHREG documentation (SAS Note 22930).
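A minimal conditional (stratified) analysis, with no OUTPUT statement so that the additional computation it triggers is avoided, might look like this sketch. The data set, the matched-set identifier pairid, and the other names are illustrative:

```sas
/* Conditional logistic regression for matched data. The STRATA
   statement identifies the matched sets; the default CHECK=COVARIATES
   is used. Names are illustrative. */
proc logistic data=matched;
   strata pairid;
   model case(event='1') = x1 x2;
run;
```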
- Requesting ODS graphs with too many graph elements
- If the data set has a large number of observations that cause the requested graph to display a point or a line for each observation, producing the graph can require a large amount of time or memory. However, a graph with such a large number of elements would be so crowded as to generally be unhelpful. Use any available options to create a graph with a limited number of graph elements. Also note that specifying an option such as PLOTS=EFFECT in many procedures produces more than just the requested graph but also generates several other default graphs. Even if the requested graph does not contain many elements, one or more of the default graphs might. To produce only the requested graph, use the ONLY option. For example, PLOTS(ONLY)=EFFECT.
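For example, the ONLY modifier can be combined with a plot request as in the following sketch, which produces the effect plot and suppresses the other default graphs. The data set and variable names are illustrative:

```sas
/* PLOTS(ONLY)=EFFECT produces only the requested effect plot rather
   than the full set of default graphs. Names are illustrative. */
ods graphics on;
proc logistic data=mydata plots(only)=effect;
   class a / param=ref;
   model y(event='1') = a x1;
run;
ods graphics off;
```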
Type: Usage Note
Priority: low
Topic: SAS Reference ==> Procedures ==> LOGISTIC; Analytics ==> Categorical Data Analysis
Date Modified: 2024-02-15 15:49:28
Date Created: 2002-12-16 10:56:38