The QUANTSELECT Procedure

PROC QUANTSELECT Statement

  • PROC QUANTSELECT <options>;

Table 96.1 lists the options available in the PROC QUANTSELECT statement.

Table 96.1: PROC QUANTSELECT Statement Options

Option              Description

Data Set Options
DATA=               Names a data set to use for the regression
MAXMACRO=           Sets the maximum number of macro variables to produce
TESTDATA=           Names a data set that contains test data
VALDATA=            Names a data set that contains validation data

ODS Graphics Options
PLOTS=              Produces ODS Graphics displays

Other Options
ALGORITHM=          Specifies an algorithm for estimating the regression parameters
NAMELEN=            Specifies the maximum length of effect names in tables and output data sets
NOPRINT             Suppresses displayed output (including plots)
OUTDESIGN=          Names a data set that contains the design matrix
PARMLABELSTYLE=     Sets the style of parameter names and labels for nested and crossed effects
SEED=               Sets the seed used for pseudorandom number generation


You can specify the following options (shown in alphabetical order) in the PROC QUANTSELECT statement.

ALGORITHM=SIMPLEX | SMOOTH

specifies either the simplex algorithm (ALGORITHM=SIMPLEX) or the smoothing algorithm (ALGORITHM=SMOOTH) for estimating the regression parameters. The smoothing algorithm is computationally much more efficient than the simplex algorithm for fitting models on large data sets. Consider specifying ALGORITHM=SMOOTH if your DATA= data set contains more than 5,000 observations and more than 50 regressors. The smoothing algorithm does not support quantile process effect selection or the LASSO selection method. By default, ALGORITHM=SIMPLEX.
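
As a minimal sketch of requesting the smoothing algorithm, the following statements simulate a data set that exceeds both size guidelines and then fit a median regression with forward selection. The data set name work.big, the variable names, and the simulated coefficients are assumptions made for illustration only. Forward selection is used because the smoothing algorithm does not support the LASSO method.

```sas
/* Hypothetical data: 10,000 observations and 60 regressors (x1-x60) */
data work.big;
   call streaminit(1);
   array x{60};
   do i = 1 to 10000;
      do j = 1 to 60;
         x{j} = rand('normal');
      end;
      y = 2*x{1} - x{3} + rand('normal');
      output;
   end;
   drop i j;
run;

/* Request the smoothing algorithm instead of the default simplex algorithm */
proc quantselect data=work.big algorithm=smooth;
   model y = x1-x60 / quantile=0.5 selection=forward;
run;
```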

DATA=SAS-data-set

names the SAS data set to be used by PROC QUANTSELECT. If the DATA= option is not specified, PROC QUANTSELECT uses the most recently created SAS data set. If the data set contains a variable named _ROLE_, then this variable is used to assign observations to training, validation, and testing roles. See the section Using Validation and Test Data for more information about using the _ROLE_ variable.

MAXMACRO=n

specifies the maximum number of macro variables with selected effects to create. By default, MAXMACRO=100.

PROC QUANTSELECT saves the list of selected effects in a macro variable, &_QRSIND. For example, suppose your input effect list consists of x1-x10. Then &_QRSIND would be set to x1 x3 x4 x10 if the first, third, fourth, and tenth effects were selected for the model. This list can be used in the MODEL statement of a subsequent procedure.
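
The following sketch illustrates one way to reuse the list in a subsequent procedure: it selects effects for the median and then refits the selected model with PROC QUANTREG. The data set work.train and its simulated contents are hypothetical, and the sketch assumes a single quantile level and no BY processing, so that &_QRSIND holds the only selected model.

```sas
/* Hypothetical training data with ten candidate regressors */
data work.train;
   call streaminit(2024);
   array x{10};
   do i = 1 to 500;
      do j = 1 to 10;
         x{j} = rand('normal');
      end;
      y = 1 + 2*x{1} - x{3} + rand('normal');
      output;
   end;
   drop i j;
run;

/* Effect selection; the selected effects are saved in &_QRSIND */
proc quantselect data=work.train;
   model y = x1-x10 / quantile=0.5 selection=forward;
run;

/* Write the selected effects to the log */
%put Selected effects: &_QRSIND;

/* Refit the selected model in a subsequent procedure */
proc quantreg data=work.train;
   model y = &_QRSIND / quantile=0.5;
run;
```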

If you specify the OUTDESIGN= option in the PROC QUANTSELECT statement, then PROC QUANTSELECT saves the list of columns in the design matrix in a macro variable named &_QRSMOD.

With multiple quantile levels and BY processing, one macro variable is created for each combination of quantile level and BY group, and the macro variables are indexed by the BY-group number and the quantile-level index. You can use the MAXMACRO= option to either limit or increase the number of these macro variables when you are processing data sets with many combinations of quantile level and BY group.

With a single quantile level and no BY-group processing, PROC QUANTSELECT creates the macro variables shown in Table 96.2.

Table 96.2: Macro Variables Created for a Single Quantile Level and No BY Processing

Macro Variable Name    Contains

_QRSIND                Selected effects
_QRSIND1               Selected effects
_QRSINDT1              Selected effects
_QRSIND1T1             Selected effects
_QRSMOD                Selected design matrix columns
_QRSMOD1               Selected design matrix columns
_QRSMODT1              Selected design matrix columns
_QRSMOD1T1             Selected design matrix columns


With multiple quantile levels and BY-group processing, PROC QUANTSELECT creates the macro variables shown in Table 96.3.

Table 96.3: Macro Variables Created for Multiple Quantile Levels and BY-Group Processing

Macro Variable Name    Contains

_QRSIND                Selected effects for quantile 1 and BY group 1
_QRSINDT1              Selected effects for quantile 1 and BY group 1
_QRSINDT2              Selected effects for quantile 2 and BY group 1
...
_QRSIND1               Selected effects for quantile 1 and BY group 1
_QRSIND1T1             Selected effects for quantile 1 and BY group 1
_QRSIND1T2             Selected effects for quantile 2 and BY group 1
...
_QRSIND2               Selected effects for quantile 1 and BY group 2
_QRSIND2T1             Selected effects for quantile 1 and BY group 2
_QRSIND2T2             Selected effects for quantile 2 and BY group 2
...
_QRSINDmTn             Selected effects for quantile n and BY group m


If you specify the OUTDESIGN= option, PROC QUANTSELECT also creates the macro variables shown in Table 96.4.

Table 96.4: Macro Variables Created When the OUTDESIGN= Option Is Specified

Macro Variable Name    Contains

_QRSMOD                Selected design matrix columns for BY group 1
_QRSMOD1               Selected design matrix columns for BY group 1
_QRSMOD2               Selected design matrix columns for BY group 2
...
_QRSMODmTn             Selected design matrix columns for quantile n and BY group m


The macro variables in Table 96.5 show the number of quantiles and BY groups:

Table 96.5: Macro Variables Showing the Number of Quantiles and BY Groups

Macro Variable Name    Contains

_QRSNUMBYS             The number of BY groups
_QRSNUMTAUS            The number of quantiles
_QRSBY1NUMTAUS         The number of _QRSIND1Tj macro variables actually made
_QRSBY2NUMTAUS         The number of _QRSIND2Tj macro variables actually made
...
_QRSNUMBYTAUS          The number of _QRSINDiTj macro variables actually made.
                       This value can be less than _QRSNUMBYS $\times$ _QRSNUMTAUS,
                       and it is less than or equal to MAXMACRO=n.


See the section Macro Variables That Contain Selected Models for more information.
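
The indexed macro variables can be read in a macro loop, as in the following sketch. The data set work.byq, the BY variable region, and the macro name show_selected are hypothetical. The sketch assumes that the macro variables are available in the global symbol table, and it checks each name with %SYMEXIST because the MAXMACRO= limit can leave some combinations undefined.

```sas
/* Hypothetical data: two BY groups, already ordered by region */
data work.byq;
   call streaminit(7);
   array x{10};
   do region = 1 to 2;
      do i = 1 to 300;
         do j = 1 to 10;
            x{j} = rand('normal');
         end;
         y = region*x{1} - x{3} + rand('normal');
         output;
      end;
   end;
   drop i j;
run;

/* Two quantile levels and two BY groups */
proc quantselect data=work.byq;
   by region;
   model y = x1-x10 / quantile=0.25 0.75 selection=forward;
run;

/* Write each selected model to the log */
%macro show_selected;
   %local i j;
   %do i = 1 %to &_QRSNUMBYS;
      %do j = 1 %to &_QRSNUMTAUS;
         /* Skip combinations that were not created */
         %if %symexist(_QRSIND&i.T&j) %then
            %put BY group &i, quantile &j: &&_QRSIND&i.T&j;
      %end;
   %end;
%mend show_selected;

%show_selected
```

The reference &&_QRSIND&i.T&j resolves in two passes: the first pass builds a name such as &_QRSIND1T2, and the second pass resolves that name to its list of effects.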

NAMELEN=number

specifies the maximum length of effect names. By default, NAMELEN=20. If you specify a value less than 20, the default is used.

NOPRINT

suppresses all displayed output (including plots).

OUTDESIGN<(options)><=SAS-data-set>

creates a data set that contains the design matrix. By default, the QUANTSELECT procedure includes in the OUTDESIGN data set the $\mathbf{X}$ matrix that corresponds to all the effects in the selected models. Two schemes for naming the columns of the design matrix are available:

  • In the first scheme, names of the parameters are constructed from the parameter labels that appear in the parameter estimates table. This naming scheme is the default when you do not request BY processing, or when you specify the FULLMODEL option with BY processing.

  • In the second scheme, the design matrix column names consist of a prefix followed by an index. The default name prefix is _X. This scheme is used when you specify the PREFIX= option, or when you specify a BY statement without using the FULLMODEL option; otherwise the first scheme is used.

You can specify the following options in parentheses to control the contents of the OUTDESIGN= data set:

ADDINPUTVARS

includes all the input data set variables in the OUTDESIGN= data set.

ADDVALDATA

includes the VALDATA= data set observations in the OUTDESIGN= data set. This option is ignored if the VALDATA= data set is not specified.

ADDTESTDATA

includes the TESTDATA= data set observations in the OUTDESIGN= data set. This option is ignored if the TESTDATA= data set is not specified.

FULLMODEL

includes in the OUTDESIGN= data set parameters that correspond to all effects that are specified in the MODEL statement. By default, only parameters that correspond to the selected model are included.

NAMES

produces a table that associates columns in the OUTDESIGN= data set with the labels of the parameters they represent.

PREFIX<=prefix>

creates the design matrix column names from a prefix followed by an index. The default prefix is _X.
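
As an illustrative sketch, the following statements save the design matrix together with the input variables and request the table that maps design columns to parameter labels. The data set names work.train and work.design and the simulated data are assumptions, not part of the procedure.

```sas
/* Hypothetical training data with ten candidate regressors */
data work.train;
   call streaminit(2024);
   array x{10};
   do i = 1 to 500;
      do j = 1 to 10;
         x{j} = rand('normal');
      end;
      y = 1 + 2*x{1} - x{3} + rand('normal');
      output;
   end;
   drop i j;
run;

/* Save the design matrix, keep the input variables, and list the column names */
proc quantselect data=work.train outdesign(addinputvars names)=work.design;
   model y = x1-x10 / quantile=0.5 selection=forward;
run;

/* The selected design matrix columns are saved in &_QRSMOD */
%put Selected design columns: &_QRSMOD;

/* Inspect the first few rows of the design data set */
proc print data=work.design(obs=5);
run;
```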

PARMLABELSTYLE=options

specifies how parameter names and labels are constructed for nested and crossed effects.

The following options are available:

INTERLACED <(SEPARATOR=quoted string)>

forms parameter names and labels by positioning levels of classification variables and constructed effects adjacent to the associated variable or constructed effect name and using " * " as the delimiter for both crossed and nested effects. This style of naming parameters and labels is used in the TRANSREG procedure. You can request truncation of the classification variable names used in forming the parameter names and labels by using the CPREFIX= and LPREFIX= options in the CLASS statement. You can use the SEPARATOR= suboption to change the delimiter between the crossed variables in the effect. PARMLABELSTYLE=INTERLACED is not supported if you specify the SPLIT option in an EFFECT statement or a CLASS statement. The following are examples of the parameter labels in this style (Age is a continuous variable, Gender and City are classification variables):

 Age
 Gender male * City Beijing
 City London * Age

SEPARATE

specifies that in forming parameter names and labels, the effect name appears before the levels associated with the classification variables and constructed effects in the effect. You can control the length of the effect name by using the NAMELEN= option in the PROC QUANTSELECT statement. In forming parameter labels, the first level that is displayed is positioned so that it starts at the same offset in every parameter label. This alignment enables you to easily distinguish the effect name from the levels when the parameter labels are displayed in a column in the "Parameter Estimates" table. The following are examples of the parameter labels in this style (Age is a continuous variable, Gender and City are classification variables):

 Age
 Gender*City male Beijing
 Age*City    London

SEPARATECOMPACT

requests the same parameter naming and labeling scheme as PARMLABELSTYLE=SEPARATE except that the first level in the parameter label is separated from the effect name by a single blank. This style of labeling is used in the PLS procedure and is the default if you do not specify the PARMLABELSTYLE= option. The following are examples of the parameter labels in this style (Age is a continuous variable, Gender and City are classification variables):

 Age
 Gender*City male Beijing
 Age*City London
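
The following sketch shows one way to request the interlaced style with a custom delimiter. The data set work.people and its simulated values are hypothetical, and the separator ' x ' is an arbitrary choice for illustration. SELECTION=NONE is used so that every parameter of the specified model appears in the "Parameter Estimates" table.

```sas
/* Hypothetical data: Age is continuous; Gender and City are classification variables */
data work.people;
   call streaminit(11);
   length Gender $6 City $8;
   do i = 1 to 400;
      if rand('uniform') < 0.5 then Gender = 'male';
      else Gender = 'female';
      if rand('uniform') < 0.5 then City = 'Beijing';
      else City = 'London';
      Age = 20 + 40*rand('uniform');
      y = 0.1*Age + (Gender = 'male') + rand('normal');
      output;
   end;
   drop i;
run;

/* Interlaced parameter labels with ' x ' in place of the default ' * ' delimiter */
proc quantselect data=work.people parmlabelstyle=interlaced(separator=' x ');
   class Gender City;
   model y = Age Gender*City Age*City / quantile=0.5 selection=none;
run;
```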

PLOTS | PLOT <(global-plot-options)> <=plot-request <(options)>>
PLOTS | PLOT <(global-plot-options)> <=(plot-request <(options)> <... plot-request <(options)>>)>

controls the plots that are produced through ODS Graphics. When you specify only one plot-request, you can omit the parentheses around it. Here are some examples:

plots=all
plots=coefficients(unpack)
plots(unpack)=(coef acl crit)

ODS Graphics must be enabled before plots can be requested. For example:

ods graphics on;
proc quantselect plots=all;
   class temp sex / split;
   model depVar = sex sex*temp;
run;

For more information about enabling and disabling ODS Graphics, see the section Enabling and Disabling ODS Graphics in Chapter 21: Statistical Graphics Using ODS.

You can specify the following global-plot-options, which apply to all plots generated by the QUANTSELECT procedure, unless they are altered by specific plot options.

ENDSTEP=n

specifies that the step ranges shown on the horizontal axes of plots terminate at the specified step. By default, the step range shown terminates at the final step of the selection process. If you specify the ENDSTEP= option as both a global-plot-option and as an option for a specific plot-request, then PROC QUANTSELECT uses the ENDSTEP=n option for the specific plot-request.

LOGP | LOGPVALUE

displays the natural logarithm of the entry and removal significance levels when the SELECT=SL option is specified in the MODEL statement.

MAXSTEPLABEL=n

specifies the maximum number of characters beyond which labels of effects on plots are truncated. The default is MAXSTEPLABEL=256.

MAXPARMLABEL=n

specifies the maximum number of characters beyond which parameter labels on plots are truncated. The default is MAXPARMLABEL=256.

STARTSTEP=n

specifies that the step ranges shown on the horizontal axes of plots start at the specified step. By default, the step range shown starts at the initial step of the selection process. If you specify the STARTSTEP= option as both a global-plot-option and as an option for a specific plot-request, then PROC QUANTSELECT uses the STARTSTEP=n option for the specific plot-request. The default is STARTSTEP=0.

STEPAXIS=EFFECT | NORMB | NUMBER

specifies the method for labeling the horizontal plot axis. This axis represents the sequence of entering or departing effects. The default is STEPAXIS=EFFECT.

STEPAXIS=EFFECT

labels each step by a prefix followed by the name of the effect that enters or leaves at that step. The prefix consists of the step number followed by a "+" sign or a "–" sign, depending on whether the effect enters or leaves at that step.

STEPAXIS=NORMB

labels the horizontal axis value at step i with the penalty on the parameter estimates at step i, normalized by the penalty on the parameter estimates at the final step. This option is valid only with regularization selection methods.

STEPAXIS=NUMBER

labels each step with the step number.

UNPACK

displays each graph separately. (By default, some graphs can appear together in a single panel.) You can also specify UNPACK as a suboption with the CRITERIA and COEFFICIENTS options for specific plot-requests.

The following list describes the specific plot-requests and their options.

ALL

displays all appropriate graphs.

ACL | ACLPLOT <(aclplot-option)>

plots the progression of the average check losses on the training data, and on the test and validation data when these data are provided with the TESTDATA= or VALDATA= options or are produced by using a PARTITION statement. When PROC QUANTSELECT is applied to multiple quantile levels, the ACL option and its suboptions apply to the ACL plots for each of the quantile levels.

You can specify the following aclplot-option:

STEPAXIS=EFFECT | NORMB | NUMBER

specifies the method for labeling the horizontal plot axis. See the STEPAXIS= option in the global-plot-options for more information.

COEF | COEFFICIENTS | COEFFICIENTPANEL <(coefficient-panel-options)>

displays a panel of two plots for each quantile level. The upper plot shows the progression of the parameter values as the selection process proceeds. The lower plot shows the progression of the CHOOSE= criterion. If no CHOOSE= criterion is in effect, then the AICC criterion is displayed. You can specify the following coefficient-panel-options:

LABELGAP=percentage

specifies the percentage of the vertical axis range that forms the minimum gap between successive parameter labels at the final step of the coefficient progression plot. If the values of more than one parameter at the final step are closer than this gap, then the labels on all but one of these parameters are suppressed. The default is LABELGAP=5.

LOGP | LOGPVALUE

displays the natural logarithm of the entry and removal significance levels when the SELECT=SL option is specified in the MODEL statement.

STEPAXIS=EFFECT | NORMB | NUMBER

specifies the horizontal axis to be used. See the STEPAXIS= option in the global-plot-options for more information.

UNPACK | UNPACKPANEL

displays the coefficient progression and the CHOOSE= criterion progression in separate plots.

CRIT | CRITERIA | CRITERIONPANEL <(criterion-panel-options)>

plots a panel of model fit criteria. If multiple quantile levels apply, the CRITERIA option plots a panel of model fit criteria for each quantile level. The criteria that are displayed are AIC, AICC, and SBC, in addition to any other criteria that are named in the CHOOSE=, SELECT=, STOP=, and STATS= options in the MODEL statement. You can specify the following criterion-panel-options:

STEPAXIS=EFFECT | NORMB | NUMBER

specifies the horizontal axis to be used. See the STEPAXIS= option in the global-plot-options for more information.

UNPACK | UNPACKPANEL

displays each criterion progression on a separate plot.

NONE

suppresses all plots.
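
The interaction between global-plot-options and options on a specific plot-request can be sketched as follows. In this example, the global STEPAXIS=NUMBER setting applies to the criterion panel, while the ACL plot overrides it with STEPAXIS=EFFECT. The data set work.train and the simulated model are hypothetical.

```sas
/* Hypothetical training data with ten candidate regressors */
data work.train;
   call streaminit(2024);
   array x{10};
   do i = 1 to 500;
      do j = 1 to 10;
         x{j} = rand('normal');
      end;
      y = 1 + 2*x{1} - x{3} + 0.5*x{5} + rand('normal');
      output;
   end;
   drop i j;
run;

ods graphics on;

/* Global option: label steps by number.                      */
/* ACL plot: override the global setting and label by effect. */
/* Criterion panel: show each criterion in a separate plot.   */
proc quantselect data=work.train
                 plots(stepaxis=number)=(acl(stepaxis=effect) crit(unpack));
   model y = x1-x10 / quantile=0.5 selection=forward;
run;
```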

SEED=number

specifies an integer that is used to start the pseudorandom number generator for random partitioning of data for training, testing, and validation. If you do not specify a seed or if you specify a value less than or equal to 0, the seed is generated by reading the time of day from the computer’s clock.
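
For a reproducible random partition, you can fix the seed, as in the following sketch. The data set work.train, the seed value 12345, and the partition fractions are arbitrary choices for illustration. The sketch assumes that CHOOSE=VALIDATE is an accepted selection criterion when validation data are available.

```sas
/* Hypothetical data with ten candidate regressors */
data work.train;
   call streaminit(2024);
   array x{10};
   do i = 1 to 1000;
      do j = 1 to 10;
         x{j} = rand('normal');
      end;
      y = 1 + 2*x{1} - x{3} + rand('normal');
      output;
   end;
   drop i j;
run;

/* A positive seed makes the random split repeatable across runs */
proc quantselect data=work.train seed=12345;
   partition fraction(validate=0.3 test=0.2);
   model y = x1-x10 / quantile=0.5 selection=forward(choose=validate);
run;
```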

TESTDATA=SAS-data-set

names a SAS data set that contains test data. This data set must contain all the effects that are specified in the MODEL statement. Furthermore, when you also specify a BY statement and the TESTDATA= data set contains any of the BY variables, then the TESTDATA= data set must also contain all the BY variables sorted in the order of the BY variables. In this case, only the test data for a specific BY group are used with the corresponding BY group in the analysis data. If the TESTDATA= data set contains none of the BY variables, then the entire TESTDATA= data set is used with each BY group of the analysis data.

If you specify both a TESTDATA= data set and the PARTITION statement, then the testing observations from the DATA= data set are merged with the TESTDATA= data set for testing purposes.

VALDATA=SAS-data-set

names a SAS data set that contains validation data. This data set must contain all the effects that are specified in the MODEL statement. Furthermore, when a BY statement is used and the VALDATA= data set contains any of the BY variables, then the VALDATA= data set must also contain all the BY variables sorted in the order of the BY variables. In this case, only the validation data for a specific BY group are used with the corresponding BY group in the analysis data. If the VALDATA= data set contains none of the BY variables, then the entire VALDATA= data set is used with each BY group of the analysis data.

If you specify both a VALDATA= data set and the PARTITION statement, then the validation observations from the DATA= data set are merged with the VALDATA= data set for validation purposes.
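
As a sketch of supplying separate validation and test data, the following statements split simulated observations into three hypothetical data sets (work.train, work.valid, and work.test) and pass them to the procedure. All three data sets contain the response and every regressor named in the MODEL statement. The use of CHOOSE=VALIDATE is an assumption about the selection criterion that you want to apply.

```sas
/* Hypothetical data, written to training, validation, and test data sets */
data work.train work.valid work.test;
   call streaminit(99);
   array x{10};
   do i = 1 to 1000;
      do j = 1 to 10;
         x{j} = rand('normal');
      end;
      y = 1 + 2*x{1} - x{3} + rand('normal');
      if i <= 600 then output work.train;
      else if i <= 800 then output work.valid;
      else output work.test;
   end;
   drop i j;
run;

/* Choose the model by its performance on the validation data */
proc quantselect data=work.train valdata=work.valid testdata=work.test;
   model y = x1-x10 / quantile=0.5 selection=forward(choose=validate);
run;
```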