Shared Statistical Concepts


CLASS Statement

  • CLASS variable <(options)> ... <variable <(options)>> </ global-options>;

The CLASS statement names the classification variables to be used as explanatory variables in the analysis. These variables enter the analysis not through their values, but through levels to which the unique values are mapped. For more information about these mappings, see the section Levelization of Classification Variables.

If a CLASS statement is specified, it must precede the MODEL statement in high-performance statistical procedures that support a MODEL statement.

If the procedure permits a classification variable as a response (dependent variable or target), the response does not need to be specified in the CLASS statement.

You can specify options either as individual variable options or as global-options. You can specify options for each variable by enclosing the options in parentheses after the variable name. You can also specify global-options for the CLASS statement by placing them after a slash (/). Global-options are applied to all the variables that are specified in the CLASS statement. If you specify more than one CLASS statement, the global-options that are specified in any one CLASS statement apply to all CLASS statements. However, individual CLASS variable options override the global-options.
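As a minimal sketch of how the two kinds of options interact, the following step sets ORDER= as a global-option and overrides it for one variable. The data set work.survey and the response y are hypothetical names, and PROC HPLOGISTIC is used only as one example of a procedure that accepts CLASS options:

```sas
/* work.survey and y are hypothetical names used for illustration */
proc hplogistic data=work.survey;
   /* ORDER=FREQ after the slash is a global-option for all CLASS variables; */
   /* the option in parentheses overrides it for gender only                 */
   class temp gender(order=internal) / order=freq;
   model y = temp gender;
run;
```

Under the rules described above, temp would be ordered by descending frequency count, whereas gender would be ordered by its unformatted values.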

You can specify the following values for either an option or a global-option (except for the HPLMIXED procedure, which does not support options in this statement):

DESCENDING
DESC

reverses the sort order of the classification variable. If both the DESCENDING and ORDER= options are specified, high-performance statistical procedures order the categories according to the ORDER= option and then reverse that order.

ORDER=DATA | FORMATTED | INTERNAL
ORDER=FREQ | FREQDATA | FREQFORMATTED | FREQINTERNAL

specifies the sort order for the levels of classification variables. This ordering determines which parameters in the model correspond to each level in the data. By default, ORDER=FORMATTED. For ORDER=FORMATTED and ORDER=INTERNAL, the sort order is machine-dependent. When ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values.

The following table shows how high-performance statistical procedures interpret values of the ORDER= option.

Value of ORDER=    Levels Sorted By

DATA               Order of appearance in the input data set
FORMATTED          External formatted values, except for numeric variables
                   that have no explicit format, which are sorted by their
                   unformatted (internal) values
FREQ               Descending frequency count (levels that have more
                   observations come earlier in the order)
FREQDATA           Order of descending frequency count, and within counts by
                   order of appearance in the input data set when counts
                   are tied
FREQFORMATTED      Order of descending frequency count, and within counts by
                   formatted value when counts are tied
FREQINTERNAL       Order of descending frequency count, and within counts by
                   unformatted (internal) value when counts are tied
INTERNAL           Unformatted value

For more information about sort order, see the chapter about the SORT procedure in Base SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts.
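The following sketch illustrates how ORDER= and DESCENDING can change the level order. It assumes a hypothetical data set work.survey that contains a binary response y and a numeric variable tempcode coded 1, 2, and 3. The format name tempfmt is also an assumption made for this example:

```sas
/* Hypothetical format that labels the numeric codes */
proc format;
   value tempfmt 1='cold' 2='warm' 3='hot';
run;

/* Attach the format; work.survey and tempcode are assumed names */
data work.survey_fmt;
   set work.survey;
   format tempcode tempfmt.;
run;

proc hplogistic data=work.survey_fmt;
   /* ORDER=INTERNAL sorts by the stored values 1, 2, 3;  */
   /* DESCENDING then reverses that order                 */
   class tempcode(order=internal descending);
   model y = tempcode;
run;
```

With the default ORDER=FORMATTED, the levels would be sorted by their formatted values (cold, hot, warm). With ORDER=INTERNAL they follow the stored values (cold, warm, hot), and adding DESCENDING reverses that order (hot, warm, cold).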

REF='level' | keyword
REFERENCE='level' | keyword

specifies the reference level that is used when you specify PARAM=REFERENCE. For an individual (but not a global) variable REF= option, you can specify the level of the variable to use as the reference level. Specify the formatted value of the variable if a format is assigned. For a REF= option or global-option, you can use one of the following keywords. The default is REF=LAST.

FIRST

designates the first ordered level as reference.

LAST

designates the last ordered level as reference.

If you choose a reference level for any CLASS variable, all variables are parameterized in the reference parameterization for computational efficiency. In other words, high-performance statistical procedures apply a single parameterization method to all classification variables.

Suppose that the variable temp has three levels (hot, warm, and cold) and that the variable gender has two levels (M and F). The following statements fit a logistic regression model:


proc hplogistic;
   class gender(ref='F') temp;
   model y = gender gender*temp;
run;

Both CLASS variables are in reference parameterization in this model. The reference levels are F for the variable gender and warm for the variable temp, because the statements are equivalent to the following statements:


proc hplogistic;
   class gender(ref='F') temp(ref=last);
   model y = gender gender*temp;
run;

SPLIT

requests that the columns of the design matrix that correspond to any effect that contains a split classification variable can be selected to enter or leave a model independently of the other design columns of that effect. This option is specific to the HPREG procedure.

Suppose that the variable temp has three levels (hot, warm, and cold), that the variable gender has two levels (M and F), and that the variables are used in a PROC HPREG run as follows:


proc hpreg;
   class temp gender / split;
   model y = gender gender*temp;
run;

The two effects in the MODEL statement are split into eight independent effects. The effect "gender" is split into two effects that are labeled "gender_M" and "gender_F". The effect "gender*temp" is split into six effects that are labeled "gender_M*temp_hot", "gender_F*temp_hot", "gender_M*temp_warm", "gender_F*temp_warm", "gender_M*temp_cold", and "gender_F*temp_cold". The previous PROC HPREG step is equivalent to the following:


proc hpreg;
   model y = gender_M gender_F
             gender_M*temp_hot  gender_F*temp_hot
             gender_M*temp_warm gender_F*temp_warm
             gender_M*temp_cold gender_F*temp_cold;
run;

The SPLIT option can be used on individual classification variables. For example, consider the following PROC HPREG step:


proc hpreg;
   class temp(split) gender;
   model y = gender gender*temp;
run;

In this case, the effect "gender" is not split and the effect "gender*temp" is split into three effects, which are labeled "gender*temp_hot", "gender*temp_warm", and "gender*temp_cold". Furthermore, each of these three split effects now has two parameters that correspond to the two levels of the variable gender. The PROC HPREG step is equivalent to the following:


proc hpreg;
   class gender;
   model y = gender gender*temp_hot gender*temp_warm gender*temp_cold;
run;

You can specify the following global-options:

MISSING

treats missing values (".", ".A", …, ".Z" for numeric variables and blanks for character variables) as valid values for the CLASS variable.

If you do not specify the MISSING option, observations that have missing values for CLASS variables are removed from the analysis, even if the CLASS variables are not used in the model formulation.
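As an illustrative sketch (the data set work.survey and the response y are hypothetical), the following step requests that missing values of temp and gender form their own classification levels instead of causing observations to be dropped:

```sas
/* work.survey and y are hypothetical names used for illustration */
proc hpreg data=work.survey;
   /* MISSING is a global-option, so it is placed after the slash */
   class temp gender / missing;
   model y = temp gender;
run;
```

Without the MISSING option, an observation whose value of temp or gender is missing would be excluded from this analysis.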

PARAM=keyword

specifies the parameterization method for the classification variable or variables. You can specify the following keywords:

GLM

specifies a less-than-full-rank reference cell coding. This parameterization is used in, for example, the GLM, MIXED, and GLIMMIX procedures in SAS/STAT.

REFERENCE

specifies a reference cell encoding. You can choose the reference value by specifying an option for a specific variable or set of variables in the CLASS statement, or designate the first or last ordered value by specifying a global-option. The default is REF=LAST.

For example, suppose that the variable temp has three levels (hot, warm, and cold), that the variable gender has two levels (M and F), and that the variables are used in a CLASS statement as follows:


   class gender(ref='F') temp / param=ref;

Then F is used as the reference level for gender and warm is used as the reference level for temp.
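For comparison, the following sketch uses the REF= keyword as a global-option instead of naming a level for an individual variable. It assumes the same variables, with no formats assigned and the default ORDER=FORMATTED in effect:

```sas
   /* REF=FIRST after the slash applies to every CLASS variable */
   class gender temp / param=ref ref=first;
```

Under these assumptions, the first ordered levels would serve as the reference levels: F for gender and cold for temp.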

The GLM parameterization is the default. For more information about how parameterization of classification variables affects the construction and interpretation of model effects, see the section Specification and Parameterization of Model Effects.

TRUNCATE<=n>

specifies the truncation width of formatted values of CLASS variables when the optional n is specified.

If n is not specified, the TRUNCATE option requests that classification levels be determined by using no more than the first 16 characters of the formatted values of CLASS variables.
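The following sketch shows one way the TRUNCATE= option might be used. The data set work.stores, the response sales, and the character variable branch are hypothetical names introduced only for this example:

```sas
/* work.stores, sales, and branch are hypothetical names */
proc hpreg data=work.stores;
   /* Determine levels from the first 11 characters of the formatted values */
   class branch / truncate=11;
   model sales = branch;
run;
```

If branch contained values such as "Springfield North" and "Springfield South", the description above suggests that both would be assigned to a single level, because their first 11 characters are identical.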