Shared Concepts and Topics
A classification variable is a variable that enters the statistical analysis or model not through its values, but through its levels. The process of associating values of a variable with levels is termed levelization.
During the process of levelization, observations that share the same value are assigned to the same level. The manner in which values are grouped can be affected by the inclusion of formats. The sort order of the levels can be determined with the ORDER= option in the procedure statement. With the GENMOD, GLMSELECT, and LOGISTIC procedures, you can also control the sorting order separately for each variable in the CLASS statement.
Consider the data on nine observations in Table 18.1. The variable A is integer valued, and the variable X is continuous with a missing value for the fourth observation. The fourth and fifth columns of Table 18.1 apply two different formats to the variable X.
Obs | A | x | format x 3.0 | format x 3.1 |
---|---|---|---|---|
1 | 1 | 1.09 | 1 | 1.1 |
2 | 1 | 1.13 | 1 | 1.1 |
3 | 1 | 1.27 | 1 | 1.3 |
4 | 2 | . | . | . |
5 | 2 | 2.26 | 2 | 2.3 |
6 | 2 | 2.48 | 2 | 2.5 |
7 | 3 | 3.34 | 3 | 3.3 |
8 | 3 | 3.34 | 3 | 3.3 |
9 | 3 | 3.14 | 3 | 3.1 |
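The two formatted columns can be reproduced outside SAS. The following Python sketch is only a rough emulation of the SAS w.d numeric format: the helper `format_wd` is hypothetical, `None` stands in for a SAS missing value, and the sketch ignores what SAS does when a value does not fit in the requested width.

```python
def format_wd(value, width, decimals):
    """Rough stand-in for the SAS w.d format (illustrative, not SAS itself)."""
    if value is None:  # None plays the role of a SAS missing value
        return "."
    # Fixed-point formatting with the requested width and decimals,
    # then drop the left padding so only the displayed digits remain.
    return f"{value:{width}.{decimals}f}".strip()

# Values of X from Table 18.1 (the fourth observation is missing).
x = [1.09, 1.13, 1.27, None, 2.26, 2.48, 3.34, 3.34, 3.14]

fmt_3_0 = [format_wd(v, 3, 0) for v in x]
fmt_3_1 = [format_wd(v, 3, 1) for v in x]

for obs, (f0, f1) in enumerate(zip(fmt_3_0, fmt_3_1), start=1):
    print(obs, f0, f1)
```

Observations 1 and 2 have distinct internal values but share the formatted value 1.1, which is why a format can merge levels.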
By default, levelization groups observations by the formatted value of the variable, except for numeric variables for which no explicit format is provided; these are grouped and sorted by their internal value. The levelization of the four columns in Table 18.1 leads to the level assignment in Table 18.2.
Obs | A: Value | A: Level | X: Value | X: Level | format x 3.0: Value | format x 3.0: Level | format x 3.1: Value | format x 3.1: Level |
---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1.09 | 1 | 1 | 1 | 1.1 | 1 |
2 | 2 | 1 | 1.13 | 2 | 1 | 1 | 1.1 | 1 |
3 | 2 | 1 | 1.27 | 3 | 1 | 1 | 1.3 | 2 |
4 | 3 | 2 | . | . | . | . | . | . |
5 | 3 | 2 | 2.26 | 4 | 2 | 2 | 2.3 | 3 |
6 | 3 | 2 | 2.48 | 5 | 2 | 2 | 2.5 | 4 |
7 | 4 | 3 | 3.34 | 7 | 3 | 3 | 3.3 | 6 |
8 | 4 | 3 | 3.34 | 7 | 3 | 3 | 3.3 | 6 |
9 | 4 | 3 | 3.14 | 6 | 3 | 3 | 3.1 | 5 |
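The level assignment for X can be checked with a short Python sketch. It is an illustration of the grouping rule described here, not the procedure's actual implementation: the function `levelize` is a hypothetical helper, `None` stands in for a SAS missing value, and formatted values are compared as plain strings.

```python
def levelize(values):
    """Map each value to a 1-based level; missing values (None) get no level."""
    distinct = sorted({v for v in values if v is not None})
    level_of = {v: i for i, v in enumerate(distinct, start=1)}
    return [level_of.get(v) for v in values]

# Values of X from Table 18.1 (the fourth observation is missing).
x = [1.09, 1.13, 1.27, None, 2.26, 2.48, 3.34, 3.34, 3.14]

# Formatted values under "format x 3.0" and "format x 3.1", emulated with
# Python fixed-point formatting.
x_3_0 = [None if v is None else f"{v:.0f}" for v in x]
x_3_1 = [None if v is None else f"{v:.1f}" for v in x]

print(levelize(x))      # no explicit format: grouped by internal value
print(levelize(x_3_0))  # grouped by the 3.0 formatted value
print(levelize(x_3_1))  # grouped by the 3.1 formatted value
```

Seven distinct internal values collapse to three levels under the 3.0 format and to six levels under the 3.1 format.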
The ORDER= option in the PROC statement specifies the sorting order for the levels of CLASS variables. When the default ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values. To order numeric class levels with no explicit format by their BEST12. formatted values, you can specify this format explicitly for the CLASS variables.
The following table shows how values of the ORDER= option are interpreted.
Value of ORDER= | Levels Sorted By |
---|---|
DATA | order of appearance in the input data set |
FORMATTED | external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value |
FREQ | descending frequency count; levels with the most observations come first in the order |
INTERNAL | unformatted value |
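The four orderings can be compared on a small example. The Python sketch below is an emulation under stated assumptions, not SAS code: `order_levels`, the `dose` values, and the `names` format are all hypothetical, strings are compared by Python's default ordering, and ties under FREQ are left in data order because the text here does not say how SAS breaks them.

```python
from collections import Counter

def order_levels(raw, order, fmt=None):
    """Return the distinct non-missing levels of `raw` in the requested order.

    `fmt` maps an internal value to its formatted value; None means that
    no explicit format was supplied.
    """
    present = [v for v in raw if v is not None]  # missing values get no level
    label = fmt if fmt is not None else (lambda v: v)
    distinct = list(dict.fromkeys(present))      # distinct values, data order
    if order == "DATA":
        ranked = distinct
    elif order == "INTERNAL" or (order == "FORMATTED" and fmt is None):
        ranked = sorted(distinct)                # sort by internal value
    elif order == "FORMATTED":
        ranked = sorted(distinct, key=label)     # sort by formatted value
    elif order == "FREQ":
        counts = Counter(label(v) for v in present)
        ranked = sorted(distinct, key=lambda v: -counts[label(v)])
    else:
        raise ValueError(f"unknown ORDER= value: {order}")
    return list(dict.fromkeys(label(v) for v in ranked))

dose = [2, 10, 10, 1, 10, 2]                  # hypothetical internal values
names = {1: "low", 2: "mid", 10: "high"}.get  # hypothetical user format

for order in ("DATA", "FORMATTED", "FREQ", "INTERNAL"):
    print(order, order_levels(dose, order, fmt=names))
```

With the format supplied, each option produces a different level order for this input. Calling `order_levels(dose, "FORMATTED")` without a format falls back to the internal values, mirroring the exception noted in the table.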
For FORMATTED and INTERNAL values, the sort order is machine dependent. For more information about sort order, see the chapter on the SORT procedure in the Base SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts.
The GLMSELECT, LOGISTIC, and GENMOD procedures support a MISSING option in the CLASS statement. When this option is in effect, missing values ('.' for a numeric variable and blanks for a character variable) are included in the levelization and are assigned a level. Table 18.3 displays the results of levelizing the values in Table 18.1 when the MISSING option is in effect.
Obs | A: Value | A: Level | X: Value | X: Level | format x 3.0: Value | format x 3.0: Level | format x 3.1: Value | format x 3.1: Level |
---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1.09 | 2 | 1 | 2 | 1.1 | 2 |
2 | 2 | 1 | 1.13 | 3 | 1 | 2 | 1.1 | 2 |
3 | 2 | 1 | 1.27 | 4 | 1 | 2 | 1.3 | 3 |
4 | 3 | 2 | . | 1 | . | 1 | . | 1 |
5 | 3 | 2 | 2.26 | 5 | 2 | 3 | 2.3 | 4 |
6 | 3 | 2 | 2.48 | 6 | 2 | 3 | 2.5 | 5 |
7 | 4 | 3 | 3.34 | 8 | 3 | 4 | 3.3 | 7 |
8 | 4 | 3 | 3.34 | 8 | 3 | 4 | 3.3 | 7 |
9 | 4 | 3 | 3.14 | 7 | 3 | 4 | 3.1 | 6 |
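The effect of the MISSING option on the level numbers can be sketched in Python. This is an illustration rather than the procedures' implementation: `levelize` and its `missing` argument are hypothetical, `None` stands in for a SAS missing value, and the missing value is placed first because that is the level it receives in Table 18.3.

```python
def levelize(values, missing=False):
    """Map values to 1-based levels.

    With missing=False, None gets no level. With missing=True, None becomes
    a level of its own that sorts ahead of all nonmissing values.
    """
    distinct = sorted({v for v in values if v is not None})
    if missing and any(v is None for v in values):
        distinct = [None] + distinct
    level_of = {v: i for i, v in enumerate(distinct, start=1)}
    return [level_of.get(v) for v in values]

# Values of X from Table 18.1 (the fourth observation is missing).
x = [1.09, 1.13, 1.27, None, 2.26, 2.48, 3.34, 3.34, 3.14]

print(levelize(x))                # missing value is not assigned a level
print(levelize(x, missing=True))  # missing value becomes level 1
```

Every nonmissing level shifts up by one when the missing value is included, which matches the X columns of Table 18.3.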
When the MISSING option is not specified, or for procedures whose CLASS statement does not support this option, it is important to understand the implications of missing values for your statistical analysis. When a SAS/STAT procedure levelizes the CLASS variables, an observation for which a CLASS variable has a missing value is excluded from the analysis. This is true regardless of whether the variable is used to form the statistical model. Consider, for example, the case where some observations contain missing values for variable A but the records for these observations are otherwise complete with respect to all other variables in the statistical models. The analysis results from the following statements do not include any observations for which variable A contains missing values, even though A is not specified in the MODEL statement:
    class A B;
    model y = B x B*x;
Many statistical procedures print a "Number of Observations" table that shows the number of observations read from the data set and the number of observations used in the analysis. You should pay careful attention to this table—especially when your data set contains missing values—to ensure that no observations are unintentionally excluded from the analysis.
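The exclusion rule and the two counts can be mimicked with a few lines of Python. The records below are invented for illustration, `None` stands in for a SAS missing value, and `usable` is a hypothetical helper. The sketch reflects only the rule stated here, that a missing value in any CLASS variable removes the observation, and additionally assumes that model variables must be nonmissing.

```python
# Hypothetical records; A is missing in two of them.
records = [
    {"A": 1,    "B": "u", "x": 0.5, "y": 2.1},
    {"A": None, "B": "u", "x": 1.5, "y": 2.9},
    {"A": 2,    "B": "v", "x": 2.5, "y": 4.2},
    {"A": None, "B": "v", "x": 3.5, "y": 5.0},
]

class_vars = ["A", "B"]       # every variable named in the CLASS statement
model_vars = ["y", "B", "x"]  # variables that appear in the MODEL statement

def usable(record):
    """An observation is used only if all CLASS and model variables are present."""
    needed = set(class_vars) | set(model_vars)
    return all(record[name] is not None for name in needed)

used = [r for r in records if usable(r)]

print("Number of Observations Read:", len(records))
print("Number of Observations Used:", len(used))
```

Two of the four records are dropped even though A does not appear in the model. Removing "A" from `class_vars` in this sketch would retain all four.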
Copyright © 2009 by SAS Institute Inc., Cary, NC, USA. All rights reserved.