|
Ordering Data Values: Procedure Defaults and Options by Jason R. Sharpe Jason R. Sharpe is a technical support analyst at SAS Institute. His areas of expertise include base SAS¨ software procedures and the macro facility. Jason has BS degrees in electrical engineering and computer engineering from North Carolina State University.
Abstract
Introduction Procedure
Defaults and Options
Understanding how each one individually affects your output makes it easier to understand their effect when you combine one or more of the conditions. Host-Specific Collating Sequences
data sample;
input x $1.;
cards;
1
a
A
z
Z
\
;
/* OUTPUT 1a-1b */
proc freq;
tables x / nopercent nocum;
run;
EBCDIC ASCII
X Frequency X Frequency
------------- -------------
a 1 1 1
z 1 A 1
A 1 Z 1
\ 1 \ 1
Z 1 a 1
1 1 z 1
Output 1a Output 1b
Output 1a-1b -- Demonstration of Host-Specific Collating SequencesOutput 1 shows the differences in the ordering scheme for the output produced using an EBCDIC collating sequence as opposed to an ASCII collating sequence. EBCDIC and ASCII are two of the The BY Statement
proc sort data=sasuser.houses out=houses;
by style;
/* OUTPUT 2 */
proc freq data=houses;
by style;
tables bedrooms / nopercent;
run;
------------------- STYLE=CONDO ---------------------
Cumulative
BEDROOMS Frequency Frequency
--------------------------------
2 2 2
3 1 3
4 1 4
------------------- STYLE=RANCH ---------------------
Cumulative
BEDROOMS Frequency Frequency
--------------------------------
1 1 1
2 1 2
3 2 4
Output 2 Partial Listing of the BY StatementÕs Effect on Ordering
A separate table is generated for each value of STYLE. Thus, each value of STYLE is a BY group. The tables are ordered by the ascending alphabetic values of STYLE. The SORT procedure has a DESCENDING option that you can use to sort the data in descending order. It also has other options such as ASCII or EBCDIC that you can use to control the collating sequence. The BY statement has the options DESCENDING and NOTSORTED. If you have used PROC SORT with the DESCENDING option, then you must use the DESCENDING option in a BY statement in any subsequent procedure step. The NOTSORTED option is useful when you want to list a variable in a BY statement whose values are grouped together, but not sorted in either ascending or descending order, or when you use a collat For the remainder of this article, the term BY group will denote that a BY statement is being used. For the most part, a BY group will be like a sub-data set in itself. In certain situations, this allows ordering schemes to vary from one BY group to the ne A Single Classification Variable
Procedure Statement/Option
CHART HBAR, VBAR, PIE, BLOCK, STAR statements
GROUP=, SUBGROUP= options
FREQ TABLES statement
MEANS CLASS statement
REPORT DEFINE statement with GROUP, ORDER, or ACROSS option
SUMMARY CLASS statement
TABULATE CLASS statement
Table 1 Procedures that Accept Defining Classification Variables
For most procedures, with the exception of PROC REPORT, the default ordering scheme for a single class variable with no formats or ordering options applied is ascending. The following example demonstrates this default action: /* OUTPUT 3 */ proc means data=sasuser.houses nway mean; class style; var sqfeet; run; Default behavior with PROC REPORT will be discussed later in the section "ORDER=FORMATTED".
Analysis Variable : SQFEET
STYLE N Obs Mean
-----------------------------
CONDO 4 1688.75
RANCH 4 1251.25
SPLIT 3 1370.00
TWOSTORY 4 1458.75
-----------------------------
Output 3 Default Ordering For A Single Classification Variable
Multiple Classification Variables Notice in the following table that precedence in PROC REPORT is set by the order in which the variables appear in the COLUMN statement, and not by the order of the DEFINE statements. Also notice that in the PROC TABULATE syntax, the order of precedence is set by the order in which the variables appear on the TABLE statement, not the CLASS statement. Table 2 summarizes variable precedence with the highest order class variable being A:
Procedure Syntax example
CHART proc chart;
vbar A / group=B subgroup=C;
FREQ proc freq;
tables A * B * C * D;
MEANS proc means;
class A B C D;
REPORT proc report;
column A B C D;
define C / group;
define A / group;
define B / group;
define D / group;
SUMMARY proc summary;
class A B C D;
TABULATE proc tabulate;
class D C B A;
table A,B, C * D;
Table 2 Distinguishing Class Variables and Precedence
The overall ordering scheme when multiple class variables are used is determined first by developing a master order for each class variable for the entire data set or BY group, and then by applying the master order to each subgroup of the hierarchy. This means that the order does not change from one class subgroup to the next. In the following example, the master order for STYLE and BEDROOMS is determined separately. STYLE is the highest- order class variable. Thus, BEDROOMS will form subgroups of each value of STYLE. The order for BEDROOMS is not determined again for each value of STYLE. Instead, the order is taken from the master order determined before generating any subgroups. Table 3 summarizes the master ordering scheme for the variables STYLE and BEDROOMS when using ORDER=FREQ with PROC TABULATE: STYLE BEDROOMS RANCH 2 CONDO 4 TWOSTORY 3 SPLIT 1 Table 3 Master Ordering Schemes for STYLE and BEDROOMS The master ordering schemes are developed independently for STYLE and BEDROOMS. The schemes in Table 3 were developed using ORDER=FREQ, which is discussed in detail in the section ORDER=FREQ. The following example demonstrates the use of the above master ordering schemes when STYLE and BEDROOMS are used simultaneously:
/* OUTPUT 4 */
proc tabulate data=sasuser.houses format=3. noseps
order=freq;
class style bedrooms;
table style*bedrooms, n / rts=23;
run;
| | N |
|----------------------------+---|
|STYLE BEDROOMS | |
|RANCH 2 | 1|
| 3 | 2|
| 1 | 1|
|CONDO 2 | 2|
| 4 | 1|
| 3 | 1|
|TWOSTORY 2 | 2|
| 4 | 2|
|SPLIT 4 | 1|
| 3 | 1|
| 1 | 1|
----------------------------------
Output 4 Ordering Multiple Classification Variables
Although it may not be apparent, STYLE and BEDROOMS are ordered based on the master ordering schemes in Table 3. If you repeat this example using the PRINTMISS option in the TABLE statement in PROC TABULATE, the master ordering scheme is apparent:
/* OUTPUT 5 */
proc tabulate data=sasuser.houses format=3. noseps
order=freq;
class style bedrooms;
table style*bedrooms, n / rts=23 printmiss;
run;
| | N |
|---------------------+---|
|STYLE BEDROOMS | |
|RANCH 2 | 1|
| 4 | .|
| 3 | 2|
| 1 | 1|
|CONDO 2 | 2|
| 4 | 1|
| 3 | 1|
| 1 | .|
|TWOSTORY 2 | 2|
| 4 | 2|
| 3 | .|
| 1 | .|
|SPLIT 2 | .|
| 4 | 1|
| 3 | 1|
| 1 | 1|
---------------------------
Output 5 Verifying Master Ordering Sequences
The order for each class variable, regardless of whether it is in a subgroup, is based on the master order for that variable.
Formats For the FREQ, MEANS, SUMMARY, and TABULATE procedures the default is to list the values of the class variables in ascending order by their internal values, not their formatted values. Internal, or unformatted, values are the actual values that are stored in the SAS data set. The default for PROC REPORT in Version 6 of the SAS System is to list the values of the classification variables by their formatted values. Please note this may change in a future release to be consistent with other procedures that offer the ORDER= option. This special case is discussed in more detail in the section ORDER=FORMATTED. PROC CHART behaves differently than the other procedures that use classification variables. The PROC CHART ordering scheme when formats are applied depends on whether the classification variable is character or numeric. Table 4 summarizes the default ordering schemes for the base SAS procedures when formats are applied to the classification variables:
PROCEDURE VARIABLE FORMAT ORDERED BY
CHART Numeric No Internal value
Yes Internal value
Character No Internal value
Yes Formatted value
FREQ Num or Chr Y or N Internal value
MEANS Num or Chr Y or N Internal value
REPORT Numeric No Formatted value* **
Yes Formatted value**
Character No Internal value
Yes Formatted value**
SUMMARY Num or Chr Y or N Internal value
TABULATE Num or Chr Y or N Internal value
* If no format is assigned, PROC REPORT assumes a BESTw.d format.
** The default may change to INTERNAL in a future release of the SAS System.
Table 4 Default Ordering Algorithms When Formats Are Applied
Lowest Internal Values proc format; value numf 0,3,4='GROUP A' 1,2='GROUP B'; /* OUTPUT 6 */ proc report data=sasuser.houses nowd headskip; column bedrooms; define bedrooms / group format=numf. order=internal; run; BEDROOMS GROUP B GROUP A Output 6 Ordering by the Lowest Internal Value EncounteredGROUP B appears before GROUP A because the lowest internal value encountered for GROUP B was 1, and for GROUP A was 3. The lowest possible internal value that could be in GROUP A is 0, but it did not exist in the data. Another difficult situation with lowest internal values occurs when a format contains groups or ranges that are independent from one another. Consider the following example:
data sample;
length dept $ 5;
input dept id;
cards;
PET 100
PET 110
PET 120
PET 199
PLANT 200
PLANT 210
PLANT 220
PLANT 299
;
proc format;
value idfmt
100='CAT' 110='DOG' 120='FISH' 199='OTHER'
200='CACTUS' 210='IVY' 220='FERN' 299='OTHER';
/* OUTPUT 7 */
proc tabulate data=sample noseps;
class dept id;
table dept*id, n;
format id idfmt.;
run;
| | N |
|--------------+------------|
|DEPT ID | |
|PET CAT | 1.00|
| DOG | 1.00|
| FISH | 1.00|
| OTHER | 1.00|
|PLANT OTHER | 1.00|
| CACTUS | 1.00|
| IVY | 1.00|
| FERN | 1.00|
-----------------------------
Output 7 Independent Groups of Format Ranges in a Single Format
Notice in Output 7 that for DEPT=PET, the value OTHER appears last in the sequence; it appears first for DEPT=PLANT. This is because the master ordering sequence for ID is determined before subgroups are created. Since ID=199 and ID=299 have the same forma Missing Values Most procedures that use class variables provide the MISSING option, which enables you to specify whether missing values are to be considered valid class levels. PROC FREQ has an additional option, MISSPRINT, which displays missing class levels but does not use them in calculating statistics. The timing in which a procedure classifies missing values as valid or invalid can also yield different results. Formats can group missing class levels with nonmissing class levels, causing nonmissing class levels to be treated as missing class levels. The MEANS, REPORT, SUMMARY, and TABULATE procedures classify missing values as valid or invalid before applying formats. Thus, if you do not specify MISSING, and you use a format range that groups nonmissing class levels with missing class levels, then the nonmissing class levels will still be treated as valid, whereas the missing class levels will not. PROC FREQ, on the other hand, applies formats before classifying missing values as valid or invalid. Thus, if MISSING or MISSPRINT is not used and a format range would group nonmissing class levels with missing class levels, both the nonmissing and missing class levels would be considered invalid. The following example demonstrates the difference in effect between PROC FREQ and PROC REPORT:
proc format;
value bedfmt 1='ONE' 2='TWO' other='OTHER';
data houses;
set sasuser.houses end=last;
output;
if last then do;
bedrooms=.;
output;
end;
format bedrooms bedfmt.;
/* OUTPUT 8a */
proc print data=houses;
title "WORK.HOUSES";
var bedrooms;
format bedrooms;
/* OUTPUT 8b */
proc freq data=houses;
title1 "PROC FREQ";
title2 "Without MISSING Specified";
tables bedrooms / nocum nopercent;
run;
/* OUTPUT 8c */
proc report data=houses nowd headline;
title1 "PROC REPORT";
title2 "Without MISSING Specified";
column bedrooms n;
define bedrooms / group width=8;
run;
WORK.HOUSES
OBS BEDROOMS
1 2
2 1
3 2
4 4
5 3
6 4
7 3
8 3
9 2
10 4
11 3
12 2
13 1
14 4
15 2
16 .
Output 8a WORK.HOUSES Data Set
Output 8b-8c Difference in Timing for Evaluating Missing Values
PROC FREQ PROC REPORT
Without MISSING Specified Without MISSING Specified
BEDROOMS Frequency BEDROOMS N
-------------------- -------------------
ONE 2 ONE 2
TWO 5 OTHER 8
TWO 5
Frequency Missing = 9
Output 8b Output 8c
Notice that PROC FREQ does not include the class level OTHER, whereas PROC REPORT does. This is because PROC FREQ applies the BEDFMT format to BEDROOMS before classifying missing values as invalid. PROC REPORT, on the other hand, classifies missing values as invalid before applying the format, allowing the nonmissing class levels that would normally be grouped with the missing class levels to still be treated as valid. You can verify this by observing the frequency counts. A total of 16 observations reside in the WORK.HOUSES data set. PROC FREQ reports nine invalid class levels, whereas PROC REPORT treats only one class level as invalid. If the lowest internal value in a format group is a missing value, then the entire group will be treated as missing. Because missing values are considered the lowest possible internal value for either numeric or character variables, missing values cause an entire format group to have the lowest internal; thus, missing values rank first when ordering by internal values. To demonstrate this point, add the MISSING option to the previous example:
In the results for PROC FREQ, notice that the group OTHER is ordered first in the sequence; it appears second in the PROC REPORT results. By default, PROC FREQ orders by internal values, and PROC REPORT orders by formatted values. Also notice the frequency counts for OTHER are the same in both procedures. In PROC FREQ it corresponds to the missing frequency in Output 8. For PROC REPORT, the frequency increases by one to account for the single missing value that was not included in Output 8.
BY Variables
Notice that even though the values of 1, 2, and 4 are in the same format range, they are not grouped together because the values in the range do not represent a sequence. Thus, three BY groups are created rather than two.
The ORDER= Option Procedure Default CHART Not Available FREQ INTERNAL MEANS INTERNAL REPORT FORMATTED* SUMMARY INTERNAL TABULATE INTERNAL * Please note that due to the variation of this default, it may be changed to ORDER=INTERNAL in a future release of the SAS System. Table 5 Defaults for the ORDER= Option
ORDER=INTERNAL
Notice that the values for bedroom are listed as ONE, TWO, THREE, FOUR, corresponding to the internal order of 1, 2, 3, 4. If you donÕt specify ORDER=INTERNAL, the values would be listed in alphabetical order by the formatted values: FOUR, ONE, THREE, TWO.
ORDER=FORMATTED
Notice that the values for BEDROOMS are ordered as FOUR, ONE, THREE, TWO (alphabetical order of the formatted values). If ORDER=FORMATTED is used and a class variable does not have a format associated with it, then the classification variable will be ordered by its internal values. Special Case: PROC REPORT--As mentioned previously, ORDER=FORMATTED is the default setting for PROC REPORT. This setting can pose problems because PROC REPORT uses a default format of BESTw. for numeric variables with unspecified formats. The following example demonstrates what can happen:
Notice that in Output 13a, the default format is BEST9. If a format is not specified for a numeric variable, then a BESTw. format is applied. Since the default setting for PROC REPORT is ORDER=FORMATTED, the values will be sorted using the formatted values. This can be complicated by character comparisons that sometimes give misleading results. In Output 13a the values actually being compared are " 1", " 2", " 3", "1.5", and "2.5". Since a blank character sorts before a number or a period (.), the values " 1", " 2", and " 3" sort before the values 1.5 or 2.5. This problem is corrected in Output 13b because there are no leading blanks in the comparisons. This problem is also corrected in Output 13c because the internal values, not the formatted values, are used to determine the order.
ORDER=DATA
Notice that the order for STYLE is neither ascending nor descending. STYLE=RANCH appears first because it was the first value encountered in the input data set; STYLE=SPLIT appears second because it was encountered second, and so on. If you use a BY statement, the order is reset at the beginning of each new BY group, as if a new data set were being processed. When you use multiple classification variables, the order is determined independently for each classification variable across the entire data set or BY group. The ordering scheme is then developed first using the master order of the highest order variable, then the next highest, and so on. The following example demonstrates this point:
Classification Variables Notice that the order for STYLE is RANCH, SPLIT, CONDO, TWOSTORY. The order for BEDROOMS is 2, 1, 4, 3. This is not clearly apparent in the output because no value of STYLE has all four values of BEDROOMS. The order of values is established independently for each variable, and not according to the subgroups. You can verify this by adding the PRINTMISS option to the TABLE statement, or by comparing the order of STYLE and BEDROOMS in the SASUSER.HOUSES data set (listed in the appendix) with Output 15. ORDER=FREQ If two class levels have the same frequency, a secondary ordering algorithm will be used. All base SAS procedures, except PROC FREQ, use ORDER=DATA as a secondary ordering method. If duplicate frequency counts occur with PROC FREQ when using ORDER=FREQ, then PROC FREQ will use the ORDER=FORMATTED method as a secondary ordering method. Furthermore for PROC FREQ, if a format has not been applied, then the tie will be broken using the ORDER=INTERNAL method. If you use PROC REPORT with ORDER=FREQ and DESCENDING, and a tie occurs, then the ORDER=DATA method will be used, but the levels will be listed in reverse order of occurrence in the data. The following table summarizes the behavior across the base SAS procedures that use the ORDER= option:
Procedure Direction Format Event of Tie
FREQ Descending No ORDER=INTERNAL
Yes ORDER=FORMATTED
MEANS Descending Y or N ORDER=DATA
REPORT Ascending Y or N ORDER=DATA
Descending Y or N ORDER=DATA*
SUMMARY Descending Y or N ORDER=DATA
TABULATE Descending Y or N ORDER=DATA
* If you specify DESCENDING, then both the primary method, ORDER=FREQ, and
the secondary method, ORDER=DATA, list levels in descending order.
Table 6 Secondary Ordering Algorithms for ORDER=FREQ
The following series of examples demonstrates the differences in behavior when you use ORDER=FREQ:
You should note two main differences in Example 16. First, PROC TABULATE and PROC FREQ list the data values in descending order of the frequency counts. PROC REPORT requires you to specify the DEFINE statement option, DESCENDING. Second, PROC TABULATE and PROC REPORT use the ORDER=DATA method to handle tied values, whereas PROC FREQ uses ORDER=FORMATTED. If you donÕt specify a format, PROC FREQ uses ORDER=INTERNAL. Now, look at what happens in Output 16h with PROC REPORT when DESCENDING is applied. Not only are data values listed in descending order of frequency counts, but the data values that had the same frequency are listed in reverse order of occurrence. In other words, ORDER=DATA is used to handle ties, but the data values are listed in reverse order. As with ORDER=DATA, when you add a BY statement, the order for classification variables is established as though each BY group were a separate data set. When you use multiple classification variables with ORDER=FREQ, the order is determined independently for each classification variable. The overall ordering scheme is then applied first by using the order of the highest order variable, then the next-highest order variable, and so on. Now, consider the frequencies for BATHS:
If you make BEDROOMS the highest order variable and BATHS the next-highest order variable, you expect the rows to be ordered by the descending frequency of values first in BEDROOMS, then in BATHS. The following example demonstrates this:
Notice that the N statistic does not list the frequencies of the combinations in descending order. Instead, the order is set first by the descending frequencies of BEDROOMS and then by BATHS. You can verify this by comparing the order of each variable against the orders used in Output 16a and Output 17. The master ordering sequence for BEDROOMS is 2, 4, 3, 1. The master ordering sequence for BATHS is 1, 3, 2.5, 1.5, 2.
PROC TABULATE PROC FREQ
Without Format Without Format
| | N |
|--------------+---| BEDROOMS Frequency
|BEDROOMS | | --------------------
|2 | 5| 2 5
|4 | 4| 3 4
|3 | 4| 4 4
|1 | 2| 1 2
--------------------
Output 16a Output 16b
-------------------------------------------------------------------------------
PROC REPORT PROC REPORT
Without Format Without Format
Without DESCENDING With DESCENDING
BEDROOMS N BEDROOMS N
------------------- -------------------
1 2 2 5
4 4 3 4
3 4 4 4
2 5 1 2
Output 16c Output 16d
-------------------------------------------------------------------------------
PROC TABULATE PROC FREQ
With Format With Format
| | N |
|--------------+---| BEDROOMS Frequency
|BEDROOMS | | --------------------
|TWO | 5| TWO 5
|FOUR | 4| FOUR 4
|THREE | 4| THREE 4
|ONE | 2| ONE 2
--------------------
Output 16e Output 16f
-------------------------------------------------------------------------------
PROC REPORT PROC REPORT
With Format With Format
Without DESCENDING With DESCENDING
BEDROOMS N BEDROOMS N
------------------- -------------------
ONE 2 TWO 5
FOUR 4 THREE 4
THREE 4 FOUR 4
TWO 5 ONE 2
Output 16g Output 16h
Output 16a-16h Differences When Using ORDER=FREQ
Supplemental Ordering Options
| | N |
|-----------------+---|
|BATHS | |
|1 | 5|
|3 | 4|
|2.5 | 3|
|1.5 | 2|
|2 | 1|
-----------------------
Output 17 Demonstration of ORDER=FREQ with BATHS
PROC CHART does not have an ORDER= option, but it does have the options DESCENDING and MIDPOINTS=. These can be used to affect the ordering of the classification variables. For more information on these supplemental ordering options, refer to the SAS Procedures Guide, Version 6, Third Edition; SAS Guide to the REPORT Procedure, Usage and Reference, Version 6, First Edition; and SAS Guide to TABULATE Processing, Second Edition.
| | N |
|-----------------+---|
|BEDROOMS BATHS | |
|2 1 | 3|
| 1.5 | 1|
| 2 | 1|
|4 3 | 2|
| 2.5 | 2|
|3 3 | 2|
| 2.5 | 1|
| 1.5 | 1|
|1 1 | 2|
-----------------------
Output 18 Demonstration of ORDER=FREQ with Multiple Classification Variables
Conclusion
Appendix options nodate nonumber nolabel ls=64 ps=90; title; The following is the SASUSER.HOUSES data set, which is used to supply the data for most examples in this article: STYLE SQFEET BEDROOMS BATHS STREET PRICE RANCH 1250 2 1.0 Sheppard Avenue $64,000 SPLIT 1190 1 1.0 Rand Street $65,850 CONDO 1400 2 1.5 Market Street $80,050 TWOSTORY 1810 4 3.0 Garris Street $107,250 RANCH 1500 3 3.0 Kemble Avenue $86,650 SPLIT 1615 4 3.0 West Drive $94,450 SPLIT 1305 3 1.5 Graham Avenue $73,650 CONDO 1390 3 2.5 Hampshire Avenue $79,350 TWOSTORY 1040 2 1.0 Sanders Road $55,850 CONDO 2105 4 2.5 Jeans Avenue $127,150 RANCH 1535 3 3.0 State Highway $89,100 TWOSTORY 1240 2 1.0 Fairbanks Circle $69,250 RANCH 720 1 1.0 Nicholson Drive $34,550 TWOSTORY 1745 4 2.5 Highland Road $102,950 CONDO 1860 2 2.0 Arcata Avenue $110,700 |