|
Ordering Data Values: Dummy Data and Mapping Techniques
by Jason R. Sharpe
Abstract
Introduction
Ordering Techniques Dummy Data
Define the Problem
/* OUTPUT 1 */
class style baths;
/ rts=10 misstext='0'; | | STYLE | | |-----------------------------------| | | CONDO | RANCH | SPLIT |TWOSTORY| |---------+--------+--------+--------+--------+ | |N |MEAN |N |MEAN |N |MEAN |N |MEAN | |---------+--+-----+--+-----+--+-----+--+-----+ |BATHS | | | | | | | | | |1.0 |0 | 0| 2| 985| 1| 1190| 2| 1140| |1.5 |1 | 1400| 0| 0| 1| 1305| 0| 0| |2.0 |1 | 1860| 0| 0| 0| 0| 0| 0| |2.5 |2 | 1748| 0| 0| 0| 0| 1| 1745| |3.0 |0 | 0| 2| 1518| 1| 1615| 1| 1810| ----------------------------------------------- Output 1 Default Ordering Sequence for Dummy Data Example Suppose you want to order the data values of STYLE so they appear in the following order: CONDO, SPLIT, RANCH, and TWOSTORY, while maintaining the numeric order for BATHS. The desired order for STYLE is neither ascending nor descending alphabetically, and the values in the input data set are not in the desired order. Thus, neither procedure defaults nor options can produce the desired result. To solve the problem, append a dummy data set before the original data set to obtain the order with ORDER=DATA. Create the Dummy Data Set data dummy;
length style $ 8; CONDO 1 SPLIT 1.5 RANCH 2
TWOSTORY 2.5 Notice that the value of TWOSTORY is repeated at the end of the dummy data
set. The value is repeated to fill in the value of STYLE for the extra value of
BATHS. This is important because, by default, procedures ignore observations in
which a class variable is missing. Thus, frequency counts, sums, and other
statistics could have incorrect results. Alternatively, if the MISSING option is
used to keep observations with missing class levels, then missing levels appear
in the output. Concatenating without BY Groups data houses;
set dummy sasuser.houses(in=is_valid);
if is_valid then valid=1; Example 2 Concatenating the Dummy Data Set without BY Groups A partial listing of observations from the WORK.HOUSES data set follows. The first five observations are from the WORK.DUMMY data set. STYLE BATHS SQFEET VALID CONDO 1.0 . . SPLIT 1.5 . . RANCH 2.0 . . TWOSTORY 2.5 . . TWOSTORY 3.0 . . RANCH 1.0 1250 1 SPLIT 1.0 1190 1 CONDO 1.5 1400 1 Notice that this DATA step creates a new variable named VALID that distinguishes valid observations in the SASUSER.HOUSES data set from dummy observations in the WORK.DUMMY data set. This is necessary if you plan to base any calculations on frequency counts and need to prevent the dummy observations from being counted. Concatenating with BY Groups /* create work.houses by sorting sasuser.houses */ proc sort data=sasuser.houses out=houses; by bedrooms; /* append dummy in front of each by group */ data houses; /* loop once for each observation in work.houses */ do i1=1 to nobs1; set houses point=i1 nobs=nobs1;
/* check for the beginning of a new by group */
if bedrooms ne last_bed then do;
/* loop once for each observation in dummy */
do i2=1 to nobs2;
set dummy point=i2 nobs=nobs2;
/* set valid=. to indicate a dummy ob */
valid=.;
/* set all analysis variables to missing */
sqfeet=.;
/* output dummy observation to work.houses */
output;
end;
/* output valid observation to work.houses */
set houses point=i1;
valid=1;
output;
end;
/* output valid observation to work.houses */
else output;
/* prepare last_bed for next by group check */
last_bed = bedrooms;
end; /* force data step to stop because point= is used */ stop; Notice that SQFEET is set to missing prior to the SET statement for the WORK.DUMMY data set. For correct results, analysis variables must be set to missing to prevent their values from being retained in the observations from the WORK.DUMMY data set. If there are many analysis variables, this can be cumbersome to code. Arrays, the macro facility, or the SQL procedure can reduce the code needed to set analysis variables to missing, but a discussion of these tools is beyond the scope of this article. The desired values for the analysis variables are retrieved from the SET statement for the WORK.HOUSES data set. Because a BY statement cannot be used with the SET statement option POINT=, a retained variable, LAST_BED, is used to determine when a BY group has changed. The following is a partial listing of the resulting data set:
BEDROOMS STYLE BATHS SQFEET VALID
1 CONDO 1.0 . .
1 SPLIT 1.5 . .
1 RANCH 2.0 . .
1 TWOSTORY 2.5 . .
1 TWOSTORY 3.0 . .
1 SPLIT 1.0 1190 1
1 RANCH 1.0 720 1
2 CONDO 1.0 . .
2 SPLIT 1.5 . .
2 RANCH 2.0 . .
2 TWOSTORY 2.5 . .
2 TWOSTORY 3.0 . .
2 RANCH 1.0 1250 1
2 CONDO 1.5 1400 1
2 TWOSTORY 1.0 1040 1
2 TWOSTORY 1.0 1240 1
2 CONDO 2.0 1860 1
Notice that the observations from the DUMMY data set appear at the beginning of each BY group and that for these observations the value for VALID is missing. The observations from the SASUSER.HOUSES data set are distinguished by VALID=1. Generate the Report
/* OUTPUT 4 */
proc tabulate data=houses noseps
class style baths;
/ rts=10 misstext='0'; | | STYLE | | |-----------------------------------| | | CONDO | SPLIT | RANCH |TWOSTORY| |---------+--------+--------+--------+--------+ | |N |MEAN |N |MEAN |N |MEAN |N |MEAN | |---------+--+-----+--+-----+--+-----+--+-----+ |BATHS | | | | | | | | | |1.0 |0 | 0| 1| 1190| 2| 985| 2| 1140| |1.5 |1 | 1400| 1| 1305| 0| 0| 0| 0| |2.0 |1 | 1860| 0| 0| 0| 0| 0| 0| |2.5 |2 | 1748| 0| 0| 0| 0| 1| 1745| |3.0 |0 | 0| 1| 1615| 2| 1518| 1| 1810| Output 2 Desired Ordering Sequence Obtained with the Dummy Data Method The final ordering sequence in Output 2 shows the values of STYLE in the desired order of CONDO, SPLIT, RANCH, and TWOSTORY, while maintaining an ascending numeric sequence for BATHS. Notice that even though dummy observations were added to the SASUSER.HOUSES data set, the frequency counts are not affected because these frequency counts are based on the non-missing values of VALID. In addition to controlling an ordering scheme, dummy data can also be used to guarantee that all possible values for a class variable will appear in the report regardless of whether they appear in the original data. A detailed discussion of this can found in Observations¨, Volume 2, Number 1, Fourth Quarter 1992, pages 64-69. Mapping Techniques
Define the Problem /* OUTPUT 3 */ proc tabulate data=sasuser.houses noseps order=data;
class style baths;
/ rts=10 misstext='0'; | | STYLE | | |-----------------------------------| | | RANCH | SPLIT | CONDO |TWOSTORY| |---------+--------+--------+--------+--------+ | |N |MEAN |N |MEAN |N |MEAN |N |MEAN | |---------+--+-----+--+-----+--+-----+--+-----+ |BATHS | | | | | | | | | |1.0 |2 | 985| 1| 1190| 0| 0| 2| 1140| |1.5 |0 | 0| 1| 1305| 1| 1400| 0| 0| |3.0 |2 | 1518| 1| 1615| 0| 0| 1| 1810| |2.5 |0 | O| 0| 0| 2| 1748| 1| 1745| |2.0 |0 | 0| 0| 0| 1| 1860| 0| 0| Output 3 Default Ordering Sequence for Mapping Example Notice that BATHS is not in numeric order. Instead, it reflects the sequence of the values in the input data set. The goal is to maintain the order for STYLE achieved by using ORDER=DATA and to order BATHS in ascending numeric order. The easiest solution is to use PROC REPORT rather than PROC TABULATE, because PROC REPORT lets you assign a different ORDER= setting to each classification variable. In this situation, BATHS is defined with the ORDER=INTERNAL option, and STYLE is defined with the ORDER=DATA option. The following example demonstrates the PROC REPORT alternative: /* OUTPUT 4 */ proc report data=sasuser.houses headline spacing=1 nowd; column baths style,('--' sqfeet sqfeet=avgsqft) ('ALL/--' sqfeet=all_n sqfeet=all_mean); define baths / group format=5.1 order=internal left; define style / across order=data "STYLE/--"; define sqfeet / n format=2.0 "N"; define avgsqft / mean format=5.0 "MEAN"; define all_n / n format=2.0 "N"; define all_mean / mean format=5.0 "MEAN"; run;
STYLE
------------------------------------
RANCH SPLIT CONDO TWOSTORY ALL
------- -------- -------- -------- --------
BATHS N MEAN N MEAN N MEAN N MEAN N MEAN
-----------------------------------------------------
1.0 2 985 1 1190 0 0 2 1140 5 1088
1.5 0 0 1 1305 1 1400 0 0 2 1353
2.0 0 0 0 0 1 1860 0 0 1 1860
2.5 0 0 0 0 2 1748 1 1745 3 1747
3.0 2 1518 1 1615 0 0 1 1810 4 1615
Output 4 Desired Ordering Scheme Generated by PROC REPORT
This example solves the problem by simply choosing a different procedure. To
demonstrate mapping techniques, however, letÕs assume that PROC TABULATE must be
used for a required report style. Since there are no options with PROC TABULATE
that produce the desired ordering scheme, mapping techniques must be used. Non-
Dynamic Mapping Techniques Original New Format Format Internal Value Internal Value Range Label RANCH 1 1 RANCH SPLIT 2 2 SPLIT CONDO 3 3 CONDO TWOSTORY 4 4 TWOSTORY Table 1 Required Mapping of STYLE The next example demonstrates how to explicitly recode the values of STYLE. This example emphasizes the actual mapping of the values. More dynamic techniques are generally preferred when many data values are involved. A method for dynamic processing is illustrated in the section, "Mapping with Dynamic Techniques." /* create an informat and format to map the values */ proc format; invalue style 'RANCH'=1
'SPLIT'=2
'CONDO'=3
'TWOSTORY'=4;
value style 1='RANCH'
2='SPLIT'
3='CONDO'
4='TWOSTORY';
/* change old internal values to new sequential values */ data houses;
set sasuser.houses; Example 4 Mapping without Dynamic Techniques In this example, numeric informats and formats are used to establish a numeric order when sorted. Although character informats and formats would require less space to store, they will not work because numbers stored in character variables do not sort numerically. For example, a character string Ô11Õ sorts before Ô2Õ, whereas a numeric 2 sorts before a numeric 11. The INPUT function matches the value of STYLE to a specific range in the STYLE. informat and assigns the corresponding informat label to the variable STYLE2. STYLE is then dropped, and STYLE2 is renamed to STYLE. To complete the mapping, the STYLE. format is permanently associated with STYLE using a FORMAT statement. Notice that the same name is used for both the informat and format. This is possible because the SAS System determines whether to interpret the name as an informat or as a format based on how it is used in the syntax. For example, if STYLE. is used in an INPUT function, the SAS System looks for an informat. Likewise, if STYLE. is used on a PUT statement, then it is interpreted as a reference to a format. Original Values of WORK.HOUSES WORK.HOUSES STYLE in after with SASUSER.HOUSES Remapping Style Format Applied RANCH 1 RANCH SPLIT 2 SPLIT CONDO 3 CONDO TWOSTORY 4 TWOSTORY Table 2 SASUSER.HOUSES before and after Mapping STYLE Notice that columns 1 and 3 are identical. The actual internal values are shown in column 2. The order is based on these internal values. Thus, when the format is applied, the desired mapping has been achieved. Example 4 is not considered dynamic because it requires that you know values for STYLE and that you specifically code each possible value in the PROC FORMAT step. If there are hundreds of values to code, this task would be very tedious. Fortunately, there are dynamic methods that can produce the same result. Mapping with Dynamic Techniques PROC FORMAT control data sets provide the ability to create a format using a
SAS data set as input. This allows you to map data values without knowing what
the data values are since you just want to display the values as they occur in
the input data set. Since PROC TABULATE orders class variable values in
ascending order by default, only STYLE needs to PROC FREQ with the ORDER=DATA option creates an output data set with STYLE arranged in the desired order. The output data set is then modified to conform to the structure of a control data set. This example achieves the desired mapping without specifying variable values in the code: /* build the key data set using order=data */ proc freq data=sasuser.houses order=data; tables style / out=key(keep=style) noprint; /* build the control data set */ data control;
set key;
/* Output A numeric informat observation */
start =style;
/* Output A numeric format observation */
start =put(_n_,8.);
/* separate the informat from the format by sorting */
/* because the informat and format must be contiguous */
/* for proc format to correctly use them */
proc sort data=control; /* generate format and informat using the cntlin= */ /* option with proc format */ proc format cntlin=control;
/* remap the style values in master data set */
data houses; Example 5 Mapping with Dynamic Techniques This process creates a data set with a single observation for each data value to be mapped. These values must exist in the data set in the order desired for the report. This is easily achieved by creating an output data set with PROC FREQ. A control data set with two observations for each unique data value is then created. The first observation is for a numeric informat, the second for a numeric format. The informat observations are separated from the format observations by sorting the control data set by TYPE. This is necessary because PROC FORMAT requires that all informat and format observations be contiguous in the control data set. The informat and format are then created by running PROC FORMAT with the CNTLIN= option. Finally, the mapping is performed by using the INPUT function with the new informat. The original values are restored by the FORMAT statement. The same result is achieved with both Example 5 and Example 4. The hard- coded approach in Example 4 requires less code for this example, but it does require that you know what the data values are. If there are many data values to map, use the dynamic approach in Example 5 which requires far less code and no program maintenance. Using PROC FORMAT control data sets is an essential tool for many ordering problems. For more information on using PROC FORMAT control data sets, refer to SAS Procedures Guide, Version 6, Third Edition and to "Creating Formats with Control Data Sets Using the FORMAT Procedure," Observations, First Quarter 1992. Generate the Report Finally, use the mapped data set in a procedure. The following example is a modification of Example 3, using the new internal values and the corresponding format in conjunction with the ORDER=INTERNAL option: /* OUTPUT 5 */ proc tabulate data=houses noseps;
class style baths;
/ rts=10 misstext='0';
| | STYLE |
| |-----------------------------------|
| | RANCH | SPLIT | CONDO |TWOSTORY|
|---------+--------+--------+--------+--------+
| |N |MEAN |N |MEAN |N |MEAN |N |MEAN |
|---------+--+-----+--+-----+--+-----+--+-----+
|BATHS | | | | | | | | |
|1.0 |2 | 985| 1| 1190| 0| 0| 2| 1140|
|1.5 |0 | 0| 1| 1305| 1| 1400| 0| 0|
|2.0 |0 | 0| 0| 0| 1| 1860| 0| 0|
|2.5 |0 | 0| 0| 0| 2| 1748| 1| 1745|
|3.0 |2 | 1518| 1| 1615| 0| 0| 1| 1810
|ALL |4 | 1251| 3| 1370| 4| 1689| 4| 1459|
-----------------------------------------------
Output 5 Desired Ordering Sequence for Mapping Problem
Notice that the desired ordering scheme of BATHS in numeric order and STYLE in an ORDER=DATA sequence is achieved without any effect on the statistics. Note that using ORDER=INTERNAL in the PROC TABULATE statement is not necessary because it is the default for this procedure. Common Ordering Problems
Ordering Formatted SAS Date Values
/* create sample data set */
data sample; 06JAN93 /* Wednesday - SAS date value: 12059 */ 08JAN93 /* Friday - SAS date value: 12061 */ 09JAN93 /* Saturday - SAS date value: 12062 */ 11JAN93 /* Monday - SAS date value: 12064 */ 14JAN93 /* Thursday - SAS date value: 12067 */ 17JAN93 /* Sunday - SAS date value: 12070 */ 19JAN93 /* Tuesday - SAS date value: 12072 */ 21JAN93 /* Thursday - SAS date value: 12074 */ 22JAN93 /* Friday - SAS date value: 12075 */ ; /* OUTPUT 6 */ proc chart data=sample;
hbar date / discrete space=0 axis=0 1 2 nostat;
format date weekdate3.; Here, the day of the week and SAS date value for each date are provided for clarity. By default, PROC CHART orders the data values by their internal values.
DATE
|
----------+---------+
1 2
Frequency
Output 6 Complications Due to Lowest Internal Values
Notice that the days of the week do not appear in a standard calendar sequence or in alphabetical order. These values are ordered by their internal, numeric SAS date values. When a single format label applies to more than one internal value, only the lowest internal value is considered when ordering the group. The value of Wed is first in the sequence because the lowest internal SAS date value falls on a Wednesday (06JAN93). Solutions to the problem are attempted with dummy data and then with mapping. In this situation, mapping is the technique of choice because PROC CHART ignores the dummy observations if the VALID variable is used to maintain correct frequency counts. The following two examples show the results of attempting to correct the problem with dummy data and then with mapping.
Dummy Data Solution for Formatted /* generate the dummy data set */ data dummy; do date='03JAN60'D to '09JAN60'D;
output; /* concatenate WORK.DUMMY with WORK.SAMPLE */ data sample2;
set dummy sample(in=is_valid); /* OUTPUT 6 */ proc chart data=sample2; hbar date / discrete space=0 axis=0 1 2 nostat missing freq=valid format date weekdate3.; run;
The required dummy SAS date values are generated from the dates 03JAN60 through
09JAN60, corresponding to Sun through Sat respectively. However, dummy data do
not work for PROC CHART. Normally, the VALID variable is necessary to maintain
the original frequency counts. PROC CHART does not use an observation with a
missing value for a FREQ= or SUMVAR= variable. Since VALID is missing on every
observation in the DUMMY data set and the FREQ=VALID option is used, all the
observations in the DUMMY data set are ignored. The option MISSING is not
helpful, since MISSING applies only to missing values in the chart variable, not
the FREQ= or SUMVAR= variables. Thus, using dummy data results in the same
ordering as in Output 6 and is, therefore, insufficient for solving this sort of
ordering problem for PROC CHART. For more information on how PROC CHART handles
missing values, refer to page 169 of SAS Technical Report P-222: Changes and
Enhancements to Base SAS Software, Mapping Alternative for Formatted SAS Date Values Fortunately, the mapping technique does eliminate the ordering problem for PROC CHART. The following example illustrates: /* map the sas date values to numeric weekdays using */ /* the weekday function */ data sample2;
set sample; /* create a day of the week format */ proc format; value dayfmt 1='Sun' 2='Mon' 3='Tue' 4='Wed' 5='Thu' 6='Fri' 7='Sat'; /* OUTPUT 7 */ proc chart data=sample2;
hbar date / discrete space=0 axis=0 1 2 nostat;
format date dayfmt.; In this example, the original date values are mapped to numeric values 1 through 7 as returned by the WEEKDAY function. The user-defined format then displays the day the number corresponds to. Output 7 shows the results of the mapping method. DATE Sun |********** Mon |********** Tue |********** Wed |********** Thu |******************** Fri |******************** Sat |**********
|
----------+---------+
1 2
Frequency
Output 7 Results of Mapping Alternative
Sometimes the mapping method is insufficient for solving ordering problems. An example would be forcing all possible values of a class variable to appear in a certain order when all the values may or may not exist in the original data set. A specific discussion of this situation can be found in Observations, Fourth Quarter 1992, pages 64-69. Obtaining a Non-Sequential Ordering Scheme In situations where the desired order for the classification variable is not sequential, you can use dummy data or mapping. Your choice depends on a number of factors, such as which procedure will be used or the number of data set variables and data values to control. Generally, dummy data is the method of choice when there are few class variables and data values to order. The mapping approach proves more useful in cases with many class variables, but when only a few need non-sequential ordering schemes. The following two examples demonstrate the solution using dummy data and then mapping. The goal is to have BATHS ordered as 1.0, 2.0, 3.0, 1.5, 2.5 and STYLE ordered as CONDO, SPLIT, RANCH, and TWOSTORY. Dummy Data Solution for a Non-Sequential Order The following example demonstrates solving the problem using the dummy data method: /* create the dummy data set */ data dummy;
length style $ 8; CONDO 1 SPLIT 2 RANCH 3
TWOSTORY 1.5 /* concatenate dummy with sasuser.houses */ data houses;
set dummy sasuser.houses(in=is_valid);
if is_valid then valid=1; /* OUTPUT 8 */ proc tabulate data=houses noseps order=data;
class style baths;
mean*f=5.)
/ rts=10 misstext='0';
format baths 3.1; Example 6 Dummy Data for Non-Sequential Scheme In this case, the dummy data approach proves to be simple. The only change from the solution for Example 1 is the creation of the WORK.DUMMY data set. Output 8 shows the results obtained using dummy data:
| | STYLE |
| |-----------------------------------|
| | CONDO | SPLIT | RANCH |TWOSTORY|
|---------+--------+--------+--------+--------+
| |N |MEAN |N |MEAN |N |MEAN |N |MEAN |
|---------+--+-----+--+-----+--+-----+--+-----+
|BATHS | | | | | | | | |
|1.0 |0 | 0| 1| 1190| 2| 985| 2| 1140|
|2.0 |1 | 1860| 0| 0| 0| 0| 0| 0|
|3.0 |0 | 0| 1| 1615| 2| 1518| 1| 1810|
|1.5 |1 | 1400| 1| 1305| 0| 0| 0| 0|
|2.5 |2 | 1748| 0| 0| 0| 0 1| 1745|
-----------------------------------------------
Output 8 Result of Non-Sequential Dummy Data Approach
Notice that the desired ordering scheme is obtained without affecting the statistics.
Mapping Approach for Obtaining a /* generate the control data set using */ /* dummy in example 6 */ data control;
set dummy;
fmtname = 'BATHS'; /* INFORMAT FOR BATHS */
start = put(baths,3.1);
fmtname = 'STYLE'; /* FORMAT FOR STYLE */
start = put(_n_,8.);
fmtname = 'BATHS'; /* FORMAT FOR BATHS */
label = put(baths,3.1); /* separate the informat from the format by sorting */ proc sort data=control nodupkey; by fmtname type start; /* generate format and informat using control data set */ proc format cntlin=control; /* map the data values of master data set */ data houses;
set sasuser.houses; /* OUTPUT 8 */ proc tabulate data=houses noseps order=internal;
class style baths;
style*(valid=' '*n*f=2. sqfeet=' '*mean*f=5.)
/ rts=10 misstext='0';
run;
Example 7 Mapping Solution for Non-Sequential Scheme
There are a few key points to note about Example 7. The START variable is character because it is set to the values of STYLE. Since the values of BATHS must also be stored in START, a PUT function is used to convert the numeric values of BATHS to character. This is important because the INPUT function is used to remap the values of STYLE and BATHS. Since the INPUT function requires character values in its first parameter, the same character strings used to generate the format range must all be used during the mapping process for correct matches to be made. This is accomplished by using the same PUT function that created the START variable as the first parameter of the INPUT function. PROC FORMAT requires that there be no repeated ranges in the START variable. Since TWOSTORY appears twice for STYLE in the DUMMY data set, the NODUPKEY option is used with PROC SORT to eliminate duplicate ranges in STYLE. The result is the same as the Output 8 dummy data result. For this example, the dummy data technique requires less coding than for mapping. However, this is not always the case. It pays to think ahead before choosing a particular technique. Ordering by Statistics PROC SQL can generate a report with values ordered by statistics generated from analysis variables. The following step demonstrates PROC SQLÕs ability:
/* OUTPUT 9 */
options label; select
style label='STYLE',
n(style) label='N Obs',
mean(price) as price label='Mean' format=dollar8.
from sasuser.houses
group style
order price desc;
run; STYLE N Obs Mean CONDO 4 $99,313 TWOSTORY 4 $83,825 SPLIT 3 $77,983 RANCH 4 $68,575 Output 9 Result of PROC SQL Ordering by Statistic
If you cannot use the output generated with PROC SQL, the procedure can also
generate a view or table (output data Ordering a classification variable by the value of a statistic, a more traditional method, involves multiple steps. First, run a procedure to summarize the values and calculate the necessary statistic. Next, use PROC SORT to sort the data set by the statistic. At this point you can choose from ORDER=DATA, dummy data, or mapping. Examples of each variation follow. Using ORDER=DATA to Order by a Statistic When there is only one classification variable that needs to be ordered by a statistic, ORDER=DATA provides a simple solution for deriving the desired order. The following example demonstrates: /* OUTPUT 10a */ proc means data=sasuser.houses mean maxdec=0;
title1 "DEFAULT"; /* calculate the mean of price for each value of style */ proc means data=sasuser.houses nway noprint;
class style;
/* sort by descending mean values of price */
proc sort data=houses; /* OUTPUT 10b */ proc means data=houses mean maxdec=0 order=data;
title1 "DESIRED"; Notice that in the initial ordering sequence STYLE is listed in ascending alphabetical order. In the desired ordering sequence, STYLE is ordered by the descending mean values of PRICE. Obviously PROC SQL is the better approach here, because it uses far less code and makes fewer passes through the data. Ordering by a Statistic When Multiple Class Variables Are Used When using multiple class variables, ordering by statistics becomes complicated. Your approach depends on the layout of the report. Dummy data sets and mapping choices are different for list style output than for crosstabulations. For the purpose of this article, a simple example is discussed. More complex ordering schemes are discussed in Part 3 of this series. The following example presents the default ordering sequence: /* OUTPUT 11 */ proc tabulate data=sasuser.houses noseps format=7.;
class style baths; The goal is to order STYLE by the descending values of each mean in the ALL column while maintaining the numeric order for BATHS. The two examples demonstrate using dummy data and then mapping. DEFAULT DESIRED ORDERING SEQUENCE ORDERING SEQUENCE Analysis Variable : PRICE Analysis Variable : PRICE STYLE N Obs Mean STYLE N Obs Mean ----------------------------- ----------------------------- CONDO 4 99313 CONDO 4 99313 RANCH 4 68575 TWOSTORY 4 83825 SPLIT 3 77983 SPLIT 3 77983 TWOSTORY 4 83825 RANCH 4 68575 ----------------------------- ----------------------------- Output 10a Output 10b Output 10a-10b Result of Ordering by Statistic Using ORDER=DATA Dummy Data Solution for Ordering by a Statistic The following example demonstrates using the dummy data method to order by a statistic: /* calculate the mean of price for each value of style */ proc summary data=sasuser.houses nway;
class style; /* sort by descending mean values of price */ proc sort data=temp1; by descending price; /* generate ascending values for bedrooms */ proc freq data=sasuser.houses; tables baths / out=temp2 noprint; /* generate the dummy data set */ data dummy;
merge temp1(in=a) temp2(in=b); /* concatenate dummy with sasuser.houses */ data houses;
set dummy sasuser.houses; /* OUTPUT 13 */ proc tabulate data=houses noseps format=7. order=data;
class style baths; The problem of multiple class variables is addressed by generating two temporary data sets and merging them together. The WORK.DUMMY data set can have no missing values because the observations would be ignored by PROC TABULATE. To prevent this, the IN= variables A and B are used to indicate when a temporary data set has been exhausted of observations and additional values are needed to fill in. Since the fill in value is unimportant, the first observation from the exhausted data set is used. The following is a partial listing of WORK.HOUSES: STYLE BATHS PRICE CONDO 1.0 . TWOSTORY 1.5 . SPLIT 2.0 . RANCH 2.5 . CONDO 3.0 . RANCH 1.0 64000 SPLIT 1.0 65850 CONDO 1.5 80050 TWOSTORY 3.0 107250 RANCH 3.0 86650 Notice that the first five observations have missing values for PRICE. The observations are originally from the WORK.DUMMY data set. There are five observations because BATHS has five unique values. Since there are only four unique values for STYLE, the first value of STYLE is used to fill in for the fifth BATHS observation. Notice that the column for ALL lists the mean values in descending order. The result is obtained because the DUMMY data set has STYLE and BATHS in the desired order to be used with the ORDER=DATA option.
| BATHS | |
|---------------------------------------| |
| 1 | 1.5 | 2 | 2.5 | 3 | ALL |
|-------+-------+-------+-------+-------+-------|
| PRICE | PRICE | PRICE | PRICE | PRICE | PRICE |
|-------+-------+-------+-------+-------+-------|
| MEAN | MEAN | MEAN | MEAN | MEAN | MEAN |
--------------+-------+-------+-------+-------+-------+-------|
STYLE | | | | | | |
CONDO | .| 80050| 110700| 103250| .| 99313|
RANCH | 49275| .| .| .| 87875| 68575|
SPLIT | 65850| 73650| .| .| 94450| 77983|
TWOSTORY | 62550| .| .| 102950| 107250| 83825|
Output 11 Default Ordering Sequence for Statistic Ordering
Mapping Solution for Ordering by a Statistic The mapping method begins with summarizing the values and calculating the statistics just as the dummy data method does. Once the input data set is summarized and the statistics are calculated and stored in an output data set, the output data set can then be transformed into a control data set for mapping. The following example demonstrates: /* calculate the mean statistic for each style */ proc summary data=sasuser.houses nway;
class style; /* sort by descending means of price */ proc sort data=key; by descending price; /* build control data set */ data control;
set key; /*separate the informat from the format by sorting */ proc sort data=control; by type; /* generate format and informat using control data set */ proc format cntlin=control; /* map the data values of master data set */ data houses;
set sasuser.houses; /* OUTPUT 12 */ proc tabulate data=houses noseps format=7.;
class style baths;
| BATHS | |
|---------------------------------------| |
| 1 | 1.5 | 2 | 2.5 | 3 | ALL |
|-------+-------+-------+-------+-------+-------|
| PRICE | PRICE | PRICE | PRICE | PRICE | PRICE |
|-------+-------+-------+-------+-------+-------|
| MEAN | MEAN | MEAN | MEAN | MEAN | MEAN |
--------------+-------+-------+-------+-------+-------+-------|
STYLE | | | | | | |
CONDO | .| 80050| 110700| 103250| .| 99313|
TWOSTORY | 62550| .| .| 102950| 107250| 83825|
SPLIT | 65850| 73650| .| .| 94450| 77983|
RANCH | 49275| .| .| .| 87875| 68575|
Output 12 Result of Ordering by a Statistic
For this example, the mapping solution requires slightly more code than the dummy data approach. However, the mapping solution is less complicated because only one class variable needs to be mapped. Mapping STYLE produces the same result as seen in Output 12. Thus, dummy data and mapping are both suitable methods. Conclusion SAS, SAS/ASSIST, and Observations are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ¨ indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies. Appendix options nodate nonumber nolabel ls=64; The following is the SASUSER.HOUSES data set, which is used to supply the data for most examples in this article: STYLE SQFEET BEDROOMS BATHS STREET PRICE RANCH 1250 2 1.0 Sheppard Avenue $64,000 SPLIT 1190 1 1.0 Rand Street $65,850 CONDO 1400 2 1.5 Market Street $80,050 TWOSTORY 1810 4 3.0 Garris Street $107,250 RANCH 1500 3 3.0 Kemble Avenue $86,650 SPLIT 1615 4 3.0 West Drive $94,450 SPLIT 1305 3 1.5 Graham Avenue $73,650 CONDO 1390 3 2.5 Hampshire Avenue $79,350 TWOSTORY 1040 2 1.0 Sanders Road $55,850 CONDO 2105 4 2.5 Jeans Avenue $127,150 RANCH 1535 3 3.0 State Highway $89,100 TWOSTORY 1240 2 1.0 Fairbanks Circle $69,250 RANCH 720 1 1.0 Nicholson Drive $34,550 TWOSTORY 1745 4 2.5 Highland Road $102,950 CONDO 1860 2 2.0 Arcata Avenue $110,700> |