TS-DOC: TS-408 - Ordering Data Values: Dummy Data and Mapping Techniques

TS-408

Ordering Data Values: Dummy Data and Mapping Techniques

by Jason R. Sharpe
Jason R. Sharpe is a technical support analyst at SAS Institute. His areas of expertise include base SAS software procedures and the macro facility. Jason has BS degrees in electrical engineering and computer engineering from North Carolina State University.

Abstract
This is the second article in a three-part series about ordering data values. In the previous issue of Observations¨, "Ordering Data Values: Procedure Defaults and Options" discusses the similarities and differences between the ways procedures order data values. Part 2 demonstrates dummy data and mapping techniques for controlling the order of data values when procedure defaults and options are insufficient. The dummy data method involves concatenating a dummy data set in front of an original data set to introduce the desired order that can be obtained with ORDER=DATA. Mapping involves changing the data values to numbers corresponding to the desired order and then using a format to display the original values. The final article in this series, "Double Mapping and Combining Techniques," will discuss double mapping and combining techniques to obtain complex ordering schemes. Double mapping involves creating dummy variables to be used as higher order class variables to control the order of the remaining class variables. Combining techniques involves using multiple techniques simultaneously.

Introduction
The primary focus of this series is controlling the order of data values in the CHART, FREQ, MEANS, REPORT, SUMMARY, and TABULATE procedures. However, information contained in this article can be useful for other procedures as well. This article assumes you have an intermediate knowledge of these procedures, as well as the SORT and FORMAT procedures and the DATA step. You should also be familiar with the information discussed in "Procedure Defaults and Options." If you encounter syntax that is unfamiliar to you, please consult the appropriate documentation before continuing. Most examples use the SASUSER.HOUSES data set that is distributed with SAS/ASSIST¨ software. The SASUSER.HOUSES data set and the system options used in the examples appear in the Appendix.

Ordering Techniques
Two ordering techniques are discussed in the article. The Dummy data technique involves appending dummy data to the main data so the desired order can be obtained using ORDER=DATA. The second technique, mapping, uses formats to display the original values that are replaced with numbers corresponding to the desired order to be obtained with ORDER=INTERNAL.

Dummy Data
The dummy data method involves concatenating a dummy data set in front of the original data set so the desired order can be obtained with ORDER=DATA. It is most useful when the desired order is known before program execution and when there are relatively few data values involved. However, its use can be extended for applications with a large number of data values.

Define the Problem
Consider the following example and output:

/* OUTPUT 1 */
proc tabulate data=sasuser.houses noseps;

class style baths;
var sqfeet;
table baths, style*sqfeet=' '*(n*f=2. mean*f=5.)

/ rts=10 misstext='0';
format baths 3.1;
run;
Example 1 Default Ordering Sequence for Dummy Data Example

|         |               STYLE               |
|         |-----------------------------------|
|         | CONDO  | RANCH  | SPLIT  |TWOSTORY|
|---------+--------+--------+--------+--------+
|         |N |MEAN |N |MEAN |N |MEAN |N |MEAN |
|---------+--+-----+--+-----+--+-----+--+-----+
|BATHS    |  |     |  |     |  |     |  |     |
|1.0      |0 |    0| 2|  985| 1| 1190| 2| 1140|
|1.5      |1 | 1400| 0|    0| 1| 1305| 0|    0|
|2.0      |1 | 1860| 0|    0| 0|    0| 0|    0|
|2.5      |2 | 1748| 0|    0| 0|    0| 1| 1745|
|3.0      |0 |    0| 2| 1518| 1| 1615| 1| 1810|
-----------------------------------------------
Output 1        Default Ordering Sequence for Dummy Data Example

Suppose you want to order the data values of STYLE so they appear in the following order: CONDO, SPLIT, RANCH, and TWOSTORY, while maintaining the numeric order for BATHS. The desired order for STYLE is neither ascending nor descending alphabetically, and the values in the input data set are not in the desired order. Thus, neither procedure defaults nor options can produce the desired result. To solve the problem, append a dummy data set before the original data set to obtain the order with ORDER=DATA.

Create the Dummy Data Set
First, create a dummy data set. The dummy data set should contain all classification variables, but not BY variables. In Example 1, the classification variables are STYLE and BATHS. The dummy data set must contain only these variables, with their data values listed in the desired order. Since the order for each variable is determined independently, the values are completely unrelated to one another. It is important, however, that the dummy data set contain no undesired missing values so that you avoid unwanted missing levels in your output. If one variable has fewer unique values than another, any previous value for that variable can be used to fill in. The following example creates the dummy data set needed to control the order of STYLE and BATHS:

data dummy;

length style $ 8;
input style baths;
cards;

CONDO     1
SPLIT     1.5
RANCH     2

TWOSTORY 2.5
TWOSTORY 3
;

Notice that the value of TWOSTORY is repeated at the end of the dummy data set. The value is repeated to fill in the value of STYLE for the extra value of BATHS. This is important because, by default, procedures ignore observations in which a class variable is missing. Thus, frequency counts, sums, and other statistics could have incorrect results. Alternatively, if the MISSING option is used to keep observations with missing class levels, then missing levels appear in the output. Concatenating without BY Groups
If you do not plan to use a BY statement in your procedure, you can use the following DATA step to concatenate the WORK.DUMMY and SASUSER.HOUSES data sets:

data houses;

set dummy sasuser.houses(in=is_valid); if is_valid then valid=1;
run;

Example 2 Concatenating the Dummy Data Set without BY Groups

A partial listing of observations from the WORK.HOUSES data set follows. The first five observations are from the WORK.DUMMY data set.

STYLE       BATHS    SQFEET    VALID
CONDO        1.0         .       .
SPLIT        1.5         .       .
RANCH        2.0         .       .
TWOSTORY     2.5         .       .
TWOSTORY     3.0         .       .
RANCH        1.0      1250       1
SPLIT        1.0      1190       1
CONDO        1.5      1400       1

Notice that this DATA step creates a new variable named VALID that distinguishes valid observations in the SASUSER.HOUSES data set from dummy observations in the WORK.DUMMY data set. This is necessary if you plan to base any calculations on frequency counts and need to prevent the dummy observations from being counted.

Concatenating with BY Groups
If you use BY groups, you need a slightly different method for concatenation to ensure the dummy data set precedes each BY group, because the procedure re-establishes the order each time the BY group changes. The following example concatenates the WORK.DUMMY data set to SASUSER.HOUSES using BEDROOMS as a BY variable:

/* create work.houses by sorting sasuser.houses */ proc sort data=sasuser.houses out=houses;

by bedrooms;

/* append dummy in front of each by group */ data houses;

/* loop once for each observation in work.houses */ do i1=1 to nobs1;

set houses point=i1 nobs=nobs1;

         /* check for the beginning of a new by group   */
      if bedrooms ne last_bed then do;

            /* loop once for each observation in dummy    */
         do i2=1 to nobs2;
            set dummy point=i2 nobs=nobs2;

               /* set valid=. to indicate a dummy ob      */
            valid=.;

               /* set all analysis variables to missing   */
            sqfeet=.;

               /* output dummy observation to work.houses */
            output;
         end;

            /* output valid observation to work.houses    */
         set houses point=i1;
         valid=1;
         output;
      end;

         /* output valid observation to work.houses       */
      else output;

         /* prepare last_bed for next by group check      */
      last_bed = bedrooms;

end;

/* force data step to stop because point= is used */ stop;

Notice that SQFEET is set to missing prior to the SET statement for the WORK.DUMMY data set. For correct results, analysis variables must be set to missing to prevent their values from being retained in the observations from the WORK.DUMMY data set. If there are many analysis variables, this can be cumbersome to code. Arrays, the macro facility, or the SQL procedure can reduce the code needed to set analysis variables to missing, but a discussion of these tools is beyond the scope of this article.

The desired values for the analysis variables are retrieved from the SET statement for the WORK.HOUSES data set. Because a BY statement cannot be used with the SET statement option POINT=, a retained variable, LAST_BED, is used to determine when a BY group has changed. The following is a partial listing of the resulting data set:

BEDROOMS    STYLE       BATHS    SQFEET    VALID
    1       CONDO        1.0         .       .
    1       SPLIT        1.5         .       .
    1       RANCH        2.0         .       .
    1       TWOSTORY     2.5         .       .
    1       TWOSTORY     3.0         .       .
    1       SPLIT        1.0      1190       1
    1       RANCH        1.0       720       1
    2       CONDO        1.0         .       .
    2       SPLIT        1.5         .       .
    2       RANCH        2.0         .       .
    2       TWOSTORY     2.5         .       .
    2       TWOSTORY     3.0         .       .
    2       RANCH        1.0      1250       1
    2       CONDO        1.5      1400       1
    2       TWOSTORY     1.0      1040       1
    2       TWOSTORY     1.0      1240       1
    2       CONDO        2.0      1860       1

Notice that the observations from the DUMMY data set appear at the beginning of each BY group and that for these observations the value for VALID is missing. The observations from the SASUSER.HOUSES data set are distinguished by VALID=1.

Generate the Report
The final step in the dummy data approach is to use the concatenated data set in a procedure with the ORDER=DATA option. The following example is a modification of Example 1 that uses the WORK.HOUSES data set created in Example 2:

/* OUTPUT 4 */ proc tabulate data=houses noseps
order=data;

class style baths;
var sqfeet valid;
table baths, style*(valid=' '*n*f=2. sqfeet=''*mean*f=5.)

/ rts=10 misstext='0';
format baths 3.1;
run;

|         |               STYLE               |
|         |-----------------------------------|
|         | CONDO  | SPLIT  | RANCH  |TWOSTORY|
|---------+--------+--------+--------+--------+
|         |N |MEAN |N |MEAN |N |MEAN |N |MEAN |
|---------+--+-----+--+-----+--+-----+--+-----+
|BATHS    |  |     |  |     |  |     |  |     |
|1.0      |0 |    0| 1| 1190| 2|  985| 2| 1140|
|1.5      |1 | 1400| 1| 1305| 0|    0| 0|    0|
|2.0      |1 | 1860| 0|    0| 0|    0| 0|    0|
|2.5      |2 | 1748| 0|    0| 0|    0| 1| 1745|
|3.0      |0 |    0| 1| 1615| 2| 1518| 1| 1810|

Output 2 Desired Ordering Sequence Obtained with the Dummy Data Method

The final ordering sequence in Output 2 shows the values of STYLE in the desired order of CONDO, SPLIT, RANCH, and TWOSTORY, while maintaining an ascending numeric sequence for BATHS. Notice that even though dummy observations were added to the SASUSER.HOUSES data set, the frequency counts are not affected because these frequency counts are based on the non-missing values of VALID.

In addition to controlling an ordering scheme, dummy data can also be used to guarantee that all possible values for a class variable will appear in the report regardless of whether they appear in the original data. A detailed discussion of this can found in Observations¨, Volume 2, Number 1, Fourth Quarter 1992, pages 64-69.

Mapping Techniques
Mapping is a technique through which the internal data values are reassigned to values that, when sorted, result in the desired order. Mapping is especially useful for applications with many data values to control. This article demonstrates two different approaches to the mapping method. These methods can be extended for use in larger, more complicated applications.

Define the Problem
Consider the following example:

/* OUTPUT 3 */ proc tabulate data=sasuser.houses noseps order=data;

class style baths;
var sqfeet;
table baths, style*sqfeet=' '*(n*f=2. mean*f=5.)

/ rts=10 misstext='0';
format baths 3.1;
run;
Example 3 Default Ordering Sequence for Mapping Example

|         |               STYLE               |
|         |-----------------------------------|
|         | RANCH  | SPLIT  | CONDO  |TWOSTORY|
|---------+--------+--------+--------+--------+
|         |N |MEAN |N |MEAN |N |MEAN |N |MEAN |
|---------+--+-----+--+-----+--+-----+--+-----+
|BATHS    |  |     |  |     |  |     |  |     |
|1.0      |2 |  985| 1| 1190| 0|    0| 2| 1140|
|1.5      |0 |    0| 1| 1305| 1| 1400| 0|    0|
|3.0      |2 | 1518| 1| 1615| 0|    0| 1| 1810|
|2.5      |0 |    O| 0|    0| 2| 1748| 1| 1745|
|2.0      |0 |    0| 0|    0| 1| 1860| 0|    0|

Output 3 Default Ordering Sequence for Mapping Example Notice that BATHS is not in numeric order. Instead, it reflects the sequence of the values in the input data set. The goal is to maintain the order for STYLE achieved by using ORDER=DATA and to order BATHS in ascending numeric order. The easiest solution is to use PROC REPORT rather than PROC TABULATE, because PROC REPORT lets you assign a different ORDER= setting to each classification variable. In this situation, BATHS is defined with the ORDER=INTERNAL option, and STYLE is defined with the ORDER=DATA option. The following example demonstrates the PROC REPORT alternative:

/* OUTPUT 4 */ proc report data=sasuser.houses headline spacing=1 nowd;

column baths style,('--' sqfeet sqfeet=avgsqft)

('ALL/--' sqfeet=all_n sqfeet=all_mean); define baths / group format=5.1 order=internal left; define style / across order=data "STYLE/--"; define sqfeet / n format=2.0 "N"; define avgsqft / mean format=5.0 "MEAN"; define all_n / n format=2.0 "N"; define all_mean / mean format=5.0 "MEAN"; run;

                              STYLE
              ------------------------------------
               RANCH   SPLIT    CONDO   TWOSTORY   ALL
              ------- -------- -------- -------- --------
    BATHS     N  MEAN  N  MEAN  N  MEAN  N  MEAN  N  MEAN
    -----------------------------------------------------
    1.0       2   985  1  1190  0     0  2  1140  5  1088
    1.5       0     0  1  1305  1  1400  0     0  2  1353
    2.0       0     0  0     0  1  1860  0     0  1  1860
    2.5       0     0  0     0  2  1748  1  1745  3  1747
    3.0       2  1518  1  1615  0     0  1  1810  4  1615

Output 4        Desired Ordering Scheme Generated by PROC REPORT

This example solves the problem by simply choosing a different procedure. To demonstrate mapping techniques, however, letÕs assume that PROC TABULATE must be used for a required report style. Since there are no options with PROC TABULATE that produce the desired ordering scheme, mapping techniques must be used. Non- Dynamic Mapping Techniques
To achieve the desired order, the internal values of BATHS and STYLE need to be values that naturally sort into the desired order. Since BATHS already has values that sort in ascending order, only the values of STYLE need to be mapped. The following table shows the mapping needed:

Original             New             Format       Format
Internal Value    Internal Value     Range        Label
RANCH                  1               1          RANCH
SPLIT                  2               2          SPLIT
CONDO                  3               3          CONDO
TWOSTORY               4               4          TWOSTORY

Table 1 Required Mapping of STYLE

The next example demonstrates how to explicitly recode the values of STYLE. This example emphasizes the actual mapping of the values. More dynamic techniques are generally preferred when many data values are involved. A method for dynamic processing is illustrated in the section, "Mapping with Dynamic Techniques."

/* create an informat and format to map the values */ proc format;

invalue style 'RANCH'=1

                 'SPLIT'=2
                 'CONDO'=3
                 'TWOSTORY'=4;

value style 1='RANCH'

                 2='SPLIT'
                 3='CONDO'
                 4='TWOSTORY';

/* change old internal values to new sequential values */ data houses;

set sasuser.houses;
style2=input(style,style.);
drop style;
rename style2=style;
format style2 style.;
run;

Example 4 Mapping without Dynamic Techniques

In this example, numeric informats and formats are used to establish a numeric order when sorted. Although character informats and formats would require less space to store, they will not work because numbers stored in character variables do not sort numerically. For example, a character string Ô11Õ sorts before Ô2Õ, whereas a numeric 2 sorts before a numeric 11.

The INPUT function matches the value of STYLE to a specific range in the STYLE. informat and assigns the corresponding informat label to the variable STYLE2. STYLE is then dropped, and STYLE2 is renamed to STYLE. To complete the mapping, the STYLE. format is permanently associated with STYLE using a FORMAT statement.

Notice that the same name is used for both the informat and format. This is possible because the SAS System determines whether to interpret the name as an informat or as a format based on how it is used in the syntax. For example, if STYLE. is used in an INPUT function, the SAS System looks for an informat. Likewise, if STYLE. is used on a PUT statement, then it is interpreted as a reference to a format.

Original Values of      WORK.HOUSES     WORK.HOUSES
STYLE in                after           with
SASUSER.HOUSES          Remapping Style Format Applied
RANCH                   1               RANCH
SPLIT                   2               SPLIT
CONDO                   3               CONDO
TWOSTORY                4              TWOSTORY

Table 2 SASUSER.HOUSES before and after Mapping STYLE

Notice that columns 1 and 3 are identical. The actual internal values are shown in column 2. The order is based on these internal values. Thus, when the format is applied, the desired mapping has been achieved.

Example 4 is not considered dynamic because it requires that you know values for STYLE and that you specifically code each possible value in the PROC FORMAT step. If there are hundreds of values to code, this task would be very tedious. Fortunately, there are dynamic methods that can produce the same result.

Mapping with Dynamic Techniques
Several tools are available to generate the code and the formats needed to remap data. One especially useful tool is a PROC FORMAT control data set.

PROC FORMAT control data sets provide the ability to create a format using a SAS data set as input. This allows you to map data values without knowing what the data values are since you just want to display the values as they occur in the input data set. Since PROC TABULATE orders class variable values in ascending order by default, only STYLE needs to
be mapped.

PROC FREQ with the ORDER=DATA option creates an output data set with STYLE arranged in the desired order. The output data set is then modified to conform to the structure of a control data set. This example achieves the desired mapping without specifying variable values in the code:

/* build the key data set using order=data */ proc freq data=sasuser.houses order=data;

tables style / out=key(keep=style) noprint;

/* build the control data set */ data control;

set key;
fmtname='STYLE';

/* Output A numeric informat observation */ start =style;
label =put(_n_,8.);
type='I';
output;

/* Output A numeric format observation */ start =put(_n_,8.);
label =style;
type='N';
output;

/* separate the informat from the format by sorting */ /* because the informat and format must be contiguous */ /* for proc format to correctly use them */ proc sort data=control;
by type;

   /* generate format and informat using the cntlin=      */
   /* option with proc format                             */

proc format cntlin=control;

/* remap the style values in master data set */ data houses;
set sasuser.houses;
style2=input(style,style.);
drop style;
rename style2=style;
format style2 style.;
run;

Example 5 Mapping with Dynamic Techniques

This process creates a data set with a single observation for each data value to be mapped. These values must exist in the data set in the order desired for the report. This is easily achieved by creating an output data set with PROC FREQ. A control data set with two observations for each unique data value is then created. The first observation is for a numeric informat, the second for a numeric format. The informat observations are separated from the format observations by sorting the control data set by TYPE. This is necessary because PROC FORMAT requires that all informat and format observations be contiguous in the control data set. The informat and format are then created by running PROC FORMAT with the CNTLIN= option. Finally, the mapping is performed by using the INPUT function with the new informat. The original values are restored by the FORMAT statement.

The same result is achieved with both Example 5 and Example 4. The hard- coded approach in Example 4 requires less code for this example, but it does require that you know what the data values are. If there are many data values to map, use the dynamic approach in Example 5 which requires far less code and no program maintenance. Using PROC FORMAT control data sets is an essential tool for many ordering problems. For more information on using PROC FORMAT control data sets, refer to SAS Procedures Guide, Version 6, Third Edition and to "Creating Formats with Control Data Sets Using the FORMAT Procedure," Observations, First Quarter 1992. Generate the Report

Finally, use the mapped data set in a procedure. The following example is a modification of Example 3, using the new internal values and the corresponding format in conjunction with the ORDER=INTERNAL option:

/* OUTPUT 5 */ proc tabulate data=houses noseps;

class style baths;
var sqfeet;
table baths, style*sqfeet=' '*(n*f=2. mean*f=5.)

/ rts=10 misstext='0';
format baths 3.1;
run;

       |         |               STYLE               |
       |         |-----------------------------------|
       |         | RANCH  | SPLIT  | CONDO  |TWOSTORY|
       |---------+--------+--------+--------+--------+
       |         |N |MEAN |N |MEAN |N |MEAN |N |MEAN |
       |---------+--+-----+--+-----+--+-----+--+-----+
       |BATHS    |  |     |  |     |  |     |  |     |
       |1.0      |2 |  985| 1| 1190| 0|    0| 2| 1140|
       |1.5      |0 |    0| 1| 1305| 1| 1400| 0|    0|
       |2.0      |0 |    0| 0|    0| 1| 1860| 0|    0|
       |2.5      |0 |    0| 0|    0| 2| 1748| 1| 1745|
       |3.0      |2 | 1518| 1| 1615| 0|    0| 1| 1810
       |ALL      |4 | 1251| 3| 1370| 4| 1689| 4| 1459|
       -----------------------------------------------

Output 5        Desired Ordering Sequence for Mapping Problem

Notice that the desired ordering scheme of BATHS in numeric order and STYLE in an ORDER=DATA sequence is achieved without any effect on the statistics. Note that using ORDER=INTERNAL in the PROC TABULATE statement is not necessary because it is the default for this procedure.

Common Ordering Problems
The remainder of this article concentrates on illustrating common ordering problems, comparing and contrasting dummy data and mapping methods. The following common ordering problems are solved:

ordering formatted SAS date values
obtaining a non-sequential ordering scheme
ordering by the result of a statistic.

Ordering Formatted SAS Date Values
SAS date values are numeric values corresponding to the number of days from January 1, 1960. Since the values are numeric, a SAS date variable can be ordered by its internal values to appear in a correct calendar sequence. Usually this is done by using the ORDER=INTERNAL option and applying a format to the SAS date variable. Complications can arise if you need to group the SAS date values by day of the week, month, or year. Problems occur when the order of the lowest internal values for the format groups does not correspond to the expected sequence. Consider the following example:

/* create sample data set */ data sample;
input date date7.;
cards;

06JAN93   /* Wednesday - SAS date value: 12059            */
08JAN93   /* Friday    - SAS date value: 12061            */
09JAN93   /* Saturday  - SAS date value: 12062            */
11JAN93   /* Monday    - SAS date value: 12064            */
14JAN93   /* Thursday  - SAS date value: 12067            */
17JAN93   /* Sunday    - SAS date value: 12070            */
19JAN93   /* Tuesday   - SAS date value: 12072            */
21JAN93   /* Thursday  - SAS date value: 12074            */
22JAN93   /* Friday    - SAS date value: 12075            */

;

/* OUTPUT 6 */ proc chart data=sample;

hbar date / discrete space=0 axis=0 1 2 nostat; format date weekdate3.;
run;

Here, the day of the week and SAS date value for each date are provided for clarity. By default, PROC CHART orders the data values by their internal values.

DATE
Wed |********** Fri |******************** Sat |********** Mon |********** Thu |******************** Sun |********** Tue |**********

      |
      ----------+---------+
                1         2
      Frequency

Output 6        Complications Due to Lowest Internal Values

Notice that the days of the week do not appear in a standard calendar sequence or in alphabetical order. These values are ordered by their internal, numeric SAS date values. When a single format label applies to more than one internal value, only the lowest internal value is considered when ordering the group. The value of Wed is first in the sequence because the lowest internal SAS date value falls on a Wednesday (06JAN93).

Solutions to the problem are attempted with dummy data and then with mapping. In this situation, mapping is the technique of choice because PROC CHART ignores the dummy observations if the VALID variable is used to maintain correct frequency counts. The following two examples show the results of attempting to correct the problem with dummy data and then with mapping.

Dummy Data Solution for Formatted
SAS Date Values
Consider the following example:

/* generate the dummy data set */ data dummy;

do date='03JAN60'D to '09JAN60'D;

output;
end;
run;

/* concatenate WORK.DUMMY with WORK.SAMPLE */ data sample2;

set dummy sample(in=is_valid);
if is_valid then valid=1;
run;

/* OUTPUT 6 */ proc chart data=sample2;

hbar date / discrete space=0 axis=0 1 2 nostat missing

freq=valid format date weekdate3.; run;

The required dummy SAS date values are generated from the dates 03JAN60 through 09JAN60, corresponding to Sun through Sat respectively. However, dummy data do not work for PROC CHART. Normally, the VALID variable is necessary to maintain the original frequency counts. PROC CHART does not use an observation with a missing value for a FREQ= or SUMVAR= variable. Since VALID is missing on every observation in the DUMMY data set and the FREQ=VALID option is used, all the observations in the DUMMY data set are ignored. The option MISSING is not helpful, since MISSING applies only to missing values in the chart variable, not the FREQ= or SUMVAR= variables. Thus, using dummy data results in the same ordering as in Output 6 and is, therefore, insufficient for solving this sort of ordering problem for PROC CHART. For more information on how PROC CHART handles missing values, refer to page 169 of SAS Technical Report P-222: Changes and Enhancements to Base SAS Software,
Release 6.07.

Mapping Alternative for Formatted SAS Date Values Fortunately, the mapping technique does eliminate the ordering problem for PROC CHART. The following example illustrates:

/* map the sas date values to numeric weekdays using */ /* the weekday function */ data sample2;

set sample;
weekday=weekday(date);
drop date;
rename weekday=date;

/* create a day of the week format */ proc format; value dayfmt 1='Sun' 2='Mon' 3='Tue' 4='Wed' 5='Thu' 6='Fri' 7='Sat';

/* OUTPUT 7 */ proc chart data=sample2;

hbar date / discrete space=0 axis=0 1 2 nostat; format date dayfmt.;
run;

In this example, the original date values are mapped to numeric values 1 through 7 as returned by the WEEKDAY function. The user-defined format then displays the day the number corresponds to. Output 7 shows the results of the mapping method.

DATE

Sun |********** Mon |********** Tue |********** Wed |********** Thu |******************** Fri |******************** Sat |**********

      |
      ----------+---------+
                1         2
      Frequency
Output 7        Results of Mapping Alternative

Sometimes the mapping method is insufficient for solving ordering problems. An example would be forcing all possible values of a class variable to appear in a certain order when all the values may or may not exist in the original data set. A specific discussion of this situation can be found in Observations, Fourth Quarter 1992, pages 64-69.

Obtaining a Non-Sequential Ordering Scheme In situations where the desired order for the classification variable is not sequential, you can use dummy data or mapping. Your choice depends on a number of factors, such as which procedure will be used or the number of data set variables and data values to control. Generally, dummy data is the method of choice when there are few class variables and data values to order. The mapping approach proves more useful in cases with many class variables, but when only a few need non-sequential ordering schemes. The following two examples demonstrate the solution using dummy data and then mapping. The goal is to have BATHS ordered as 1.0, 2.0, 3.0, 1.5, 2.5 and STYLE ordered as CONDO, SPLIT, RANCH, and TWOSTORY. Dummy Data Solution for a Non-Sequential Order

The following example demonstrates solving the problem using the dummy data method:

/* create the dummy data set */ data dummy;

length style $ 8;
input style baths;
cards;

CONDO     1
SPLIT     2
RANCH     3

TWOSTORY 1.5
TWOSTORY 2.5
;

/* concatenate dummy with sasuser.houses */ data houses;

set dummy sasuser.houses(in=is_valid); if is_valid then valid=1;
run;

/* OUTPUT 8 */ proc tabulate data=houses noseps order=data;

class style baths;
var sqfeet valid;
table baths, style*(valid=' '*n*f=2. sqfeet=' '*

         mean*f=5.)
      / rts=10 misstext='0';

format baths 3.1;
run;

Example 6 Dummy Data for Non-Sequential Scheme

In this case, the dummy data approach proves to be simple. The only change from the solution for Example 1 is the creation of the WORK.DUMMY data set. Output 8 shows the results obtained using dummy data:

       |         |               STYLE               |
       |         |-----------------------------------|
       |         | CONDO  | SPLIT  | RANCH  |TWOSTORY|
       |---------+--------+--------+--------+--------+
       |         |N |MEAN |N |MEAN |N |MEAN |N |MEAN |
       |---------+--+-----+--+-----+--+-----+--+-----+
       |BATHS    |  |     |  |     |  |     |  |     |
       |1.0      |0 |    0| 1| 1190| 2|  985| 2| 1140|
       |2.0      |1 | 1860| 0|    0| 0|    0| 0|    0|
       |3.0      |0 |    0| 1| 1615| 2| 1518| 1| 1810|
       |1.5      |1 | 1400| 1| 1305| 0|    0| 0|    0|
       |2.5      |2 | 1748| 0|    0| 0|    0  1| 1745|
       -----------------------------------------------
Output 8        Result of Non-Sequential Dummy Data Approach

Notice that the desired ordering scheme is obtained without affecting the statistics.

Mapping Approach for Obtaining a
Non-Sequential Scheme
The mapping technique requires that data values be altered to numbers corresponding to the desired ordering sequence and that formats be created to display the original values. Some of your alternatives to accomplish this are INVALUE and VALUE statements, DATA step IF/THEN or SELECT statements, or control data sets. For this situation, the method requiring the least syntax is the INVALUE and VALUE statements in PROC FORMAT. However, since this has already been demonstrated, control data sets will be used instead to illustrate how to generate multiple informats and formats simultaneously. For this example, the DUMMY data set created in Example 6 is used as input:

   /* generate the control data set using       */
   /* dummy in example 6        */

data control;

set dummy;
fmtname = 'STYLE'; /* INFORMAT FOR STYLE */ start = style;
label = put(_n_,8.);
type = 'I';
output;

fmtname = 'BATHS'; /* INFORMAT FOR BATHS */ start = put(baths,3.1);
output;

fmtname = 'STYLE'; /* FORMAT FOR STYLE */ start = put(_n_,8.);
label = style;
type = 'N';
output;

fmtname = 'BATHS'; /* FORMAT FOR BATHS */ label = put(baths,3.1);
output;

/* separate the informat from the format by sorting */ proc sort data=control nodupkey;

by fmtname type start;

/* generate format and informat using control data set */ proc format cntlin=control;

/* map the data values of master data set */ data houses;

set sasuser.houses;
style2=input(style,style.);
baths2=input(put(baths,3.1),baths.); drop style baths;
rename style2=style baths2=baths;
format style2 style. baths2 baths.;
run;

/* OUTPUT 8 */ proc tabulate data=houses noseps order=internal;

class style baths;
var sqfeet;
table baths,

      style*(valid=' '*n*f=2. sqfeet=' '*mean*f=5.)
      / rts=10 misstext='0';
run;
Example 7       Mapping Solution for Non-Sequential Scheme

There are a few key points to note about Example 7. The START variable is character because it is set to the values of STYLE. Since the values of BATHS must also be stored in START, a PUT function is used to convert the numeric values of BATHS to character. This is important because the INPUT function is used to remap the values of STYLE and BATHS. Since the INPUT function requires character values in its first parameter, the same character strings used to generate the format range must all be used during the mapping process for correct matches to be made. This is accomplished by using the same PUT function that created the START variable as the first parameter of the INPUT function.

PROC FORMAT requires that there be no repeated ranges in the START variable. Since TWOSTORY appears twice for STYLE in the DUMMY data set, the NODUPKEY option is used with PROC SORT to eliminate duplicate ranges in STYLE. The result is the same as the Output 8 dummy data result.

For this example, the dummy data technique requires less coding than for mapping. However, this is not always the case. It pays to think ahead before choosing a particular technique.

Ordering by Statistics
Use the ORDER=FREQ option if you want the values of your classification variable in descending frequency of occurrence. However, if you need to order the classification variable by the value of another statistic, then you need to use alternative ordering techniques. There are two approaches worth considering. The first uses PROC SQL, and the second uses a more traditional SAS programming approach.

PROC SQL can generate a report with values ordered by statistics generated from analysis variables. The following step demonstrates PROC SQLÕs ability:

/* OUTPUT 9 */ options label;
proc sql;

select

      style                label='STYLE',
      n(style)             label='N Obs',
      mean(price) as price label='Mean' format=dollar8.
      from sasuser.houses
      group style
      order price desc;

run;

STYLE N Obs Mean

CONDO            4   $99,313
TWOSTORY         4   $83,825
SPLIT            3   $77,983
RANCH            4   $68,575

Output 9        Result of PROC SQL Ordering by Statistic

If you cannot use the output generated with PROC SQL, the procedure can also generate a view or table (output data
set) that can be used as input to other procedures. An explanation of using PROC SQL is beyond the scope of this article. For more information on PROC SQL, see the SAS Guide to the SQL Procedures, Usage and Reference, Version 6, First Edition.

Ordering a classification variable by the value of a statistic, a more traditional method, involves multiple steps. First, run a procedure to summarize the values and calculate the necessary statistic. Next, use PROC SORT to sort the data set by the statistic. At this point you can choose from ORDER=DATA, dummy data, or mapping. Examples of each variation follow.

Using ORDER=DATA to Order by a Statistic When there is only one classification variable that needs to be ordered by a statistic, ORDER=DATA provides a simple solution for deriving the desired order. The following example demonstrates:

/* OUTPUT 10a */ proc means data=sasuser.houses mean maxdec=0;

title1 "DEFAULT";
title2 "ORDERING SEQUENCE";
class style;
var price;

/* calculate the mean of price for each value of style */ proc means data=sasuser.houses nway noprint;

class style;
var price;
output out=houses mean= n=n;

/* sort by descending mean values of price */ proc sort data=houses;
by descending price;

/* OUTPUT 10b */ proc means data=houses mean maxdec=0 order=data;

title1 "DESIRED";
title2 "ORDERING SEQUENCE";
class style;
var price;
freq n;
run;

Notice that in the initial ordering sequence STYLE is listed in ascending alphabetical order. In the desired ordering sequence, STYLE is ordered by the descending mean values of PRICE. Obviously PROC SQL is the better approach here, because it uses far less code and makes fewer passes through the data.

Ordering by a Statistic When Multiple Class Variables Are Used When using multiple class variables, ordering by statistics becomes complicated. Your approach depends on the layout of the report. Dummy data sets and mapping choices are different for list style output than for crosstabulations. For the purpose of this article, a simple example is discussed. More complex ordering schemes are discussed in Part 3 of this series. The following example presents the default ordering sequence:

/* OUTPUT 11 */ proc tabulate data=sasuser.houses noseps format=7.;

class style baths;
var price;
table style, (baths all)*price*mean; run;

The goal is to order STYLE by the descending values of each mean in the ALL column while maintaining the numeric order for BATHS. The two examples demonstrate using dummy data and then mapping.

DEFAULT                             DESIRED
ORDERING SEQUENCE                   ORDERING SEQUENCE
Analysis Variable : PRICE           Analysis Variable : PRICE
STYLE     N Obs          Mean       STYLE     N Obs          Mean
-----------------------------       -----------------------------
CONDO         4         99313       CONDO         4         99313
RANCH         4         68575       TWOSTORY      4         83825
SPLIT         3         77983       SPLIT         3         77983
TWOSTORY      4         83825       RANCH         4         68575
-----------------------------       -----------------------------
Output 10a                           Output 10b         Output 10a-10b  Result of Ordering by

Statistic Using ORDER=DATA

Dummy Data Solution for Ordering by a Statistic The following example demonstrates using the dummy data method to order by a statistic:

/* calculate the mean of price for each value of style */ proc summary data=sasuser.houses nway;

class style;
var price;
output out=temp1 mean=;

/* sort by descending mean values of price */ proc sort data=temp1;

by descending price;

/* generate ascending values for bedrooms */ proc freq data=sasuser.houses;

tables baths / out=temp2 noprint;

/* generate the dummy data set */ data dummy;

merge temp1(in=a) temp2(in=b);
one=1;
if not a then set temp1 point=one;
if not b then set temp2 point=one;
drop price;

/* concatenate dummy with sasuser.houses */ data houses;

set dummy sasuser.houses;
keep style baths price;

/* OUTPUT 13 */ proc tabulate data=houses noseps format=7. order=data;

class style baths;
var price;
table style, (baths all)*price*mean; run;

The problem of multiple class variables is addressed by generating two temporary data sets and merging them together. The WORK.DUMMY data set can have no missing values because the observations would be ignored by PROC TABULATE. To prevent this, the IN= variables A and B are used to indicate when a temporary data set has been exhausted of observations and additional values are needed to fill in. Since the fill in value is unimportant, the first observation from the exhausted data set is used. The following is a partial listing of WORK.HOUSES:

STYLE       BATHS     PRICE
CONDO        1.0          .
TWOSTORY     1.5          .
SPLIT        2.0          .
RANCH        2.5          .
CONDO        3.0          .
RANCH        1.0      64000
SPLIT        1.0      65850
CONDO        1.5      80050
TWOSTORY     3.0     107250
RANCH        3.0      86650

Notice that the first five observations have missing values for PRICE. The observations are originally from the WORK.DUMMY data set. There are five observations because BATHS has five unique values. Since there are only four unique values for STYLE, the first value of STYLE is used to fill in for the fifth BATHS observation. Notice that the column for ALL lists the mean values in descending order. The result is obtained because the DUMMY data set has STYLE and BATHS in the desired order to be used with the ORDER=DATA option.

              |                 BATHS                 |       |
              |---------------------------------------|       |
              |   1   |  1.5  |   2   |  2.5  |   3   |  ALL  |
              |-------+-------+-------+-------+-------+-------|
              | PRICE | PRICE | PRICE | PRICE | PRICE | PRICE |
              |-------+-------+-------+-------+-------+-------|
              | MEAN  | MEAN  | MEAN  | MEAN  | MEAN  | MEAN  |
--------------+-------+-------+-------+-------+-------+-------|
STYLE         |       |       |       |       |       |       |
CONDO         |      .|  80050| 110700| 103250|      .|  99313|
RANCH         |  49275|      .|      .|      .|  87875|  68575|
SPLIT         |  65850|  73650|      .|      .|  94450|  77983|
TWOSTORY      |  62550|      .|      .| 102950| 107250|  83825|

Output 11       Default Ordering Sequence for Statistic Ordering

Mapping Solution for Ordering by a Statistic The mapping method begins with summarizing the values and calculating the statistics just as the dummy data method does. Once the input data set is summarized and the statistics are calculated and stored in an output data set, the output data set can then be transformed into a control data set for mapping. The following example demonstrates:

/* calculate the mean statistic for each style */ proc summary data=sasuser.houses nway;

class style;
var price;
output out=key mean=;

/* sort by descending means of price */ proc sort data=key;

by descending price;

/* build control data set */ data control;

set key;
fmtname='STYLEF';
start =style;
label =put(_n_,8.);
type='I';
output;
start =put(_n_,8.);
label =style;
type='N';
output;

/*separate the informat from the format by sorting */ proc sort data=control;

by type;

/* generate format and informat using control data set */ proc format cntlin=control;

/* map the data values of master data set */ data houses;

set sasuser.houses;
stylenum=input(style,stylef.);
drop style;
rename stylenum=style;

/* OUTPUT 12 */ proc tabulate data=houses noseps format=7.;

class style baths;
var price;
table style, (baths all)*price*mean; format style stylef.;
run;

              |                 BATHS                 |       |
              |---------------------------------------|       |
              |   1   |  1.5  |   2   |  2.5  |   3   |  ALL  |
              |-------+-------+-------+-------+-------+-------|
              | PRICE | PRICE | PRICE | PRICE | PRICE | PRICE |
              |-------+-------+-------+-------+-------+-------|
              | MEAN  | MEAN  | MEAN  | MEAN  | MEAN  | MEAN  |
--------------+-------+-------+-------+-------+-------+-------|
STYLE         |       |       |       |       |       |       |
CONDO         |      .|  80050| 110700| 103250|      .|  99313|
TWOSTORY      |  62550|      .|      .| 102950| 107250|  83825|
SPLIT         |  65850|  73650|      .|      .|  94450|  77983|
RANCH         |  49275|      .|      .|      .|  87875|  68575|

Output 12       Result of Ordering by a Statistic

For this example, the mapping solution requires slightly more code than the dummy data approach. However, the mapping solution is less complicated because only one class variable needs to be mapped. Mapping STYLE produces the same result as seen in Output 12. Thus, dummy data and mapping are both suitable methods.

Conclusion
Controlling the order of data values can bring additional meaning to your reports. Understanding the default ordering schemes and available options can help you to control the ordering. However, there will be situations where the procedure does not provide an ordering option that meets your needs. In these cases, you can use the ordering techniques discussed here to manipulate the data or create formats to achieve the order
that you want. Dummy data and mapping techniques were introduced in this article to provide solutions to many of these problems. However, there are complex situations that dummy data
or mapping alone can not handle. In the next issue of Observations, the final part of this series will introduce the double mapping method and combining techniques to handle these more complicated situations.

SAS, SAS/ASSIST, and Observations are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ¨ indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

Appendix
The following OPTIONS and TITLE statements are used with all examples in this article that do not explicitly contain other settings:

options nodate nonumber nolabel ls=64;
title;

The following is the SASUSER.HOUSES data set, which is used to supply the data for most examples in this article:

STYLE    SQFEET BEDROOMS BATHS STREET               PRICE
RANCH     1250      2     1.0  Sheppard Avenue    $64,000
SPLIT     1190      1     1.0  Rand Street        $65,850
CONDO     1400      2     1.5  Market Street      $80,050
TWOSTORY  1810      4     3.0  Garris Street     $107,250
RANCH     1500      3     3.0  Kemble Avenue      $86,650
SPLIT     1615      4     3.0  West Drive         $94,450
SPLIT     1305      3     1.5  Graham Avenue      $73,650
CONDO     1390      3     2.5  Hampshire Avenue   $79,350
TWOSTORY  1040      2     1.0  Sanders Road       $55,850
CONDO     2105      4     2.5  Jeans Avenue      $127,150
RANCH     1535      3     3.0  State Highway      $89,100
TWOSTORY  1240      2     1.0  Fairbanks Circle   $69,250
RANCH      720      1     1.0  Nicholson Drive    $34,550
TWOSTORY  1745      4     2.5  Highland Road     $102,950
CONDO     1860      2     2.0  Arcata Avenue     $110,700>

Resources