Previous Page | Next Page

Working with Grouped or Sorted Observations

Working with Grouped Data


Understanding the Basics of Grouping Data

The basic method for grouping data is to use a BY statement:

BY list-of-variables;

The BY statement can be used in a DATA step with a SET, MERGE, MODIFY, or UPDATE statement, or it can be used in SAS procedures.

To work with grouped data using the SET, MERGE, MODIFY, or UPDATE statements, the data must meet these conditions:

If the third condition is not met, the data are in a SAS data set but are not arranged in the groups you want, you can order the data using the SORT procedure (discussed in the next section).

Once the SAS data set is arranged in some order, you can use the BY statement to group values of one or more common variables.


Grouping Observations with the SORT Procedure

All observations in the input data set must be in a particular order. To meet this condition, the observations in MYLIB.ARCH_OR_SCEN can be ordered by the values of TourType, architecture and scenery :

proc sort data=mylib.arch_or_scen out=tourorder;
   by TourType;
run;

The SORT procedure sorts the data set MYLIB.ARCH_OR_SCEN alphabetically according to the values of TourType. The sorted observations go into a new data set specified by the OUT= option. In this example, TOURORDER is the sorted data set. If the OUT= option is omitted, the sorted version of the data set replaces the data set MYLIB.ARCH_OR_SCEN.

The SORT procedure does not produce output other than the sorted data set. A message in the SAS log says that the SORT procedure was executed:

Message That the SORT Procedure Has Executed Successfully

2   proc sort data=mylib.arch_or_scen out=tourorder;
3      by TourType;
4   run;
NOTE: There were 8 observations read from the data set MYLIB.ARCH_OR_SCEN.
NOTE: The data set WORK.TOURORDER has 8 observations and 5 variables.
NOTE: PROCEDURE SORT used:
      

To see the sorted data set, add a PROC PRINT step to the program:

options pagesize=60 linesize=80 pageno=1 nodate;
proc sort data=mylib.arch_or_scen out=tourorder;
   by TourType;
run;

proc print data=tourorder;
   var TourType Country Nights LandCost Vendor;
   title 'Tours Sorted by Architecture or Scenic Tours';
run;

The following output displays the results:

Displaying the Sorted Output

                  Tours Sorted by Architecture or Scenic Tours                 1

                                                        Land
        Obs    TourType        Country        Nights    Cost    Vendor

         1     architecture    Spain            10       510    World   
         2     architecture    Japan             8       720    Express 
         3     architecture    France            8       575    World   
         4     architecture    Italy             8       468    Express 
         5     scenery         Switzerland       9       734    World   
         6     scenery         Ireland           7       558    Express 
         7     scenery         New Zealand      16      1489    Southsea
         8     scenery         Greece           12       698    Express 

By default, SAS arranges groups in ascending order of the BY values, smallest to largest. Sorting a data set does not change the order of the variables within it. However, most examples in this section use a VAR statement in the PRINT procedure to display the BY variable in the first column. (The PRINT procedure and other procedures used in this documentation can also produce a separate report for each BY group.)


Grouping by More Than One Variable

You can group observations by as many variables as you want. This example groups observations by TourType, Vendor, and LandCost:

options pagesize=60 linesize=80 pageno=1 nodate;
proc sort data=mylib.arch_or_scen out=tourorder2;
   by TourType Vendor Landcost;
run;

proc print data=tourorder2;
   var TourType Vendor Landcost Country Nights;
   title 'Tours Grouped by Type of Tour, Vendor, and Price';
run;

The following output displays the results:

Grouping by Several Variables

                Tours Grouped by Type of Tour, Vendor, and Price               1

                                           Land
        Obs    TourType        Vendor      Cost    Country        Nights

         1     architecture    Express      468    Italy             8  
         2     architecture    Express      720    Japan             8  
         3     architecture    World        510    Spain            10  
         4     architecture    World        575    France            8  
         5     scenery         Express      558    Ireland           7  
         6     scenery         Express      698    Greece           12  
         7     scenery         Southsea    1489    New Zealand      16  
         8     scenery         World        734    Switzerland       9  

As this example shows, SAS groups the observations by the first variable that is named within those groups, by the second variable named; and so on. The groups defined by all variables contain only one observation each. In this example, no two variables have the same values for all observations. In other words, this example does not have any duplicate entries.


Arranging Groups in Descending Order

In the data sets that are grouped by TourType, the group for architecture comes before the group for scenery because architecture begins with an "a"; "a" is smaller than "s" in computer processing. (The order of characters, known as their collating sequence, is discussed later in this section.) To produce a descending order for a particular variable, place the DESCENDING option before the name of the variable in the BY statement of the SORT procedure. In the next example, the observations are grouped in descending order by TourType, but in ascending order by Vendor and LandCost:

options pagesize=60 linesize=80 pageno=1 nodate;
proc sort data=mylib.arch_or_scen out=tourorder3;
   by descending TourType Vendor LandCost;
run;

proc print data=tourorder3;
   var TourType Vendor LandCost Country Nights;
   title 'Descending Order of TourType';
run;

The following output displays the results:

Combining Descending and Ascending Sorted Observations

                          Descending Order of TourType                         1

                                           Land
        Obs    TourType        Vendor      Cost    Country        Nights

         1     scenery         Express      558    Ireland           7  
         2     scenery         Express      698    Greece           12  
         3     scenery         Southsea    1489    New Zealand      16  
         4     scenery         World        734    Switzerland       9  
         5     architecture    Express      468    Italy             8  
         6     architecture    Express      720    Japan             8  
         7     architecture    World        510    Spain            10  
         8     architecture    World        575    France            8  

Finding the First or Last Observation in a Group

If you do not want to display the entire data set, how can you create a data set containing only the least expensive tour that features architecture, and the least expensive tour that features scenery?

First, sort the data set by TourType and LandCost:

options pagesize=60 linesize=80 pageno=1 nodate;
proc sort data=mylib.arch_or_scen out=tourorder4;
   by TourType LandCost;
run;

proc print data=tourorder4;
   var TourType LandCost Country Nights Vendor;
   title 'Tours Arranged by TourType and LandCost';
run;

The following output displays the results:

Sorting to Find the Least Expensive Tours

                    Tours Arranged by TourType and LandCost                    1

                               Land
        Obs    TourType        Cost    Country        Nights    Vendor

         1     architecture     468    Italy             8      Express 
         2     architecture     510    Spain            10      World   
         3     architecture     575    France            8      World   
         4     architecture     720    Japan             8      Express 
         5     scenery          558    Ireland           7      Express 
         6     scenery          698    Greece           12      Express 
         7     scenery          734    Switzerland       9      World   
         8     scenery         1489    New Zealand      16      Southsea

You sorted LandCost in ascending order, so the first observation in each value of TourType has the lowest value of LandCost. If you can locate the first observation in each BY group in a DATA step, you can use a subsetting IF statement to select that observation. But how can you locate the first observation with each value of TourType?

When you use a BY statement in a DATA step, SAS automatically creates two additional variables for each variable in the BY statement. One is named FIRST.variable, where variable is the name of the BY variable, and the other is named LAST.variable. Their values are either 1 or 0. They exist in the program data vector and are available for DATA step programming, but SAS does not add them to the SAS data set being created. For example, the DATA step begins with these statements:

data lowcost;
   set tourorder4;
   by TourType;
   ...more SAS statements...
run;

The BY statement causes SAS to create one variable called FIRST.TOURTYPE and another variable called LAST.TOURTYPE. When SAS processes the first observation with the value architecture , the value of FIRST.TOURTYPE is 1; in other observations with the value architecture , it is 0. Similarly, when SAS processes the last observation with the value architecture , the value of LAST.TOURTYPE is 1; in other architecture observations, it is 0. The same result occurs in the scenery group with the observations.

SAS does not write FIRST. and LAST. variables to the output data set, so you cannot display their values with the PRINT procedure. Therefore, the simplest method of displaying the values of FIRST. and LAST. variables is to assign their values to other variables. This example assigns the value of FIRST.TOURTYPE to a variable named FirstTour and the value of LAST.TOURTYPE to a variable named LastTour:

options pagesize=60 linesize=80 pageno=1 nodate;
data temp;
   set tourorder4;
   by TourType;
   FirstTour = first.TourType;
   LastTour = last.TourType;
run;
proc print data=temp;
   var Country Tourtype FirstTour LastTour;
   title 'Specifying FIRST.TOURTYPE and LAST.TOURTYPE';
run;

The following output displays the results:

Demonstrating FIRST. and LAST. Values

                  Specifying FIRST.TOURTYPE and LAST.TOURTYPE                  1

                                                    First    Last
              Obs    Country        TourType         Tour    Tour

               1     Italy          architecture      1        0 
               2     Spain          architecture      0        0 
               3     France         architecture      0        0 
               4     Japan          architecture      0        1 
               5     Ireland        scenery           1        0 
               6     Greece         scenery           0        0 
               7     Switzerland    scenery           0        0 
               8     New Zealand    scenery           0        1 

In this data set, Italy is the first observation with the value architecture ; for that observation, the value of FIRST.TOURTYPE is 1. Italy is not the last observation with the value architecture , so its value of LAST.TOURTYPE is 0. The observations for Spain and France are neither the first nor the last with the value architecture ; both FIRST.TOURTYPE and LAST.TOURTYPE are 0 for them. Japan is the last with the value architecture ; the value of LAST.TOURTYPE is 1. The same rules apply to observations in the scenery group.

Now you're ready to use FIRST.TOURTYPE in a subsetting IF statement. When the data are sorted by TourType and LandCost, selecting the first observation in each type of tour gives you the lowest price of any tour in that category:

options pagesize=60 linesize=80 pageno=1 nodate;
proc sort data=mylib.arch_or_scen out=tourorder4;
   by TourType LandCost;
run;

data lowcost;
   set tourorder4;
   by TourType;
   if first.TourType;
run;

proc print data=lowcost;
   title 'Least Expensive Tour for Each Type of Tour';
run;

The following output displays the results:

Selecting One Observation from Each BY Group

                   Least Expensive Tour for Each Type of Tour                  1

                                                      Land
          Obs    Country    TourType        Nights    Cost    Vendor

           1     Italy      architecture       8       468    Express
           2     Ireland    scenery            7       558    Express

Previous Page | Next Page | Top of Page