Working with Grouped or Sorted Observations |
Understanding the Basics of Grouping Data |
The basic method for grouping data is to use a BY statement:
BY list-of-variables; |
The BY statement can be used in a DATA step with a SET, MERGE, MODIFY, or UPDATE statement, or it can be used in SAS procedures.
To work with grouped data using the SET, MERGE, MODIFY, or UPDATE statements, the data must meet these conditions:
The observations must be in a SAS data set, not an external file.
The variables that define the groups must appear in the BY statement.
All observations in the input data set must be in ascending or descending numeric or character order, or grouped in some way, such as by calendar month or a formatted value, according to the variables that will be specified in the BY statement.
Note: If you use the MODIFY statement, the input data does not need to be in any order. However, ordering the data can improve performance.
If the third condition is not met, the data are in a SAS data set but are not arranged in the groups you want, you can order the data using the SORT procedure (discussed in the next section).
Once the SAS data set is arranged in some order, you can use the BY statement to group values of one or more common variables.
Grouping Observations with the SORT Procedure |
All observations in the input data set must be in a particular order. To meet this condition, the observations in MYLIB.ARCH_OR_SCEN can be ordered by the values of TourType, architecture and scenery :
proc sort data=mylib.arch_or_scen out=tourorder; by TourType; run;
The SORT procedure sorts the data set MYLIB.ARCH_OR_SCEN alphabetically according to the values of TourType. The sorted observations go into a new data set specified by the OUT= option. In this example, TOURORDER is the sorted data set. If the OUT= option is omitted, the sorted version of the data set replaces the data set MYLIB.ARCH_OR_SCEN.
The SORT procedure does not produce output other than the sorted data set. A message in the SAS log says that the SORT procedure was executed:
Message That the SORT Procedure Has Executed Successfully
2 proc sort data=mylib.arch_or_scen out=tourorder; 3 by TourType; 4 run; NOTE: There were 8 observations read from the data set MYLIB.ARCH_OR_SCEN. NOTE: The data set WORK.TOURORDER has 8 observations and 5 variables. NOTE: PROCEDURE SORT used:
To see the sorted data set, add a PROC PRINT step to the program:
options pagesize=60 linesize=80 pageno=1 nodate; proc sort data=mylib.arch_or_scen out=tourorder; by TourType; run; proc print data=tourorder; var TourType Country Nights LandCost Vendor; title 'Tours Sorted by Architecture or Scenic Tours'; run;
The following output displays the results:
Tours Sorted by Architecture or Scenic Tours 1 Land Obs TourType Country Nights Cost Vendor 1 architecture Spain 10 510 World 2 architecture Japan 8 720 Express 3 architecture France 8 575 World 4 architecture Italy 8 468 Express 5 scenery Switzerland 9 734 World 6 scenery Ireland 7 558 Express 7 scenery New Zealand 16 1489 Southsea 8 scenery Greece 12 698 Express
By default, SAS arranges groups in ascending order of the BY values, smallest to largest. Sorting a data set does not change the order of the variables within it. However, most examples in this section use a VAR statement in the PRINT procedure to display the BY variable in the first column. (The PRINT procedure and other procedures used in this documentation can also produce a separate report for each BY group.)
Grouping by More Than One Variable |
You can group observations by as many variables as you want. This example groups observations by TourType, Vendor, and LandCost:
options pagesize=60 linesize=80 pageno=1 nodate; proc sort data=mylib.arch_or_scen out=tourorder2; by TourType Vendor Landcost; run; proc print data=tourorder2; var TourType Vendor Landcost Country Nights; title 'Tours Grouped by Type of Tour, Vendor, and Price'; run;
The following output displays the results:
Tours Grouped by Type of Tour, Vendor, and Price 1 Land Obs TourType Vendor Cost Country Nights 1 architecture Express 468 Italy 8 2 architecture Express 720 Japan 8 3 architecture World 510 Spain 10 4 architecture World 575 France 8 5 scenery Express 558 Ireland 7 6 scenery Express 698 Greece 12 7 scenery Southsea 1489 New Zealand 16 8 scenery World 734 Switzerland 9
As this example shows, SAS groups the observations by the first variable that is named within those groups, by the second variable named; and so on. The groups defined by all variables contain only one observation each. In this example, no two variables have the same values for all observations. In other words, this example does not have any duplicate entries.
Arranging Groups in Descending Order |
In the data sets that are grouped by TourType, the group for architecture comes before the group for scenery because architecture begins with an "a"; "a" is smaller than "s" in computer processing. (The order of characters, known as their collating sequence, is discussed later in this section.) To produce a descending order for a particular variable, place the DESCENDING option before the name of the variable in the BY statement of the SORT procedure. In the next example, the observations are grouped in descending order by TourType, but in ascending order by Vendor and LandCost:
options pagesize=60 linesize=80 pageno=1 nodate; proc sort data=mylib.arch_or_scen out=tourorder3; by descending TourType Vendor LandCost; run; proc print data=tourorder3; var TourType Vendor LandCost Country Nights; title 'Descending Order of TourType'; run;
The following output displays the results:
Combining Descending and Ascending Sorted Observations
Descending Order of TourType 1 Land Obs TourType Vendor Cost Country Nights 1 scenery Express 558 Ireland 7 2 scenery Express 698 Greece 12 3 scenery Southsea 1489 New Zealand 16 4 scenery World 734 Switzerland 9 5 architecture Express 468 Italy 8 6 architecture Express 720 Japan 8 7 architecture World 510 Spain 10 8 architecture World 575 France 8
Finding the First or Last Observation in a Group |
If you do not want to display the entire data set, how can you create a data set containing only the least expensive tour that features architecture, and the least expensive tour that features scenery?
First, sort the data set by TourType and LandCost:
options pagesize=60 linesize=80 pageno=1 nodate; proc sort data=mylib.arch_or_scen out=tourorder4; by TourType LandCost; run; proc print data=tourorder4; var TourType LandCost Country Nights Vendor; title 'Tours Arranged by TourType and LandCost'; run;
The following output displays the results:
Sorting to Find the Least Expensive Tours
Tours Arranged by TourType and LandCost 1 Land Obs TourType Cost Country Nights Vendor 1 architecture 468 Italy 8 Express 2 architecture 510 Spain 10 World 3 architecture 575 France 8 World 4 architecture 720 Japan 8 Express 5 scenery 558 Ireland 7 Express 6 scenery 698 Greece 12 Express 7 scenery 734 Switzerland 9 World 8 scenery 1489 New Zealand 16 Southsea
You sorted LandCost in ascending order, so the first observation in each value of TourType has the lowest value of LandCost. If you can locate the first observation in each BY group in a DATA step, you can use a subsetting IF statement to select that observation. But how can you locate the first observation with each value of TourType?
When you use a BY statement in a DATA step, SAS automatically creates two additional variables for each variable in the BY statement. One is named FIRST.variable, where variable is the name of the BY variable, and the other is named LAST.variable. Their values are either 1 or 0. They exist in the program data vector and are available for DATA step programming, but SAS does not add them to the SAS data set being created. For example, the DATA step begins with these statements:
data lowcost; set tourorder4; by TourType; ...more SAS statements... run;
The BY statement causes SAS to create one variable called FIRST.TOURTYPE and another variable called LAST.TOURTYPE. When SAS processes the first observation with the value architecture , the value of FIRST.TOURTYPE is 1; in other observations with the value architecture , it is 0. Similarly, when SAS processes the last observation with the value architecture , the value of LAST.TOURTYPE is 1; in other architecture observations, it is 0. The same result occurs in the scenery group with the observations.
SAS does not write FIRST. and LAST. variables to the output data set, so you cannot display their values with the PRINT procedure. Therefore, the simplest method of displaying the values of FIRST. and LAST. variables is to assign their values to other variables. This example assigns the value of FIRST.TOURTYPE to a variable named FirstTour and the value of LAST.TOURTYPE to a variable named LastTour:
options pagesize=60 linesize=80 pageno=1 nodate; data temp; set tourorder4; by TourType; FirstTour = first.TourType; LastTour = last.TourType; run; proc print data=temp; var Country Tourtype FirstTour LastTour; title 'Specifying FIRST.TOURTYPE and LAST.TOURTYPE'; run;
The following output displays the results:
Demonstrating FIRST. and LAST. Values
Specifying FIRST.TOURTYPE and LAST.TOURTYPE 1 First Last Obs Country TourType Tour Tour 1 Italy architecture 1 0 2 Spain architecture 0 0 3 France architecture 0 0 4 Japan architecture 0 1 5 Ireland scenery 1 0 6 Greece scenery 0 0 7 Switzerland scenery 0 0 8 New Zealand scenery 0 1
In this data set, Italy is the first observation with the value architecture ; for that observation, the value of FIRST.TOURTYPE is 1. Italy is not the last observation with the value architecture , so its value of LAST.TOURTYPE is 0. The observations for Spain and France are neither the first nor the last with the value architecture ; both FIRST.TOURTYPE and LAST.TOURTYPE are 0 for them. Japan is the last with the value architecture ; the value of LAST.TOURTYPE is 1. The same rules apply to observations in the scenery group.
Now you're ready to use FIRST.TOURTYPE in a subsetting IF statement. When the data are sorted by TourType and LandCost, selecting the first observation in each type of tour gives you the lowest price of any tour in that category:
options pagesize=60 linesize=80 pageno=1 nodate; proc sort data=mylib.arch_or_scen out=tourorder4; by TourType LandCost; run; data lowcost; set tourorder4; by TourType; if first.TourType; run; proc print data=lowcost; title 'Least Expensive Tour for Each Type of Tour'; run;
The following output displays the results:
Selecting One Observation from Each BY Group
Least Expensive Tour for Each Type of Tour 1 Land Obs Country TourType Nights Cost Vendor 1 Italy architecture 8 468 Express 2 Ireland scenery 7 558 Express
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.