BY-Group Processing in the DATA Step |
Overview |
The most common use of BY-group processing in the DATA step is to combine two or more SAS data sets using a BY statement with a SET, MERGE, MODIFY, or UPDATE statement. (If you use a SET, MERGE, or UPDATE statement with the BY statement, your observations must be grouped or ordered.) When processing these statements, SAS reads one observation at a time into the program data vector. With BY-group processing, SAS selects the observations from the data sets according to the values of the BY variable or variables. After processing all the observations from one BY group, SAS expects the next observation to be from the next BY group.
The BY statement modifies the action of the SET, MERGE, MODIFY, or UPDATE statement by controlling when the values in the program data vector are set to missing. During BY-group processing, SAS retains the values of variables until it has copied the last observation it finds for that BY group in any of the data sets. Without the BY statement, the SET statement sets variables to missing when it reads the last observation from any data set, and the MERGE statement does not set variables to missing after the DATA step starts reading observations into the program data vector.
Processing BY-Groups Conditionally |
You can process observations conditionally by using the subsetting IF or IF-THEN statements, or the SELECT statement, with the temporary variables FIRST.variable and LAST.variable (set up during BY-group processing). For example, you can use them to perform calculations for each BY group and to write an observation when the first or the last observation of a BY group has been read into the program data vector.
The following example computes annual payroll by department. It uses IF-THEN statements and the values of FIRST.variable and LAST.variable automatic variables to reset the value of PAYROLL to 0 at the beginning of each BY group and to write an observation after the last observation in a BY group is processed.
options pageno=1 nodate linesize=80 pagesize=60; data salaries; input Department $ Name $ WageCategory $ WageRate; datalines; BAD Carol Salaried 20000 BAD Elizabeth Salaried 5000 BAD Linda Salaried 7000 BAD Thomas Salaried 9000 BAD Lynne Hourly 230 DDG Jason Hourly 200 DDG Paul Salaried 4000 PPD Kevin Salaried 5500 PPD Amber Hourly 150 PPD Tina Salaried 13000 STD Helen Hourly 200 STD Jim Salaried 8000 ; proc print data=salaries; run;
proc sort data=salaries out=temp; by Department; run; data budget (keep=Department Payroll); set temp; by Department; if WageCategory='Salaried' then YearlyWage=WageRate*12; else if WageCategory='Hourly' then YearlyWage=WageRate*2000; /* SAS sets FIRST.variable to 1 if this is a new */ /* department in the BY group. */ if first.Department then Payroll=0; Payroll+YearlyWage; /* SAS sets LAST.variable to 1 if this is the last */ /* department in the current BY group. */ if last.Department; run; proc print data=budget; format Payroll dollar10.; title 'Annual Payroll by Department'; run;
Output from Conditional BY-Group Processing
Annual Payroll by Department 1 Obs Department Payroll 1 BAD $952,000 2 DDG $448,000 3 PPD $522,000 4 STD $496,000
Data Not in Alphabetic or Numeric Order |
In BY-group processing, you can use data that is arranged in an order other than alphabetic or numeric, such as by calendar month or by category. To do this, use the NOTSORTED option in a BY statement when you use a SET statement. The NOTSORTED option in the BY statement tells SAS that the data is not in alphabetic or numeric order, but that it is arranged in groups by the values of the BY variable. You cannot use the NOTSORTED option with the MERGE statement, the UPDATE statement, or when the SET statement lists more than one data set.
This example assumes that the data is grouped by the character variable MONTH. The subsetting IF statement conditionally writes an observation, based on the value of LAST.month. This DATA step writes an observation only after processing the last observation in each BY group.
data total_sale(drop=sales); set region.sales by month notsorted; total+sales; if last.month; run;
Data Grouped by Formatted Values |
Use the GROUPFORMAT option in the BY statement to ensure that
formatted values are used to group observations when a FORMAT statement and a BY statement are used together in a DATA step
the FIRST.variable and LAST.variable are assigned by the formatted values of the variable
The GROUPFORMAT option is valid only in the DATA step that creates the SAS data set. It is particularly useful with user-defined formats. The following example illustrates the use of the GROUPFORMAT option.
proc format; value range low -55 = 'Under 55' 55-60 = '55 to 60' 60-65 = '60 to 65' 65-70 = '65 to 70' other = 'Over 70'; run; proc sort data=class out=sorted_class; by height; run; data _null_; format height range.; set sorted_class; by height groupformat; if first.height then put 'Shortest in ' height 'measures ' height:best12.; run;
SAS writes the following output to the log:
Shortest in Under 55 measures 51.3 Shortest in 55 to 60 measures 56.3 Shortest in 60 to 65 measures 62.5 Shortest in 65 to 70 measures 65.3 Shortest in Over 70 measures 72
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.