Previous Page | Next Page

Conditionally Processing Observations from Multiple SAS Data Sets

Performing a Calculation Based on the Last Observation


Understanding When the Last Observation Is Processed

Many applications require that you determine when the DATA step processes the last observation in the input data set. For example, you might want to perform calculations only on the last observation in a data set, or you might want to write an observation only after the last observation has been processed. For this purpose, you can use the END= option for the SET, MERGE, MODIFY, or UPDATE statement. The syntax for this option is:

SET SAS-data-set-list END=variable;

The END= option defines a temporary variable whose value is 1 when the DATA step is processing the last observation. At all other times, the value of variable is 0. Although the DATA step can use the END= variable, SAS does not add it to the resulting data set.

Note:    Using More Than One Observation in a Calculation explains how to use the END= option in the SET statement with a single data set. The END= option works the same way with multiple data sets, but it is important to note that END= is set to 1 only when the last observation from all input data sets is being processed.  [cautionend]


The Program

This example uses the data in SOUTHAMERICAN and EUROPEAN to calculate how many years a team from each continent won the World Cup from 1954 to 1998.

To perform this calculation, this program must perform the following tasks:

  1. identify on which continent a country is located.

  2. keep a running total of how many times a team from each continent won the World Cup.

  3. after processing all observations, multiply the final total for each continent by 4 (the length of time between World Cups) to determine the length of time each continent has been a World Cup champion.

  4. write only the final observation to the output data set. The variables that contain the totals do not contain the final total until the last observation is processed.

The following DATA step calculates the running totals and produces the output data set that contains only those totals.

data timespan (keep=YearsSouthAmerican keep=YearsEuropean);4 
   set southamerican (in=S) european  end=LastYear;1 3 
   by Year;
   if result='won' then     
      do;
         if S then SouthAmericanWins+1;2 
         else EuropeanWins+1;2 
      end;
   if lastyear then3 
      do;
         YearsSouthAmerican=SouthAmericanWins*4;
         YearsEuropean=EuropeanWins*4;
         output;4 
      end;

proc print data=timespan;
   title 'Total Years as Reigning World Cup Champions';
   title2 'from 1954 to 1998';
run;

The following list corresponds to the numbered items in the preceding program:

[1] The END= option creates the temporary variable LastYear. The value of LastYear is 0 until the DATA step begins processing the last observation. At that point, the value of LastYear is set to 1.

[2] Two new variables, SouthAmericanWins and EuropeanWins, keep a running total of the number of victories each continent achieves. For each observation in which the value of the variable Result is won, a different sum statement executes, based on the data set that the observation came from:

SouthAmericanWins+1;

or

EuropeanWins+1;

[3] When the DATA step begins processing the last observation, the value of LASTYEAR changes from 0 to 1. When this change occurs, the conditional statement IF LastYear becomes true, and the statements that follow it are executed. The assignment statement multiplies the total number of victories for each continent by 4 and assigns the result to the appropriate variable, YearsSouthAmerican or YearsEuropean.

[4] The OUTPUT statement writes the observation to the newly created data set. Remember that the DATA step automatically writes an observation at the end of each iteration. However, the OUTPUT statement turns off this automatic feature. The DATA step writes only the last observation to TIMESPAN. When the DATA step writes the observation from the program data vector to the output data set, it writes only two variables, YearsSouthAmerican and YearsEuropean, as directed by the KEEP= data set option in the DATA statement.

Using the END= Option to Perform a Calculation Based on the Last Observation in the Data Sets

                  Total Years as Reigning World Cup Champions                  3
                               from 1954 to 1998

                                   Years
                                   South       Years
                          Obs    American    European

                           1        24          24   

Previous Page | Next Page | Top of Page