Previous Page | Next Page

Practical Problem-Solving with PROC SQL

Overlaying Missing Data Values


Problem

You are forming teams for a new league by analyzing the averages of bowlers when they were members of other bowling leagues. When possible you will use each bowler's most recent league average. However, if a bowler was not in a league last year, then you will use the bowler's average from the prior year.


Background Information

There are two tables, LEAGUE1 and LEAGUE2, that contain bowling averages for last year and the prior year respectively. The structure of the tables is not identical because the data was compiled by two different secretaries. However, the tables do contain essentially the same type of data.

Sample Input Tables for Overlaying Missing Values

                         Bowling Averages from League1

                     Fullname              Bowler  AvgScore
                     --------------------------------------
                     Alexander Delarge     4224         164
                     John T Chance         4425           .
                     Jack T Colton         4264           .
                                           1412         141
                     Andrew Shepherd       4189         185
                         Bowling Averages from League2

                  FirstName   LastName         AMFNo  AvgScore
                  --------------------------------------------
                  Alex        Delarge          4224        156
                  Mickey      Raymond          1412          .
                                               4264        174
                  Jack        Chance           4425          .
                  Patrick     O'Malley         4118        164

Solution

The following PROC SQL code combines the information from two tables, LEAGUE1 and LEAGUE2. The program uses all the values from the LEAGUE1 table, if available, and replaces any missing values with the corresponding values from the LEAGUE2 table. The results are shown in the following output.

options nodate nonumber linesize=80 pagesize=60;

proc sql;
   title "Averages from Last Year's League When Possible";
   title2 "Supplemented when Available from Prior Year's League";
   select coalesce(lastyr.fullname,trim(prioryr.firstname)
                   ||' '||prioryr.lastname)as Name format=$26.,
          coalesce(lastyr.bowler,prioryr.amfno)as Bowler,
          coalesce(lastyr.avgscore,prioryr.avgscore)as Average format=8.
      from league1 as lastyr full join league2 as prioryr
           on lastyr.bowler=prioryr.amfno
      order by Bowler;

PROC SQL Output for Overlaying Missing Values

                 Averages from Last Year's League When Possible
              Supplemented when Available from Prior Year's League

                  Name                        Bowler   Average
                  --------------------------------------------
                  Mickey Raymond              1412         141
                  Patrick O'Malley            4118         164
                  Andrew Shepherd             4189         185
                  Alexander Delarge           4224         164
                  Jack T Colton               4264         174
                  John T Chance               4425           .

How It Works

This solution uses a full join to obtain all rows from LEAGUE1 as well as all rows from LEAGUE2. The program uses the COALESCE function on each column so that, whenever possible, there is a value for each column of a row. Using the COALESCE function on a list of expressions that is enclosed in parentheses returns the first nonmissing value that is found. For each row, the following code returns the AvgScore column from LEAGUE1 for Average:

coalesce(lastyr.avgscore,prioryr.avgscore) as Average format=8.  

If this value of AvgScore is missing, then COALESCE returns the AvgScore column from LEAGUE2 for Average. If this value of AvgScore is missing, then COALESCE returns a missing value for Average.

In the case of the Name column, the COALESCE function returns the value of FullName from LEAGUE1 if it exists. If not, then the value is obtained from LEAGUE2 by using both the TRIM function and concatenation operators to combine the first name and last name columns:

trim(prioryr.firstname)||' '||prioryr.lastname

Finally, the table is ordered by Bowler. The Bowler column is the result of the COALESCE function.

coalesce(lastyr.bowler,prioryr.amfno)as Bowler

Because the value is obtained from either table, you cannot confidently order the output by either the value of Bowler in LEAGUE1 or the value of AMFNo in LEAGUE 2, but only by the value that results from the COALESCE function.

Previous Page | Next Page | Top of Page