Previous Page | Next Page

Merging SAS Data Sets

Match-Merging


Merging with a BY Statement

Merging with a BY statement enables you to match observations according to the values of the BY variables that you specify. Before you can perform a match-merge, all data sets must be sorted by the variables that you want to use for the merge.

In order to understand match-merging, you must understand three key concepts:

BY variable

is a variable named in a BY statement.

BY value

is the value of a BY variable.

BY group

is the set of all observations with the same value for the BY variable (if there is only one BY variable). If you use more than one variable in a BY statement, then a BY group is the set of observations with a unique combination of values for those variables. In discussions of match-merging, BY groups commonly span more than one data set.


Input SAS Data Set for Examples

For example, the director of a small repertory theater company, the Little Theater, maintains company records in two SAS data sets, COMPANY and FINANCE.

Data Set Variable Description
COMPANY Name player's name
Age player's age
Gender player's gender
FINANCE Name player's name
IdNumber player's employee ID number
Salary player's annual salary

The following program creates, sorts, and displays COMPANY and FINANCE:

data company;
   input Name $ 1-25 Age 27-28 Gender $ 30;
   datalines;
Vincent, Martina          34 F
Phillipon, Marie-Odile    28 F
Gunter, Thomas            27 M
Harbinger, Nicholas       36 M
Benito, Gisela            32 F
Rudelich, Herbert         39 M
Sirignano, Emily          12 F
Morrison, Michael         32 M
;

proc sort data=company;
   by Name;
run;

data finance;
   input IdNumber $ 1-11 Name $ 13-40 Salary;
   datalines;
074-53-9892 Vincent, Martina             35000
776-84-5391 Phillipon, Marie-Odile       29750
929-75-0218 Gunter, Thomas               27500
446-93-2122 Harbinger, Nicholas          33900
228-88-9649 Benito, Gisela               28000
029-46-9261 Rudelich, Herbert            35000
442-21-8075 Sirignano, Emily             5000
;
proc sort data=finance;
   by Name;
run;

proc print data=company;
   title 'Little Theater Company Roster';
run;

proc print data=finance;
   title 'Little Theater Employee Information';
run;

The following output displays the data sets. Notice that the FINANCE data set does not contain an observation for Michael Morrison.

The COMPANY and FINANCE Data Sets

                         Little Theater Company Roster                         1

                 Obs    Name                      Age    Gender

                  1     Benito, Gisela             32      F   
                  2     Gunter, Thomas             27      M   
                  3     Harbinger, Nicholas        36      M   
                  4     Morrison, Michael          32      M   
                  5     Phillipon, Marie-Odile     28      F   
                  6     Rudelich, Herbert          39      M   
                  7     Sirignano, Emily           12      F   
                  8     Vincent, Martina           34      F   
                      Little Theater Employee Information                      2

             Obs     IdNumber      Name                      Salary

              1     228-88-9649    Benito, Gisela             28000
              2     929-75-0218    Gunter, Thomas             27500
              3     446-93-2122    Harbinger, Nicholas        33900
              4     776-84-5391    Phillipon, Marie-Odile     29750
              5     029-46-9261    Rudelich, Herbert          35000
              6     442-21-8075    Sirignano, Emily            5000
              7     074-53-9892    Vincent, Martina           35000

The Program

To avoid having to maintain two separate data sets, the director wants to merge the records for each player from both data sets into a new data set that contains all the variables. The variable that is common to both data sets is Name. Therefore, Name is the appropriate BY variable.

The data sets are already sorted by NAME, so no further sorting is required. The following program merges them by NAME:

data employee_info;
   merge company finance;
   by name;
run;

proc print data=employee_info;
   title 'Little Theater Employee Information';
   title2 '(including personal and financial information)';
run;

The following output displays the merged data set:

Match-Merging

                      Little Theater Employee Information                      1
                 (including personal and financial information)

    Obs    Name                      Age    Gender     IdNumber      Salary

     1     Benito, Gisela             32      F       228-88-9649     28000
     2     Gunter, Thomas             27      M       929-75-0218     27500
     3     Harbinger, Nicholas        36      M       446-93-2122     33900
     4     Morrison, Michael          32      M                           .
     5     Phillipon, Marie-Odile     28      F       776-84-5391     29750
     6     Rudelich, Herbert          39      M       029-46-9261     35000
     7     Sirignano, Emily           12      F       442-21-8075      5000
     8     Vincent, Martina           34      F       074-53-9892     35000

Explanation

The new data set contains one observation for each player in the company. Each observation contains all the variables from both data sets. Notice in particular the fourth observation. The data set FINANCE does not have an observation for Michael Morrison. In this case, the values of the variables that are unique to FINANCE (IdNumber and Salary) are missing.


Match-Merging Data Sets with Multiple Observations in a BY Group


Input SAS Data Set for Examples

The Little Theater has a third data set, REPERTORY, that tracks the casting assignments in each of the season's plays. REPERTORY contains these variables:

Play

is the name of one of the plays in the repertory.

Role

is the name of a character in Play.

IdNumber

is the employee ID number of the player playing Role.

The following program creates and displays REPERTORY:

data repertory;
   input Play $ 1-23 Role $ 25-48 IdNumber $ 50-60;
   datalines;
No Exit                 Estelle                  074-53-9892
No Exit                 Inez                     776-84-5391
No Exit                 Valet                    929-75-0218
No Exit                 Garcin                   446-93-2122
Happy Days              Winnie                   074-53-9892
Happy Days              Willie                   446-93-2122
The Glass Menagerie     Amanda Wingfield         228-88-9649
The Glass Menagerie     Laura Wingfield          776-84-5391
The Glass Menagerie     Tom Wingfield            929-75-0218
The Glass Menagerie     Jim O'Connor             029-46-9261
The Dear Departed       Mrs. Slater              228-88-9649
The Dear Departed       Mrs. Jordan              074-53-9892
The Dear Departed       Henry Slater             029-46-9261
The Dear Departed       Ben Jordan               446-93-2122
The Dear Departed       Victoria Slater          442-21-8075
The Dear Departed       Abel Merryweather        929-75-0218
;

proc print data=repertory;
   title 'Little Theater Season Casting Assignments';
run;

The following output displays the REPERTORY data set:

The REPERTORY Data Set

                   Little Theater Season Casting Assignments                   1

         Obs    Play                   Role                  IdNumber

           1    No Exit                Estelle              074-53-9892
           2    No Exit                Inez                 776-84-5391
           3    No Exit                Valet                929-75-0218
           4    No Exit                Garcin               446-93-2122
           5    Happy Days             Winnie               074-53-9892
           6    Happy Days             Willie               446-93-2122
           7    The Glass Menagerie    Amanda Wingfield     228-88-9649
           8    The Glass Menagerie    Laura Wingfield      776-84-5391
           9    The Glass Menagerie    Tom Wingfield        929-75-0218
          10    The Glass Menagerie    Jim O'Connor         029-46-9261
          11    The Dear Departed      Mrs. Slater          228-88-9649
          12    The Dear Departed      Mrs. Jordan          074-53-9892
          13    The Dear Departed      Henry Slater         029-46-9261
          14    The Dear Departed      Ben Jordan           446-93-2122
          15    The Dear Departed      Victoria Slater      442-21-8075
          16    The Dear Departed      Abel Merryweather    929-75-0218

To maintain confidentiality during preliminary casting, this data set identifies players by employee ID number. However, casting decisions are now final, and the manager wants to replace each employee ID number with the player's name. Of course, it is possible to re-create the data set, entering each player's name instead of the employee ID number in the raw data. However, it is more efficient to make use of the data set FINANCE, which already contains the name and employee ID number of all players (see The COMPANY and FINANCE Data Sets). When the data sets are merged, SAS takes care of adding the players' names to the data set.

Of course, before you can merge the data sets, you must sort them by IdNumber.

proc sort data=finance;
   by IdNumber;
run;

proc sort data=repertory;
   by IdNumber;
run;

proc print data=finance;
   title 'Little Theater Employee Information';
   title2 '(sorted by employee ID number)';
run;

proc print data=repertory;
   title 'Little Theater Season Casting Assignments';
   title2 '(sorted by employee ID number)';
run;

The following output displays the FINANCE and REPERTORY data sets, sorted by IdNumber:

Sorting the FINANCE and REPERTORY Data Sets by IdNumber

                      Little Theater Employee Information                      1
                         (sorted by employee ID number)

             Obs     IdNumber      Name                      Salary

              1     029-46-9261    Rudelich, Herbert          35000
              2     074-53-9892    Vincent, Martina           35000
              3     228-88-9649    Benito, Gisela             28000
              4     442-21-8075    Sirignano, Emily            5000
              5     446-93-2122    Harbinger, Nicholas        33900
              6     776-84-5391    Phillipon, Marie-Odile     29750
              7     929-75-0218    Gunter, Thomas             27500
                   Little Theater Season Casting Assignments                   2
                         (sorted by employee ID number)

         Obs    Play                   Role                  IdNumber

           1    The Glass Menagerie    Jim O'Connor         029-46-9261
           2    The Dear Departed      Henry Slater         029-46-9261
           3    No Exit                Estelle              074-53-9892
           4    Happy Days             Winnie               074-53-9892
           5    The Dear Departed      Mrs. Jordan          074-53-9892
           6    The Glass Menagerie    Amanda Wingfield     228-88-9649
           7    The Dear Departed      Mrs. Slater          228-88-9649
           8    The Dear Departed      Victoria Slater      442-21-8075
           9    No Exit                Garcin               446-93-2122
          10    Happy Days             Willie               446-93-2122
          11    The Dear Departed      Ben Jordan           446-93-2122
          12    No Exit                Inez                 776-84-5391
          13    The Glass Menagerie    Laura Wingfield      776-84-5391
          14    No Exit                Valet                929-75-0218
          15    The Glass Menagerie    Tom Wingfield        929-75-0218
          16    The Dear Departed      Abel Merryweather    929-75-0218

These two data sets contain seven BY groups; that is, among the 23 observations are seven different values for the BY variable, IdNumber. The first BY group has a value of 029-46-9261 for IdNumber. FINANCE has one observation in this BY group; REPERTORY has two. The last BY group has a value of 929-75-0218 for IdNumber. FINANCE has one observation in this BY group; REPERTORY has three.


The Program

The following program merges the data sets FINANCE and REPERTORY and illustrates what happens when a BY group in one data set has more observations in it than the same BY group in the other data set.

The resulting data set contains all variables from both data sets.

options linesize=120;

data repertory_name;
   merge finance repertory;
   by IdNumber;
run;

proc print data=repertory_name;
   title 'Little Theater Season Casting Assignments';
   title2 'with employee financial information';
run;

Note:   The OPTIONS statement extends the line size to 120 so that PROC PRINT can display all variables on one line. Most output in this section is created with line size set to 76 in the OPTIONS statement. An OPTIONS statement appears only in examples using a different line size. When you set the LINESIZE= option, it remains in effect until you reset it or end the SAS session.  [cautionend]

The following output displays the merged data set:

Match-Merge with Multiple Observations in a BY Group

                                       Little Theater Season Casting Assignments                                       1
                                          with employee financial information

           Obs     IdNumber      Name                      Salary    Play                   Role

             1    029-46-9261    Rudelich, Herbert          35000    The Glass Menagerie    Jim O'Connor     
             2    029-46-9261    Rudelich, Herbert          35000    The Dear Departed      Henry Slater     
             3    074-53-9892    Vincent, Martina           35000    No Exit                Estelle          
             4    074-53-9892    Vincent, Martina           35000    Happy Days             Winnie           
             5    074-53-9892    Vincent, Martina           35000    The Dear Departed      Mrs. Jordan      
             6    228-88-9649    Benito, Gisela             28000    The Glass Menagerie    Amanda Wingfield 
             7    228-88-9649    Benito, Gisela             28000    The Dear Departed      Mrs. Slater      
             8    442-21-8075    Sirignano, Emily            5000    The Dear Departed      Victoria Slater  
             9    446-93-2122    Harbinger, Nicholas        33900    No Exit                Garcin           
            10    446-93-2122    Harbinger, Nicholas        33900    Happy Days             Willie           
            11    446-93-2122    Harbinger, Nicholas        33900    The Dear Departed      Ben Jordan       
            12    776-84-5391    Phillipon, Marie-Odile     29750    No Exit                Inez             
            13    776-84-5391    Phillipon, Marie-Odile     29750    The Glass Menagerie    Laura Wingfield  
            14    929-75-0218    Gunter, Thomas             27500    No Exit                Valet            
            15    929-75-0218    Gunter, Thomas             27500    The Glass Menagerie    Tom Wingfield    
            16    929-75-0218    Gunter, Thomas             27500    The Dear Departed      Abel Merryweather

Explanation

Carefully examine the first few observations in the new data set and consider how SAS creates them.

  1. Before executing the DATA step, SAS reads the descriptor portion of the two data sets and creates a program data vector that contains all variables from both data sets:

    • IdNumber, Name, and Salary from FINANCE

    • Play and Role from REPERTORY.

    IdNumber is already in the program data vector because it is in FINANCE. SAS sets the values of all variables to missing, as the following figure illustrates.

    Program Data Vector before Reading from Data Sets

    [Program Data Vector before Reading from Data Sets]

  2. SAS looks at the first BY group in each data set to determine which BY group should appear first. In this case, the first BY group, observations with the value 029-46-9261 for IdNumber, is the same in both data sets.

  3. SAS reads and copies the first observation from FINANCE into the program data vector, as the next figure illustrates.

    Program Data Vector after Reading FINANCE Data Set

    [Program Data Vector after Reading FINANCE Data Set]

  4. SAS reads and copies the first observation from REPERTORY into the program data vector, as the next figure illustrates. If a data set does not have any observations in a BY group, then the program data vector contains missing values for the variables that are unique to that data set.

    Program Data Vector after Reading REPERTORY Data Set

    [Program Data Vector after Reading REPERTORY Data Set]

  5. SAS writes the observation to the new data set and retains the values in the program data vector. (If the program data vector contained variables created by the DATA step, then SAS would set them to missing after writing to the new data set.)

  6. SAS looks for a second observation in the BY group in each data set. REPERTORY has one; FINANCE does not. The MERGE statement reads the second observation in the BY group from REPERTORY. Because FINANCE has only one observation in the BY group, the statement uses the values of Name (Rudelich , Herbert ) and Salary (35000 ) retained in the program data vector for the second observation in the new data set. The next figure illustrates this behavior.

    Program Data Vector with Second Observation in the BY Group

    [Program Data Vector with Second Observation in the BY Group]

  7. SAS writes the observation to the new data set. Neither data set contains any more observations in this BY group. Therefore, as the final figure illustrates, SAS sets all values in the program data vector to missing and begins processing the next BY group. It continues processing observations until it exhausts all observations in both data sets.

    Program Data Vector before New BY Groups

    [Program Data Vector before New BY Groups]


Match-Merging Data Sets with Dropped Variables

Now that casting decisions are final, the director wants to post the casting list, but does not want to include salary or employee ID information. As the next program illustrates, Salary and IdNumber can be eliminated by using the DROP= data set option when creating the new data set.

data newrep (drop=IdNumber);
   merge finance (drop=Salary) repertory;
   by IdNumber;
run;

proc print data=newrep;
   title 'Final Little Theater Season Casting Assignments';
run;

Note:   The difference in placement of the two DROP= data set options is crucial. Dropping IdNumber in the DATA statement means that the variable is available to the MERGE and BY statements (to which it is essential) but that it does not go into the new data set. Dropping Salary in the MERGE statement means that the MERGE statement does not even read this variable, so Salary is unavailable to the program statements. Because the variable Salary is not needed for processing, it is more efficient to prevent it from being read into the PDV in the first place.  [cautionend]

The following output displays the merged data set without the IdNumber and Salary variables:

Match-Merging Data Sets with Dropped Variables

                Final Little Theater Season Casting Assignments                1

   Obs    Name                      Play                   Role

     1    Rudelich, Herbert         The Glass Menagerie    Jim O'Connor     
     2    Rudelich, Herbert         The Dear Departed      Henry Slater     
     3    Vincent, Martina          No Exit                Estelle          
     4    Vincent, Martina          Happy Days             Winnie           
     5    Vincent, Martina          The Dear Departed      Mrs. Jordan      
     6    Benito, Gisela            The Glass Menagerie    Amanda Wingfield 
     7    Benito, Gisela            The Dear Departed      Mrs. Slater      
     8    Sirignano, Emily          The Dear Departed      Victoria Slater  
     9    Harbinger, Nicholas       No Exit                Garcin           
    10    Harbinger, Nicholas       Happy Days             Willie           
    11    Harbinger, Nicholas       The Dear Departed      Ben Jordan       
    12    Phillipon, Marie-Odile    No Exit                Inez             
    13    Phillipon, Marie-Odile    The Glass Menagerie    Laura Wingfield  
    14    Gunter, Thomas            No Exit                Valet            
    15    Gunter, Thomas            The Glass Menagerie    Tom Wingfield    
    16    Gunter, Thomas            The Dear Departed      Abel Merryweather

Match-Merging Data Sets with the Same Variables

You can match-merge data sets that contain the same variables (variables with the same name) by using the RENAME= data set option, just as you would when performing a one-to-one merge (see Performing a One-to-One Merge on Data Sets with the Same Variables).

If you do not use the RENAME= option and a variable exists in more than one data set, then the value of that variable in the last data set read is the value that goes into the new data set.


Match-Merging Data Sets That Lack a Common Variable

You can name any number of data sets in the MERGE statement. However, if you are match-merging the data sets, then you must be sure they all have a common variable and are sorted by that variable. If the data sets do not have a common variable, then you might be able to use another data set that has variables common to the original data sets to merge them.

For instance, consider the data sets that are used in the match-merge examples. The table that follows shows the names of the data sets and the names of the variables in each data set.

Data Set Variables
COMPANY Name, Age, Gender
FINANCE Name, IdNumber, Salary
REPERTORY Play, Role, IdNumber

These data sets do not share a common variable. However, COMPANY and FINANCE share the variable Name. Similarly, FINANCE and REPERTORY share the variable IdNumber. Therefore, as the next program shows, you can merge the data sets into one with two separate DATA steps. As usual, you must sort the data sets by the appropriate BY variable. (REPERTORY is already sorted by IdNumber.)

options linesize=120;
   /* Sort FINANCE and COMPANY by Name */
 proc sort data=finance;
   by Name;
run;

proc sort data=company;
   by Name;
run;

   /* Merge COMPANY and FINANCE into a */
   /* temporary data set.              */
data temp;
   merge company finance;
   by Name;
run;

proc sort data=temp;
   by IdNumber;
run;

   /* Merge the temporary data set with REPERTORY */
data all;
   merge temp repertory;
   by IdNumber;
run;

proc print data=all;
   title 'Little Theater Complete Casting Information';
run;

In order to merge the three data sets, this program

The following output displays the resulting data set, ALL:

Match-Merging Data Sets That Lack a Common Variable

                                       Little Theater Complete Casting Information                                      1

  Obs    Name                      Age    Gender     IdNumber      Salary    Play                   Role

    1    Morrison, Michael          32      M                           .                                            
    2    Rudelich, Herbert          39      M       029-46-9261     35000    The Glass Menagerie    Jim O'Connor     
    3    Rudelich, Herbert          39      M       029-46-9261     35000    The Dear Departed      Henry Slater     
    4    Vincent, Martina           34      F       074-53-9892     35000    No Exit                Estelle          
    5    Vincent, Martina           34      F       074-53-9892     35000    Happy Days             Winnie           
    6    Vincent, Martina           34      F       074-53-9892     35000    The Dear Departed      Mrs. Jordan      
    7    Benito, Gisela             32      F       228-88-9649     28000    The Glass Menagerie    Amanda Wingfield 
    8    Benito, Gisela             32      F       228-88-9649     28000    The Dear Departed      Mrs. Slater      
    9    Sirignano, Emily           12      F       442-21-8075      5000    The Dear Departed      Victoria Slater  
   10    Harbinger, Nicholas        36      M       446-93-2122     33900    No Exit                Garcin           
   11    Harbinger, Nicholas        36      M       446-93-2122     33900    Happy Days             Willie           
   12    Harbinger, Nicholas        36      M       446-93-2122     33900    The Dear Departed      Ben Jordan       
   13    Phillipon, Marie-Odile     28      F       776-84-5391     29750    No Exit                Inez             
   14    Phillipon, Marie-Odile     28      F       776-84-5391     29750    The Glass Menagerie    Laura Wingfield  
   15    Gunter, Thomas             27      M       929-75-0218     27500    No Exit                Valet            
   16    Gunter, Thomas             27      M       929-75-0218     27500    The Glass Menagerie    Tom Wingfield    
   17    Gunter, Thomas             27      M       929-75-0218     27500    The Dear Departed      Abel Merryweather

Previous Page | Next Page | Top of Page