Merging SAS Data Sets

Choosing between One-to-One Merging and Match-Merging

Use one-to-one merging when you want to combine one observation from each data set, but it is not important to match observations. For example, when merging an observation that contains a student's name, year, and major with an observation that contains a date, time, and location for a conference, it does not matter which student gets which time slot; therefore, a one-to-one merge is appropriate.

In cases where you must merge certain observations, use a match-merge. For example, when merging employee information from two different data sets, it is crucial that you merge observations that relate to the same employee. Therefore, you must use a match-merge.

Sometimes you might want to merge by a particular variable, but your data is arranged in such a way that you can see that a one-to-one merge will work. The next example illustrates a case when you could use a one-to-one merge for matching observations because you are certain that your data is ordered correctly. However, as a subsequent example shows, it is risky to use a one-to-one merge in such situations.

Input SAS Data Set for Examples

Consider the data set COMPANY2. Each observation in this data set corresponds to an observation with the same value of Name in FINANCE. The program that follows creates and displays COMPANY2; it also displays FINANCE for comparison.

data company2;
   input name $ 1-25 age 27-28 gender $ 30;
   datalines;
Benito, Gisela            32 F
Gunter, Thomas            27 M
Harbinger, Nicholas       36 M
Phillipon, Marie-Odile    28 F
Rudelich, Herbert         39 M
Sirignano, Emily          12 F
Vincent, Martina          34 F
;

proc print data=company2;
   title 'Little Theater Company Roster';
run;

proc print data=finance;
   title 'Little Theater Employee Information';
run;

The following output displays the two data sets:

The COMPANY2 and FINANCE Data Sets

                         Little Theater Company Roster                         1

                  Obs    name                      age   gender

                   1     Benito, Gisela             32     F 
                   2     Gunter, Thomas             27     M 
                   3     Harbinger, Nicholas        36     M 
                   4     Phillipon, Marie-Odile     28     F 
                   5     Rudelich, Herbert          39     M 
                   6     Sirignano, Emily           12     F 
                   7     Vincent, Martina           34     F

                      Little Theater Employee Information                      2

             Obs     IdNumber      Name                      Salary

              1     228-88-9649    Benito, Gisela             28000
              2     929-75-0218    Gunter, Thomas             27500
              3     446-93-2122    Harbinger, Nicholas        33900
              4     776-84-5391    Phillipon, Marie-Odile     29750
              5     029-46-9261    Rudelich, Herbert          35000
              6     442-21-8075    Sirignano, Emily            5000
              7     074-53-9892    Vincent, Martina           35000

When to Use a One-to-One Merge

The following program shows that because both data sets are sorted by NAME and because each observation in one data set has a corresponding observation in the other data set, a one-to-one merge has the same result as merging by Name.

   /* One-to-one merge */
data one_to_one;
   merge company2 finance;
run;

proc print data=one_to_one;
   title 'Using a One-to-One Merge to Combine';
   title2 'COMPANY2 and FINANCE';
run;

   /* Match-merge */
data match;
   merge company2 finance;
   by name;
run;

proc print data=match;
   title 'Using a Match-Merge to Combine';
   title2 'COMPANY2 and FINANCE';
run;

The following output displays the results of the two merges. You can see that they are identical.

Comparing a One-to-One Merge with a Match-Merge When Observations Correspond

                      Using a One-to-One Merge to Combine                      1
                              COMPANY2 and FINANCE

      Obs    name                      age  gender    IdNumber      Salary

       1     Benito, Gisela             32     F     228-88-9649     28000
       2     Gunter, Thomas             27     M     929-75-0218     27500
       3     Harbinger, Nicholas        36     M     446-93-2122     33900
       4     Phillipon, Marie-Odile     28     F     776-84-5391     29750
       5     Rudelich, Herbert          39     M     029-46-9261     35000
       6     Sirignano, Emily           12     F     442-21-8075      5000
       7     Vincent, Martina           34     F     074-53-9892     35000

                         Using a Match-Merge to Combine                        2
                              COMPANY2 and FINANCE

      Obs    name                      age  gender    IdNumber      Salary

       1     Benito, Gisela             32     F     228-88-9649     28000
       2     Gunter, Thomas             27     M     929-75-0218     27500
       3     Harbinger, Nicholas        36     M     446-93-2122     33900
       4     Phillipon, Marie-Odile     28     F     776-84-5391     29750
       5     Rudelich, Herbert          39     M     029-46-9261     35000
       6     Sirignano, Emily           12     F     442-21-8075      5000
       7     Vincent, Martina           34     F     074-53-9892     35000

Even though the resulting data sets are identical, it is not wise to use a one-to-one merge when it is essential to merge a particular observation from one data set with a particular observation from another data set.

When to Use a Match-Merge

In the previous example, you can easily determine that the data sets contain the same values for Name and that the values appear in the same order. However, if the data sets contained hundreds of observations, then it would be difficult to ascertain that all the values match. If the observations do not match, then serious problems can occur. The next example illustrates why you should not use a one-to-one merge for matching observations.

Consider the original data set, COMPANY, which contains an observation for Michael Morrison (see The COMPANY and FINANCE Data Sets). FINANCE has no corresponding observation. If a programmer did not realize this fact and tried to use the following program to perform a one-to-one merge with FINANCE, then several problems could appear.

data badmerge;
   merge company finance;
run;

proc print data=badmerge;
   title 'Using a One-to-One Merge Instead of a Match-Merge';
run;

The following output shows the potential problems:

One-to-One Merge with Unequal Numbers of Observations in Each Data Set

               Using a One-to-One Merge Instead of a Match-Merge               1

    Obs    Name                      Age    Gender     IdNumber      Salary

     1     Benito, Gisela             32      F       228-88-9649     28000
     2     Gunter, Thomas             27      M       929-75-0218     27500
     3     Harbinger, Nicholas        36      M       446-93-2122     33900
     4     Phillipon, Marie-Odile     32      M       776-84-5391     29750
     5     Rudelich, Herbert          28      F       029-46-9261     35000
     6     Sirignano, Emily           39      M       442-21-8075      5000
     7     Vincent, Martina           12      F       074-53-9892     35000
     8     Vincent, Martina           34      F                           .

The first three observations merge correctly. However, FINANCE does not have an observation for Michael Morrison. A one-to-one merge makes no attempt to match parts of the observations from the different data sets. It simply combines observations based on their positions in the data sets that you name in the MERGE statement. Therefore, the fourth observation in BADMERGE combines the fourth observation in COMPANY (Michael's name, age, and gender) with the fourth observation in FINANCE (Marie-Odile's name, employee ID number, and salary). As SAS combines the observations, Marie-Odile's name overwrites Michael's. After writing this observation to the new data set, SAS processes the next observation in each data set. These observations are similarly mismatched.

This type of mismatch continues until the seventh observation when the MERGE statement exhausts the observations in the smaller data set, FINANCE. After writing the seventh observation to the new data set, SAS begins the next iteration of the DATA step. Because SAS has read all observations in FINANCE, it sets the values for variables from that data set to missing in the program data vector. Then it reads the values for Name, Age, and Gender from COMPANY and writes the contents of the program data vector to the new data set. Therefore, the last observation has the same value for NAME as the previous observation and contains missing values for IdNumber and Salary.

These missing values and the duplication of the value for Name might make you suspect that the observations did not merge as you intended them to. However, if instead of being an additional observation, the observation for Michael Morrison replaced another observation in COMPANY2, then no observations would have missing values, and the problem would not be as easy to spot. Therefore, you are safer using a match-merge in situations that call for it even if you think the data is arranged so that a one-to-one merge will have the same results.

Top of Page