Merging SAS Data Sets |
Comparing Match-Merge Methods |
Use one-to-one merging when you want to combine one observation from each data set, but it is not important to match observations. For example, when merging an observation that contains a student's name, year, and major with an observation that contains a date, time, and location for a conference, it does not matter which student gets which time slot; therefore, a one-to-one merge is appropriate.
In cases where you must merge certain observations, use a match-merge. For example, when merging employee information from two different data sets, it is crucial that you merge observations that relate to the same employee. Therefore, you must use a match-merge.
Sometimes you might want to merge by a particular variable, but your data is arranged in such a way that you can see that a one-to-one merge will work. The next example illustrates a case when you could use a one-to-one merge for matching observations because you are certain that your data is ordered correctly. However, as a subsequent example shows, it is risky to use a one-to-one merge in such situations.
Input SAS Data Set for Examples |
Consider the data set COMPANY2. Each observation in this data set corresponds to an observation with the same value of Name in FINANCE. The program that follows creates and displays COMPANY2; it also displays FINANCE for comparison.
data company2; input name $ 1-25 age 27-28 gender $ 30; datalines; Benito, Gisela 32 F Gunter, Thomas 27 M Harbinger, Nicholas 36 M Phillipon, Marie-Odile 28 F Rudelich, Herbert 39 M Sirignano, Emily 12 F Vincent, Martina 34 F ;
proc print data=company2; title 'Little Theater Company Roster'; run; proc print data=finance; title 'Little Theater Employee Information'; run;
The following output displays the two data sets:
The COMPANY2 and FINANCE Data Sets
Little Theater Company Roster 1 Obs name age gender 1 Benito, Gisela 32 F 2 Gunter, Thomas 27 M 3 Harbinger, Nicholas 36 M 4 Phillipon, Marie-Odile 28 F 5 Rudelich, Herbert 39 M 6 Sirignano, Emily 12 F 7 Vincent, Martina 34 F
Little Theater Employee Information 2 Obs IdNumber Name Salary 1 228-88-9649 Benito, Gisela 28000 2 929-75-0218 Gunter, Thomas 27500 3 446-93-2122 Harbinger, Nicholas 33900 4 776-84-5391 Phillipon, Marie-Odile 29750 5 029-46-9261 Rudelich, Herbert 35000 6 442-21-8075 Sirignano, Emily 5000 7 074-53-9892 Vincent, Martina 35000
When to Use a One-to-One Merge |
The following program shows that because both data sets are sorted by NAME and because each observation in one data set has a corresponding observation in the other data set, a one-to-one merge has the same result as merging by Name.
/* One-to-one merge */ data one_to_one; merge company2 finance; run; proc print data=one_to_one; title 'Using a One-to-One Merge to Combine'; title2 'COMPANY2 and FINANCE'; run; /* Match-merge */ data match; merge company2 finance; by name; run; proc print data=match; title 'Using a Match-Merge to Combine'; title2 'COMPANY2 and FINANCE'; run;
The following output displays the results of the two merges. You can see that they are identical.
Comparing a One-to-One Merge with a Match-Merge When Observations Correspond
Using a One-to-One Merge to Combine 1 COMPANY2 and FINANCE Obs name age gender IdNumber Salary 1 Benito, Gisela 32 F 228-88-9649 28000 2 Gunter, Thomas 27 M 929-75-0218 27500 3 Harbinger, Nicholas 36 M 446-93-2122 33900 4 Phillipon, Marie-Odile 28 F 776-84-5391 29750 5 Rudelich, Herbert 39 M 029-46-9261 35000 6 Sirignano, Emily 12 F 442-21-8075 5000 7 Vincent, Martina 34 F 074-53-9892 35000
Using a Match-Merge to Combine 2 COMPANY2 and FINANCE Obs name age gender IdNumber Salary 1 Benito, Gisela 32 F 228-88-9649 28000 2 Gunter, Thomas 27 M 929-75-0218 27500 3 Harbinger, Nicholas 36 M 446-93-2122 33900 4 Phillipon, Marie-Odile 28 F 776-84-5391 29750 5 Rudelich, Herbert 39 M 029-46-9261 35000 6 Sirignano, Emily 12 F 442-21-8075 5000 7 Vincent, Martina 34 F 074-53-9892 35000
Even though the resulting data sets are identical, it is not wise to use a one-to-one merge when it is essential to merge a particular observation from one data set with a particular observation from another data set.
When to Use a Match-Merge |
In the previous example, you can easily determine that the data sets contain the same values for Name and that the values appear in the same order. However, if the data sets contained hundreds of observations, then it would be difficult to ascertain that all the values match. If the observations do not match, then serious problems can occur. The next example illustrates why you should not use a one-to-one merge for matching observations.
Consider the original data set, COMPANY, which contains an observation for Michael Morrison (see The COMPANY and FINANCE Data Sets). FINANCE has no corresponding observation. If a programmer did not realize this fact and tried to use the following program to perform a one-to-one merge with FINANCE, then several problems could appear.
data badmerge; merge company finance; run; proc print data=badmerge; title 'Using a One-to-One Merge Instead of a Match-Merge'; run;
The following output shows the potential problems:
One-to-One Merge with Unequal Numbers of Observations in Each Data Set
Using a One-to-One Merge Instead of a Match-Merge 1 Obs Name Age Gender IdNumber Salary 1 Benito, Gisela 32 F 228-88-9649 28000 2 Gunter, Thomas 27 M 929-75-0218 27500 3 Harbinger, Nicholas 36 M 446-93-2122 33900 4 Phillipon, Marie-Odile 32 M 776-84-5391 29750 5 Rudelich, Herbert 28 F 029-46-9261 35000 6 Sirignano, Emily 39 M 442-21-8075 5000 7 Vincent, Martina 12 F 074-53-9892 35000 8 Vincent, Martina 34 F .
The first three observations merge correctly. However, FINANCE does not have an observation for Michael Morrison. A one-to-one merge makes no attempt to match parts of the observations from the different data sets. It simply combines observations based on their positions in the data sets that you name in the MERGE statement. Therefore, the fourth observation in BADMERGE combines the fourth observation in COMPANY (Michael's name, age, and gender) with the fourth observation in FINANCE (Marie-Odile's name, employee ID number, and salary). As SAS combines the observations, Marie-Odile's name overwrites Michael's. After writing this observation to the new data set, SAS processes the next observation in each data set. These observations are similarly mismatched.
This type of mismatch continues until the seventh observation when the MERGE statement exhausts the observations in the smaller data set, FINANCE. After writing the seventh observation to the new data set, SAS begins the next iteration of the DATA step. Because SAS has read all observations in FINANCE, it sets the values for variables from that data set to missing in the program data vector. Then it reads the values for Name, Age, and Gender from COMPANY and writes the contents of the program data vector to the new data set. Therefore, the last observation has the same value for NAME as the previous observation and contains missing values for IdNumber and Salary.
These missing values and the duplication of the value for Name might make you suspect that the observations did not merge as you intended them to. However, if instead of being an additional observation, the observation for Michael Morrison replaced another observation in COMPANY2, then no observations would have missing values, and the problem would not be as easy to spot. Therefore, you are safer using a match-merge in situations that call for it even if you think the data is arranged so that a one-to-one merge will have the same results.
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.