Merging SAS Data Sets |
Merging with a BY Statement |
Merging with a BY statement enables you to match observations according to the values of the BY variables that you specify. Before you can perform a match-merge, all data sets must be sorted by the variables that you want to use for the merge.
In order to understand match-merging, you must understand three key concepts:
Input SAS Data Set for Examples |
For example, the director of a small repertory theater company, the Little Theater, maintains company records in two SAS data sets, COMPANY and FINANCE.
Data Set | Variable | Description |
COMPANY | Name | player's name |
Age | player's age | |
Gender | player's gender | |
FINANCE | Name | player's name |
IdNumber | player's employee ID number | |
Salary | player's annual salary |
The following program creates, sorts, and displays COMPANY and FINANCE:
data company; input Name $ 1-25 Age 27-28 Gender $ 30; datalines; Vincent, Martina 34 F Phillipon, Marie-Odile 28 F Gunter, Thomas 27 M Harbinger, Nicholas 36 M Benito, Gisela 32 F Rudelich, Herbert 39 M Sirignano, Emily 12 F Morrison, Michael 32 M ; proc sort data=company; by Name; run; data finance; input IdNumber $ 1-11 Name $ 13-40 Salary; datalines; 074-53-9892 Vincent, Martina 35000 776-84-5391 Phillipon, Marie-Odile 29750 929-75-0218 Gunter, Thomas 27500 446-93-2122 Harbinger, Nicholas 33900 228-88-9649 Benito, Gisela 28000 029-46-9261 Rudelich, Herbert 35000 442-21-8075 Sirignano, Emily 5000 ; proc sort data=finance; by Name; run;
proc print data=company; title 'Little Theater Company Roster'; run; proc print data=finance; title 'Little Theater Employee Information'; run;
The following output displays the data sets. Notice that the FINANCE data set does not contain an observation for Michael Morrison.
The COMPANY and FINANCE Data Sets
Little Theater Company Roster 1 Obs Name Age Gender 1 Benito, Gisela 32 F 2 Gunter, Thomas 27 M 3 Harbinger, Nicholas 36 M 4 Morrison, Michael 32 M 5 Phillipon, Marie-Odile 28 F 6 Rudelich, Herbert 39 M 7 Sirignano, Emily 12 F 8 Vincent, Martina 34 F
Little Theater Employee Information 2 Obs IdNumber Name Salary 1 228-88-9649 Benito, Gisela 28000 2 929-75-0218 Gunter, Thomas 27500 3 446-93-2122 Harbinger, Nicholas 33900 4 776-84-5391 Phillipon, Marie-Odile 29750 5 029-46-9261 Rudelich, Herbert 35000 6 442-21-8075 Sirignano, Emily 5000 7 074-53-9892 Vincent, Martina 35000
The Program |
To avoid having to maintain two separate data sets, the director wants to merge the records for each player from both data sets into a new data set that contains all the variables. The variable that is common to both data sets is Name. Therefore, Name is the appropriate BY variable.
The data sets are already sorted by NAME, so no further sorting is required. The following program merges them by NAME:
data employee_info; merge company finance; by name; run; proc print data=employee_info; title 'Little Theater Employee Information'; title2 '(including personal and financial information)'; run;
The following output displays the merged data set:
Little Theater Employee Information 1 (including personal and financial information) Obs Name Age Gender IdNumber Salary 1 Benito, Gisela 32 F 228-88-9649 28000 2 Gunter, Thomas 27 M 929-75-0218 27500 3 Harbinger, Nicholas 36 M 446-93-2122 33900 4 Morrison, Michael 32 M . 5 Phillipon, Marie-Odile 28 F 776-84-5391 29750 6 Rudelich, Herbert 39 M 029-46-9261 35000 7 Sirignano, Emily 12 F 442-21-8075 5000 8 Vincent, Martina 34 F 074-53-9892 35000
Explanation |
The new data set contains one observation for each player in the company. Each observation contains all the variables from both data sets. Notice in particular the fourth observation. The data set FINANCE does not have an observation for Michael Morrison. In this case, the values of the variables that are unique to FINANCE (IdNumber and Salary) are missing.
Match-Merging Data Sets with Multiple Observations in a BY Group |
The Little Theater has a third data set, REPERTORY, that tracks the casting assignments in each of the season's plays. REPERTORY contains these variables:
Play | |
Role | |
IdNumber |
The following program creates and displays REPERTORY:
data repertory; input Play $ 1-23 Role $ 25-48 IdNumber $ 50-60; datalines; No Exit Estelle 074-53-9892 No Exit Inez 776-84-5391 No Exit Valet 929-75-0218 No Exit Garcin 446-93-2122 Happy Days Winnie 074-53-9892 Happy Days Willie 446-93-2122 The Glass Menagerie Amanda Wingfield 228-88-9649 The Glass Menagerie Laura Wingfield 776-84-5391 The Glass Menagerie Tom Wingfield 929-75-0218 The Glass Menagerie Jim O'Connor 029-46-9261 The Dear Departed Mrs. Slater 228-88-9649 The Dear Departed Mrs. Jordan 074-53-9892 The Dear Departed Henry Slater 029-46-9261 The Dear Departed Ben Jordan 446-93-2122 The Dear Departed Victoria Slater 442-21-8075 The Dear Departed Abel Merryweather 929-75-0218 ;
proc print data=repertory; title 'Little Theater Season Casting Assignments'; run;
The following output displays the REPERTORY data set:
Little Theater Season Casting Assignments 1 Obs Play Role IdNumber 1 No Exit Estelle 074-53-9892 2 No Exit Inez 776-84-5391 3 No Exit Valet 929-75-0218 4 No Exit Garcin 446-93-2122 5 Happy Days Winnie 074-53-9892 6 Happy Days Willie 446-93-2122 7 The Glass Menagerie Amanda Wingfield 228-88-9649 8 The Glass Menagerie Laura Wingfield 776-84-5391 9 The Glass Menagerie Tom Wingfield 929-75-0218 10 The Glass Menagerie Jim O'Connor 029-46-9261 11 The Dear Departed Mrs. Slater 228-88-9649 12 The Dear Departed Mrs. Jordan 074-53-9892 13 The Dear Departed Henry Slater 029-46-9261 14 The Dear Departed Ben Jordan 446-93-2122 15 The Dear Departed Victoria Slater 442-21-8075 16 The Dear Departed Abel Merryweather 929-75-0218
To maintain confidentiality during preliminary casting, this data set identifies players by employee ID number. However, casting decisions are now final, and the manager wants to replace each employee ID number with the player's name. Of course, it is possible to re-create the data set, entering each player's name instead of the employee ID number in the raw data. However, it is more efficient to make use of the data set FINANCE, which already contains the name and employee ID number of all players (see The COMPANY and FINANCE Data Sets). When the data sets are merged, SAS takes care of adding the players' names to the data set.
Of course, before you can merge the data sets, you must sort them by IdNumber.
proc sort data=finance; by IdNumber; run; proc sort data=repertory; by IdNumber; run;
proc print data=finance; title 'Little Theater Employee Information'; title2 '(sorted by employee ID number)'; run; proc print data=repertory; title 'Little Theater Season Casting Assignments'; title2 '(sorted by employee ID number)'; run;
The following output displays the FINANCE and REPERTORY data sets, sorted by IdNumber:
Sorting the FINANCE and REPERTORY Data Sets by IdNumber
Little Theater Employee Information 1 (sorted by employee ID number) Obs IdNumber Name Salary 1 029-46-9261 Rudelich, Herbert 35000 2 074-53-9892 Vincent, Martina 35000 3 228-88-9649 Benito, Gisela 28000 4 442-21-8075 Sirignano, Emily 5000 5 446-93-2122 Harbinger, Nicholas 33900 6 776-84-5391 Phillipon, Marie-Odile 29750 7 929-75-0218 Gunter, Thomas 27500
Little Theater Season Casting Assignments 2 (sorted by employee ID number) Obs Play Role IdNumber 1 The Glass Menagerie Jim O'Connor 029-46-9261 2 The Dear Departed Henry Slater 029-46-9261 3 No Exit Estelle 074-53-9892 4 Happy Days Winnie 074-53-9892 5 The Dear Departed Mrs. Jordan 074-53-9892 6 The Glass Menagerie Amanda Wingfield 228-88-9649 7 The Dear Departed Mrs. Slater 228-88-9649 8 The Dear Departed Victoria Slater 442-21-8075 9 No Exit Garcin 446-93-2122 10 Happy Days Willie 446-93-2122 11 The Dear Departed Ben Jordan 446-93-2122 12 No Exit Inez 776-84-5391 13 The Glass Menagerie Laura Wingfield 776-84-5391 14 No Exit Valet 929-75-0218 15 The Glass Menagerie Tom Wingfield 929-75-0218 16 The Dear Departed Abel Merryweather 929-75-0218
These two data sets contain seven BY groups; that is, among the 23 observations are seven different values for the BY variable, IdNumber. The first BY group has a value of 029-46-9261 for IdNumber. FINANCE has one observation in this BY group; REPERTORY has two. The last BY group has a value of 929-75-0218 for IdNumber. FINANCE has one observation in this BY group; REPERTORY has three.
The following program merges the data sets FINANCE and REPERTORY and illustrates what happens when a BY group in one data set has more observations in it than the same BY group in the other data set.
The resulting data set contains all variables from both data sets.
options linesize=120; data repertory_name; merge finance repertory; by IdNumber; run; proc print data=repertory_name; title 'Little Theater Season Casting Assignments'; title2 'with employee financial information'; run;
Note: The OPTIONS statement extends the line size to 120 so that PROC PRINT can display all variables on one line. Most output in this section is created with line size set to 76 in the OPTIONS statement. An OPTIONS statement appears only in examples using a different line size. When you set the LINESIZE= option, it remains in effect until you reset it or end the SAS session.
The following output displays the merged data set:
Match-Merge with Multiple Observations in a BY Group
Little Theater Season Casting Assignments 1 with employee financial information Obs IdNumber Name Salary Play Role 1 029-46-9261 Rudelich, Herbert 35000 The Glass Menagerie Jim O'Connor 2 029-46-9261 Rudelich, Herbert 35000 The Dear Departed Henry Slater 3 074-53-9892 Vincent, Martina 35000 No Exit Estelle 4 074-53-9892 Vincent, Martina 35000 Happy Days Winnie 5 074-53-9892 Vincent, Martina 35000 The Dear Departed Mrs. Jordan 6 228-88-9649 Benito, Gisela 28000 The Glass Menagerie Amanda Wingfield 7 228-88-9649 Benito, Gisela 28000 The Dear Departed Mrs. Slater 8 442-21-8075 Sirignano, Emily 5000 The Dear Departed Victoria Slater 9 446-93-2122 Harbinger, Nicholas 33900 No Exit Garcin 10 446-93-2122 Harbinger, Nicholas 33900 Happy Days Willie 11 446-93-2122 Harbinger, Nicholas 33900 The Dear Departed Ben Jordan 12 776-84-5391 Phillipon, Marie-Odile 29750 No Exit Inez 13 776-84-5391 Phillipon, Marie-Odile 29750 The Glass Menagerie Laura Wingfield 14 929-75-0218 Gunter, Thomas 27500 No Exit Valet 15 929-75-0218 Gunter, Thomas 27500 The Glass Menagerie Tom Wingfield 16 929-75-0218 Gunter, Thomas 27500 The Dear Departed Abel Merryweather
Carefully examine the first few observations in the new data set and consider how SAS creates them.
Before executing the DATA step, SAS reads the descriptor portion of the two data sets and creates a program data vector that contains all variables from both data sets:
IdNumber is already in the program data vector because it is in FINANCE. SAS sets the values of all variables to missing, as the following figure illustrates.Program Data Vector before Reading from Data Sets
SAS looks at the first BY group in each data set to determine which BY group should appear first. In this case, the first BY group, observations with the value 029-46-9261 for IdNumber, is the same in both data sets.
SAS reads and copies the first observation from FINANCE into the program data vector, as the next figure illustrates.
Program Data Vector after Reading FINANCE Data Set
SAS reads and copies the first observation from REPERTORY into the program data vector, as the next figure illustrates. If a data set does not have any observations in a BY group, then the program data vector contains missing values for the variables that are unique to that data set.
Program Data Vector after Reading REPERTORY Data Set
SAS writes the observation to the new data set and retains the values in the program data vector. (If the program data vector contained variables created by the DATA step, then SAS would set them to missing after writing to the new data set.)
SAS looks for a second observation in the BY group in each data set. REPERTORY has one; FINANCE does not. The MERGE statement reads the second observation in the BY group from REPERTORY. Because FINANCE has only one observation in the BY group, the statement uses the values of Name (Rudelich , Herbert ) and Salary (35000 ) retained in the program data vector for the second observation in the new data set. The next figure illustrates this behavior.
Program Data Vector with Second Observation in the BY Group
SAS writes the observation to the new data set. Neither data set contains any more observations in this BY group. Therefore, as the final figure illustrates, SAS sets all values in the program data vector to missing and begins processing the next BY group. It continues processing observations until it exhausts all observations in both data sets.
Program Data Vector before New BY Groups
Match-Merging Data Sets with Dropped Variables |
Now that casting decisions are final, the director wants to post the casting list, but does not want to include salary or employee ID information. As the next program illustrates, Salary and IdNumber can be eliminated by using the DROP= data set option when creating the new data set.
data newrep (drop=IdNumber); merge finance (drop=Salary) repertory; by IdNumber; run; proc print data=newrep; title 'Final Little Theater Season Casting Assignments'; run;
Note: The difference in placement of the two DROP= data set options is crucial. Dropping IdNumber in the DATA statement means that the variable is available to the MERGE and BY statements (to which it is essential) but that it does not go into the new data set. Dropping Salary in the MERGE statement means that the MERGE statement does not even read this variable, so Salary is unavailable to the program statements. Because the variable Salary is not needed for processing, it is more efficient to prevent it from being read into the PDV in the first place.
The following output displays the merged data set without the IdNumber and Salary variables:Match-Merging Data Sets with Dropped Variables
Final Little Theater Season Casting Assignments 1 Obs Name Play Role 1 Rudelich, Herbert The Glass Menagerie Jim O'Connor 2 Rudelich, Herbert The Dear Departed Henry Slater 3 Vincent, Martina No Exit Estelle 4 Vincent, Martina Happy Days Winnie 5 Vincent, Martina The Dear Departed Mrs. Jordan 6 Benito, Gisela The Glass Menagerie Amanda Wingfield 7 Benito, Gisela The Dear Departed Mrs. Slater 8 Sirignano, Emily The Dear Departed Victoria Slater 9 Harbinger, Nicholas No Exit Garcin 10 Harbinger, Nicholas Happy Days Willie 11 Harbinger, Nicholas The Dear Departed Ben Jordan 12 Phillipon, Marie-Odile No Exit Inez 13 Phillipon, Marie-Odile The Glass Menagerie Laura Wingfield 14 Gunter, Thomas No Exit Valet 15 Gunter, Thomas The Glass Menagerie Tom Wingfield 16 Gunter, Thomas The Dear Departed Abel Merryweather
Match-Merging Data Sets with the Same Variables |
You can match-merge data sets that contain the same variables (variables with the same name) by using the RENAME= data set option, just as you would when performing a one-to-one merge (see Performing a One-to-One Merge on Data Sets with the Same Variables).
If you do not use the RENAME= option and a variable exists in more than one data set, then the value of that variable in the last data set read is the value that goes into the new data set.
Match-Merging Data Sets That Lack a Common Variable |
You can name any number of data sets in the MERGE statement. However, if you are match-merging the data sets, then you must be sure they all have a common variable and are sorted by that variable. If the data sets do not have a common variable, then you might be able to use another data set that has variables common to the original data sets to merge them.
For instance, consider the data sets that are used in the match-merge examples. The table that follows shows the names of the data sets and the names of the variables in each data set.
Data Set | Variables |
---|---|
COMPANY | Name, Age, Gender |
FINANCE | Name, IdNumber, Salary |
REPERTORY | Play, Role, IdNumber |
These data sets do not share a common variable. However, COMPANY and FINANCE share the variable Name. Similarly, FINANCE and REPERTORY share the variable IdNumber. Therefore, as the next program shows, you can merge the data sets into one with two separate DATA steps. As usual, you must sort the data sets by the appropriate BY variable. (REPERTORY is already sorted by IdNumber.)
options linesize=120; /* Sort FINANCE and COMPANY by Name */ proc sort data=finance; by Name; run; proc sort data=company; by Name; run; /* Merge COMPANY and FINANCE into a */ /* temporary data set. */ data temp; merge company finance; by Name; run; proc sort data=temp; by IdNumber; run; /* Merge the temporary data set with REPERTORY */ data all; merge temp repertory; by IdNumber; run; proc print data=all; title 'Little Theater Complete Casting Information'; run;
In order to merge the three data sets, this program
Match-Merging Data Sets That Lack a Common Variable
Little Theater Complete Casting Information 1 Obs Name Age Gender IdNumber Salary Play Role 1 Morrison, Michael 32 M . 2 Rudelich, Herbert 39 M 029-46-9261 35000 The Glass Menagerie Jim O'Connor 3 Rudelich, Herbert 39 M 029-46-9261 35000 The Dear Departed Henry Slater 4 Vincent, Martina 34 F 074-53-9892 35000 No Exit Estelle 5 Vincent, Martina 34 F 074-53-9892 35000 Happy Days Winnie 6 Vincent, Martina 34 F 074-53-9892 35000 The Dear Departed Mrs. Jordan 7 Benito, Gisela 32 F 228-88-9649 28000 The Glass Menagerie Amanda Wingfield 8 Benito, Gisela 32 F 228-88-9649 28000 The Dear Departed Mrs. Slater 9 Sirignano, Emily 12 F 442-21-8075 5000 The Dear Departed Victoria Slater 10 Harbinger, Nicholas 36 M 446-93-2122 33900 No Exit Garcin 11 Harbinger, Nicholas 36 M 446-93-2122 33900 Happy Days Willie 12 Harbinger, Nicholas 36 M 446-93-2122 33900 The Dear Departed Ben Jordan 13 Phillipon, Marie-Odile 28 F 776-84-5391 29750 No Exit Inez 14 Phillipon, Marie-Odile 28 F 776-84-5391 29750 The Glass Menagerie Laura Wingfield 15 Gunter, Thomas 27 M 929-75-0218 27500 No Exit Valet 16 Gunter, Thomas 27 M 929-75-0218 27500 The Glass Menagerie Tom Wingfield 17 Gunter, Thomas 27 M 929-75-0218 27500 The Dear Departed Abel Merryweather
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.