Merging SAS Data Sets |
Definition of One-to-One Merging |
When you use the MERGE statement without a BY statement, SAS combines the first observation in all data sets you name in the MERGE statement into the first observation in the new data set, the second observation in all data sets into the second observation in the new data set, and so on. In a one-to-one merge, the number of observations in the new data set is equal to the number of observations in the largest data set you name in the MERGE statement.
Performing a Simple One-to-One Merge |
For example, the instructor of a college acting class wants to schedule a conference with each student. One data set, CLASS, contains these variables:
Name | |
Year | |
Major |
is the student's area of specialization. This value is always missing for first-year and second-year students, who have not selected a major subject yet. |
The following program creates and displays the data set CLASS:
data class; input Name $ 1-25 Year $ 26-34 Major $ 36-50; datalines; Abbott, Jennifer first Carter, Tom third Theater Kirby, Elissa fourth Mathematics Tucker, Rachel first Uhl, Roland second Wacenske, Maurice third Theater ;
proc print data=class; title 'Acting Class Roster'; run;
The following output displays the data set CLASS:
Acting Class Roster 1 Obs Name Year Major 1 Abbott, Jennifer first 2 Carter, Tom third Theater 3 Kirby, Elissa fourth Mathematics 4 Tucker, Rachel first 5 Uhl, Roland second 6 Wacenske, Maurice third Theater
A second data set contains a list of the dates and times the instructor has scheduled conferences and the rooms in which the conferences are to take place. The following program creates and displays the data set TIME_SLOT. Note the use of the date format and informat.
data time_slot; input Date date9. @12 Time $ @19 Room $; format date date9.; datalines; 14sep2000 10:00 103 14sep2000 10:30 103 14sep2000 11:00 207 15sep2000 10:00 105 15sep2000 10:30 105 17sep2000 11:00 207 ;
proc print data=time_slot; title 'Dates, Times, and Locations of Conferences'; run;
The following output displays the data set TIME_SLOT:
Dates, Times, and Locations of Conferences 1 Obs Date Time Room 1 14SEP2000 10:00 103 2 14SEP2000 10:30 103 3 14SEP2000 11:00 207 4 15SEP2000 10:00 105 5 15SEP2000 10:30 105 6 17SEP2000 11:00 207
The following program performs a one-to-one merge of these data sets, assigning a time slot for a conference to each student in the class.
data schedule; merge class time_slot; run; proc print data=schedule; title 'Student Conference Assignments'; run;
The following output displays the conference schedule data set:
Student Conference Assignments 1 Obs Name Year Major Date Time Room 1 Abbott, Jennifer first 14SEP2000 10:00 103 2 Carter, Tom third Theater 14SEP2000 10:30 103 3 Kirby, Elissa fourth Mathematics 14SEP2000 11:00 207 4 Tucker, Rachel first 15SEP2000 10:00 105 5 Uhl, Roland second 15SEP2000 10:30 105 6 Wacenske, Maurice third Theater 17SEP2000 11:00 207
One-to-One Merge shows that the new data set combines the first observation from CLASS with the first observation from TIME_SLOT, the second observation from CLASS with the second observation from TIME_SLOT, and so on.
Performing a One-to-One Merge on Data Sets with the Same Variables |
The previous example illustrates the simplest case of a one-to-one merge: the data sets contain the same number of observations, all variables have unique names, and you want to keep all variables from both data sets in the new data set. This example merges data sets that contain variables with the same names. Also, the second data set in this example contains one more observation than the first data set. Each data set contains data on a separate acting class.
In addition to the data set CLASS, the instructor also uses the data set CLASS2, which contains the same variables as CLASS but one more observation. The following program creates and displays the data set CLASS2:
data class2; input Name $ 1-25 Year $ 26-34 Major $ 36-50; datalines; Hitchcock-Tyler, Erin second Keil, Deborah third Theater Nacewicz, Chester third Theater Norgaard, Rolf second Prism, Lindsay fourth Anthropology Singh, Rajiv second Wittich, Stefan third Physics ;
proc print data=class2; title 'Acting Class Roster'; title2 '(second section)'; run;
The following output displays the data set CLASS2:
Acting Class Roster 1 (second section) Obs Name Year Major 1 Hitchcock-Tyler, Erin second 2 Keil, Deborah third Theater 3 Nacewicz, Chester third Theater 4 Norgaard, Rolf second 5 Prism, Lindsay fourth Anthropology 6 Singh, Rajiv second 7 Wittich, Stefan third Physics
Instead of scheduling conferences for one class, the instructor wants to schedule acting exercises for pairs of students, one student from each class. The instructor wants to create a data set in which each observation contains the name of one student from each class and the date, time, and location of the exercise. The variables Year and Major should not be in the new data set.
This new data set can be created by merging the data sets CLASS, CLASS2, and TIME_SLOT. Because Year and Major are not wanted in the new data set, the DROP= data set option can be used to drop them. Notice that the data sets CLASS and CLASS2 both contain the variable Name, but the values for Name are different in each data set. To preserve both sets of values, the RENAME= data set option must be used to rename the variable in one of the data sets.
The following program uses these data set options to merge the three data sets:
data exercise; merge class (drop=Year Major) class2 (drop=Year Major rename=(Name=Name2)) time_slot; run; proc print data=exercise; title 'Acting Class Exercise Schedule'; run;
The following output displays the new data set:
Acting Class Exercise Schedule 1 Obs Name Name2 Date Time Room 1 Abbott, Jennifer Hitchcock-Tyler, Erin 14SEP2000 10:00 103 2 Carter, Tom Keil, Deborah 14SEP2000 10:30 103 3 Kirby, Elissa Nacewicz, Chester 14SEP2000 11:00 207 4 Tucker, Rachel Norgaard, Rolf 15SEP2000 10:00 105 5 Uhl, Roland Prism, Lindsay 15SEP2000 10:30 105 6 Wacenske, Maurice Singh, Rajiv 17SEP2000 11:00 207 7 Wittich, Stefan .
The following steps describe how SAS merges the data sets:
Before executing the DATA step, SAS reads the descriptor portion of each data set that you name in the MERGE statement. Then SAS creates a program data vector for the new data set that, by default, contains all the variables from all data sets, as well as variables created by the DATA step. In this case, however, the DROP= data set option excludes the variables Year and Major from the program data vector. The RENAME= data set option adds the variable Name2 to the program data vector. Therefore, the program data vector contains the variables Name, Name2, Date, Time, and Room.
SAS sets the value of each variable in the program data vector to missing, as the next figure illustrates.
Program Data Vector before Reading from Data Sets
Next, SAS reads and copies the first observation from each data set into the program data vector (reading the data sets in the same order they appear in the MERGE statement), as the next figure illustrates.
Program Data Vector after Reading from Each Data Set
After processing the first observation from the last data set and executing any other statements in the DATA step, SAS writes the contents of the program data vector to the new data set. If the DATA step attempts to read past the end of a data set, then the values of all variables from that data set in the program data vector are set to missing.
This behavior has two important consequences:
If a variable exists in more than one data set, then the value from the last data set SAS reads is the value that goes into the new data set, even if that value is missing. If you want to keep all the values for like-named variables from different data sets, then you must rename one or more of the variables with the RENAME= data set option so that each variable has a unique name.
After SAS processes all observations in a data set, the program data vector and all subsequent observations in the new data set have missing values for the variables unique to that data set. So, as the next figure shows, the program data vector for the last observation in the new data set contains missing values for all variables except Name2.
Program Data Vector for the Last Observation
SAS continues to merge observations until it has copied all observations from all data sets.
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.