Merging SAS Data Sets

One-to-One Merging

When you use the MERGE statement without a BY statement, SAS combines the first observation in all data sets you name in the MERGE statement into the first observation in the new data set, the second observation in all data sets into the second observation in the new data set, and so on. In a one-to-one merge, the number of observations in the new data set is equal to the number of observations in the largest data set you name in the MERGE statement.

Performing a Simple One-to-One Merge

Input SAS Data Set for Examples

For example, the instructor of a college acting class wants to schedule a conference with each student. One data set, CLASS, contains these variables:

Name	is the student's name.
Year	is the student's year: first, second, third, or fourth.
Major	is the student's area of specialization. This value is always missing for first-year and second-year students, who have not selected a major subject yet.

The following program creates and displays the data set CLASS:

data class;
   input Name $ 1-25 Year $ 26-34 Major $ 36-50;
   datalines;
Abbott, Jennifer         first
Carter, Tom              third     Theater
Kirby, Elissa            fourth    Mathematics
Tucker, Rachel           first
Uhl, Roland              second
Wacenske, Maurice        third     Theater
;

proc print data=class;
   title 'Acting Class Roster';
run;

The following output displays the data set CLASS:

The CLASS Data Set

                              Acting Class Roster                              1

              Obs    Name                 Year         Major

               1     Abbott, Jennifer     first                   
               2     Carter, Tom          third        Theater    
               3     Kirby, Elissa        fourth       Mathematics
               4     Tucker, Rachel       first                   
               5     Uhl, Roland          second                  
               6     Wacenske, Maurice    third        Theater

A second data set contains a list of the dates and times the instructor has scheduled conferences and the rooms in which the conferences are to take place. The following program creates and displays the data set TIME_SLOT. Note the use of the date format and informat.

data time_slot;
   input Date date9.  @12 Time $ @19 Room $;
   format date date9.;
   datalines;
14sep2000  10:00  103
14sep2000  10:30  103
14sep2000  11:00  207
15sep2000  10:00  105
15sep2000  10:30  105
17sep2000  11:00  207
;

proc print data=time_slot;
   title 'Dates, Times, and Locations of Conferences';
run;

The following output displays the data set TIME_SLOT:

The TIME_SLOT Data Set

                   Dates, Times, and Locations of Conferences                  1

                       Obs         Date    Time     Room

                        1     14SEP2000    10:00    103 
                        2     14SEP2000    10:30    103 
                        3     14SEP2000    11:00    207 
                        4     15SEP2000    10:00    105 
                        5     15SEP2000    10:30    105 
                        6     17SEP2000    11:00    207

The Program

The following program performs a one-to-one merge of these data sets, assigning a time slot for a conference to each student in the class.

data schedule;
   merge class time_slot;
run;

proc print data=schedule;
   title 'Student Conference Assignments';
run;

The following output displays the conference schedule data set:

One-to-One Merge

                         Student Conference Assignments                        1

  Obs   Name                Year        Major              Date   Time    Room

   1    Abbott, Jennifer    first                     14SEP2000   10:00   103 
   2    Carter, Tom         third       Theater       14SEP2000   10:30   103 
   3    Kirby, Elissa       fourth      Mathematics   14SEP2000   11:00   207 
   4    Tucker, Rachel      first                     15SEP2000   10:00   105 
   5    Uhl, Roland         second                    15SEP2000   10:30   105 
   6    Wacenske, Maurice   third       Theater       17SEP2000   11:00   207

Explanation

One-to-One Merge shows that the new data set combines the first observation from CLASS with the first observation from TIME_SLOT, the second observation from CLASS with the second observation from TIME_SLOT, and so on.

Performing a One-to-One Merge on Data Sets with the Same Variables

Input SAS Data Set for Examples

The previous example illustrates the simplest case of a one-to-one merge: the data sets contain the same number of observations, all variables have unique names, and you want to keep all variables from both data sets in the new data set. This example merges data sets that contain variables with the same names. Also, the second data set in this example contains one more observation than the first data set. Each data set contains data on a separate acting class.

In addition to the data set CLASS, the instructor also uses the data set CLASS2, which contains the same variables as CLASS but one more observation. The following program creates and displays the data set CLASS2:

data class2;
   input Name $ 1-25 Year $ 26-34 Major $ 36-50;
   datalines;
Hitchcock-Tyler, Erin    second
Keil, Deborah            third     Theater
Nacewicz, Chester        third     Theater
Norgaard, Rolf           second
Prism, Lindsay           fourth    Anthropology
Singh, Rajiv             second
Wittich, Stefan          third     Physics
;

proc print data=class2;
   title 'Acting Class Roster';
   title2 '(second section)';
run;

The following output displays the data set CLASS2:

The CLASS2 Data Set

                              Acting Class Roster                              1
                                (second section)

           Obs    Name                     Year         Major

            1     Hitchcock-Tyler, Erin    second                   
            2     Keil, Deborah            third        Theater     
            3     Nacewicz, Chester        third        Theater     
            4     Norgaard, Rolf           second                   
            5     Prism, Lindsay           fourth       Anthropology
            6     Singh, Rajiv             second                   
            7     Wittich, Stefan          third        Physics

The Program

Instead of scheduling conferences for one class, the instructor wants to schedule acting exercises for pairs of students, one student from each class. The instructor wants to create a data set in which each observation contains the name of one student from each class and the date, time, and location of the exercise. The variables Year and Major should not be in the new data set.

This new data set can be created by merging the data sets CLASS, CLASS2, and TIME_SLOT. Because Year and Major are not wanted in the new data set, the DROP= data set option can be used to drop them. Notice that the data sets CLASS and CLASS2 both contain the variable Name, but the values for Name are different in each data set. To preserve both sets of values, the RENAME= data set option must be used to rename the variable in one of the data sets.

The following program uses these data set options to merge the three data sets:

data exercise;
   merge class (drop=Year Major)
         class2 (drop=Year Major rename=(Name=Name2))
         time_slot;
run;

proc print data=exercise;
   title 'Acting Class Exercise Schedule';
run;

The following output displays the new data set:

Merging Three Data Sets

                         Acting Class Exercise Schedule                        1

Obs    Name                 Name2                         Date    Time     Room

 1     Abbott, Jennifer     Hitchcock-Tyler, Erin    14SEP2000    10:00    103 
 2     Carter, Tom          Keil, Deborah            14SEP2000    10:30    103 
 3     Kirby, Elissa        Nacewicz, Chester        14SEP2000    11:00    207 
 4     Tucker, Rachel       Norgaard, Rolf           15SEP2000    10:00    105 
 5     Uhl, Roland          Prism, Lindsay           15SEP2000    10:30    105 
 6     Wacenske, Maurice    Singh, Rajiv             17SEP2000    11:00    207 
 7                          Wittich, Stefan                  .

Explanation

The following steps describe how SAS merges the data sets:

Before executing the DATA step, SAS reads the descriptor portion of each data set that you name in the MERGE statement. Then SAS creates a program data vector for the new data set that, by default, contains all the variables from all data sets, as well as variables created by the DATA step. In this case, however, the DROP= data set option excludes the variables Year and Major from the program data vector. The RENAME= data set option adds the variable Name2 to the program data vector. Therefore, the program data vector contains the variables Name, Name2, Date, Time, and Room.
SAS sets the value of each variable in the program data vector to missing, as the next figure illustrates.

Program Data Vector before Reading from Data Sets
Next, SAS reads and copies the first observation from each data set into the program data vector (reading the data sets in the same order they appear in the MERGE statement), as the next figure illustrates.

Program Data Vector after Reading from Each Data Set
After processing the first observation from the last data set and executing any other statements in the DATA step, SAS writes the contents of the program data vector to the new data set. If the DATA step attempts to read past the end of a data set, then the values of all variables from that data set in the program data vector are set to missing.

This behavior has two important consequences:
- If a variable exists in more than one data set, then the value from the last data set SAS reads is the value that goes into the new data set, even if that value is missing. If you want to keep all the values for like-named variables from different data sets, then you must rename one or more of the variables with the RENAME= data set option so that each variable has a unique name.
- After SAS processes all observations in a data set, the program data vector and all subsequent observations in the new data set have missing values for the variables unique to that data set. So, as the next figure shows, the program data vector for the last observation in the new data set contains missing values for all variables except Name2.
Program Data Vector for the Last Observation
SAS continues to merge observations until it has copied all observations from all data sets.

Top of Page