Updating SAS Data Sets

Understanding the Differences between Updating and Merging

General Comparisons between Updating and Merging

The MERGE statement and the UPDATE statement both match observations from two SAS data sets; however, the two statements differ significantly. It is important to distinguish between the two processes and to choose the one that is appropriate for your application.

The most straightforward differences are as follows:

The UPDATE statement uses only two data sets. The number of data sets that the MERGE statement can use is limited only by machine-dependent factors such as memory and disk space.
A BY statement must accompany an UPDATE statement. The MERGE statement performs a one-to-one merge if no BY statement follows it.
The two statements also process observations differently when a data set contains missing values or multiple observations in a BY group.

To illustrate the differences, compare updating the SAS data set MAIL_LIST with the data set MAIL_TRANS to merging the two data sets. You have already seen the results of updating in the example that created Updating a Data Set. That output appears again in the following output for easy comparison.

Updating a Data Set

                             Magazine Mailing List                             1
                          (updated for current month)

                            S
      S                     t
      u                     r
      b                     e                                       P
      s                     e                                 S     o
      c                     t                                 t     s
      r                     A                                 a     t    C
      i                     d                                 t     a    o
      b                     d                                 e     l    u
      e  N                  r                   C             P     C    n
  O   r  a                  e                   i             r     o    t
  b   I  m                  s                   t             o     d    r
  s   d  e                  s                   y             v     e    y

  1 1001 Ericson, Jane      111 Clancey Court   Chapel Hill   NC 27516   USA   
  2 1002 Dix-Rosen, Martin  P.O. Box 1850       Seattle       WA 98101   USA   
  3 1003 Gabrielli, Theresa Via Pisanelli, 25   Roma             00196   Italy 
  4 1004 Clayton, Aria      14 Bridge St.       San Francisco CA 94124   USA   
  5 1005 Archuleta, Ruby    Box 108             Milagro       NM 87429   USA   
  6 1006 Misiewicz, Jeremy  932 Webster St.     Madison       WI 53704   USA   
  7 1007 Ahmadi, Hafez      52 Rue Marston      Paris            75019   France
  8 1008 Jacobson, Becky    1 Lincoln St.       Tallahassee   FL 32312   USA   
  9 1009 An, Ing            2540 Pleasant St.   Calgary       AB T2P 4H2 Canada
 10 1010 Slater, Emily      1009 Cherry St.     York          PA 17407   USA   
 11 1011 Mitchell, Wayne    28 Morningside Dr.  New York      NY 10017   USA   
 12 1012 Stavros, Gloria    212 Northampton Rd. South Hadley  MA 01075   USA

In contrast, the following program merges the two data sets.

data mail_merged;
   merge mail_list mail_trans;
   by SubscriberId;
run;

proc print data=mail_merged;
   title 'Magazine Mailing List';
run;

The following output shows the results of the merge:

Results of Merging the Master and Transaction Data Sets

                             Magazine Mailing List                             1

                            S
      S                     t
      u                     r
      b                     e                                       P
      s                     e                                 S     o
      c                     t                                 t     s
      r                     A                                 a     t    C
      i                     d                                 t     a    o
      b                     d                                 e     l    u
      e          N          r                   C             P     C    n
  O   r          a          e                   i             r     o    t
  b   I          m          s                   t             o     d    r
  s   d          e          s                   y             v     e    y

  1 1001                                                         27516         
  2 1002 Dix-Rosen, Martin                                                     
  3 1002                    P.O. Box 1850       Seattle       WA 98101   USA   
  4 1003 Gabrielli, Theresa Via Pisanelli, 25   Roma             00196   Italy 
  5 1004 Clayton, Aria      14 Bridge St.       San Francisco CA 94124   USA   
  6 1005 Archuleta, Ruby    Box 108             Milagro       NM 87429   USA   
  7 1006                    932 Webster St.                                    
  8 1007 Ahmadi, Hafez      52 Rue Marston      Paris            75019   France
  9 1008 Jacobson, Becky    1 Lincoln St.       Tallahassee   FL 32312   USA   
 10 1009                    2540 Pleasant St.   Calgary       AB T2P 4H2       
 11 1010 Slater, Emily      1009 Cherry St.     York          PA 17407   USA   
 12 1011 Mitchell, Wayne    28 Morningside Dr.  New York      NY 10017   USA   
 13 1012 Stavros, Gloria    212 Northampton Rd. South Hadley  MA 01075   USA

The MERGE statement produces a data set containing 13 observations, whereas UPDATE produces a data set containing 12 observations. In addition, merging the data sets results in several missing values, whereas updating does not. Obviously, using the wrong statement may result in incorrect data. The differences between the merged and updated data sets result from the ways the two statements handle missing values and multiple observations in a BY group.

How the UPDATE and MERGE Statements Process Missing Values Differently

During an update, if a value for a variable is missing in the transaction data set, SAS uses the value from the master data set when it writes the observation to the new data set. When merging the same observations, SAS overwrites the value in the program data vector with the missing value. For example, the following observation exists in data set MAILING.MASTER.

1001    ERICSON, JANE      111 CLANCEY COURT   CHAPEL HILL    NC  27514

The following corresponding observation exists in MAILING.TRANS.

1001                                                              27516

Updating combines the two observations and creates the following observation:

1001    ERICSON, JANE      111 CLANCEY COURT   CHAPEL HILL    NC  27516

Merging combines the two observations and creates this observation:

1001                                                              27516

How the UPDATE and MERGE Statements Process Multiple Observations in a BY Group Differently

SAS does not write an updated observation to the new data set until it has applied all the transactions in a BY group. When merging data sets, SAS writes one new observation for each observation in the data set with the largest number of observations in the BY group. For example, consider this observation from MAILING.MASTER:

1002    DIX, MARTIN        4 SHEPHERD ST.      NORWICH        VT  05055

and the corresponding observations from MAILING.TRANS:

1002    DIX-ROSEN, MARTIN
1002                       R.R. 2, BOX 1850    HANOVER        NH  03755

The UPDATE statement applies both transactions and combines these observations into a single one:

1002    DIX-ROSEN, MARTIN  R.R. 2, BOX 1850    HANOVER        NH  03755

The MERGE statement, on the other hand, first merges the observation from MAILING.MASTER with the first observation in the corresponding BY group in MAILING.TRANS. All values of variables from the observation in MAILING.TRANS are used, even if they are missing. Then SAS writes the observation to the new data set:

1002    DIX-ROSEN, MARTIN

Next, SAS looks for other observations in the same BY group in each data set. Because more observations are in the BY group in MAILING.TRANS, all the values in the program data vector are retained. SAS merges them with the second observation in the BY group from MAILING.TRANS and writes the result to the new data set:

1002                       R.R. 2, BOX 1850    HANOVER        NH  03755

Therefore, merging creates two observations for the new data set, whereas updating creates only one.

Top of Page