Updating SAS Data Sets |
General Comparisons between Updating and Merging |
The MERGE statement and the UPDATE statement both match observations from two SAS data sets; however, the two statements differ significantly. It is important to distinguish between the two processes and to choose the one that is appropriate for your application.
The most straightforward differences are as follows:
The UPDATE statement uses only two data sets. The number of data sets that the MERGE statement can use is limited only by machine-dependent factors such as memory and disk space.
A BY statement must accompany an UPDATE statement. The MERGE statement performs a one-to-one merge if no BY statement follows it.
The two statements also process observations differently when a data set contains missing values or multiple observations in a BY group.
To illustrate the differences, compare updating the SAS data set MAIL_LIST with the data set MAIL_TRANS to merging the two data sets. You have already seen the results of updating in the example that created Updating a Data Set. That output appears again in the following output for easy comparison.
Magazine Mailing List 1 (updated for current month) S S t u r b e P s e S o c t t s r A a t C i d t a o b d e l u e N r C P C n O r a e i r o t b I m s t o d r s d e s y v e y 1 1001 Ericson, Jane 111 Clancey Court Chapel Hill NC 27516 USA 2 1002 Dix-Rosen, Martin P.O. Box 1850 Seattle WA 98101 USA 3 1003 Gabrielli, Theresa Via Pisanelli, 25 Roma 00196 Italy 4 1004 Clayton, Aria 14 Bridge St. San Francisco CA 94124 USA 5 1005 Archuleta, Ruby Box 108 Milagro NM 87429 USA 6 1006 Misiewicz, Jeremy 932 Webster St. Madison WI 53704 USA 7 1007 Ahmadi, Hafez 52 Rue Marston Paris 75019 France 8 1008 Jacobson, Becky 1 Lincoln St. Tallahassee FL 32312 USA 9 1009 An, Ing 2540 Pleasant St. Calgary AB T2P 4H2 Canada 10 1010 Slater, Emily 1009 Cherry St. York PA 17407 USA 11 1011 Mitchell, Wayne 28 Morningside Dr. New York NY 10017 USA 12 1012 Stavros, Gloria 212 Northampton Rd. South Hadley MA 01075 USA
In contrast, the following program merges the two data sets.
data mail_merged; merge mail_list mail_trans; by SubscriberId; run; proc print data=mail_merged; title 'Magazine Mailing List'; run;
The following output shows the results of the merge:
Results of Merging the Master and Transaction Data Sets
Magazine Mailing List 1 S S t u r b e P s e S o c t t s r A a t C i d t a o b d e l u e N r C P C n O r a e i r o t b I m s t o d r s d e s y v e y 1 1001 27516 2 1002 Dix-Rosen, Martin 3 1002 P.O. Box 1850 Seattle WA 98101 USA 4 1003 Gabrielli, Theresa Via Pisanelli, 25 Roma 00196 Italy 5 1004 Clayton, Aria 14 Bridge St. San Francisco CA 94124 USA 6 1005 Archuleta, Ruby Box 108 Milagro NM 87429 USA 7 1006 932 Webster St. 8 1007 Ahmadi, Hafez 52 Rue Marston Paris 75019 France 9 1008 Jacobson, Becky 1 Lincoln St. Tallahassee FL 32312 USA 10 1009 2540 Pleasant St. Calgary AB T2P 4H2 11 1010 Slater, Emily 1009 Cherry St. York PA 17407 USA 12 1011 Mitchell, Wayne 28 Morningside Dr. New York NY 10017 USA 13 1012 Stavros, Gloria 212 Northampton Rd. South Hadley MA 01075 USA
The MERGE statement produces a data set containing 13 observations, whereas UPDATE produces a data set containing 12 observations. In addition, merging the data sets results in several missing values, whereas updating does not. Obviously, using the wrong statement may result in incorrect data. The differences between the merged and updated data sets result from the ways the two statements handle missing values and multiple observations in a BY group.
How the UPDATE and MERGE Statements Process Missing Values Differently |
During an update, if a value for a variable is missing in the transaction data set, SAS uses the value from the master data set when it writes the observation to the new data set. When merging the same observations, SAS overwrites the value in the program data vector with the missing value. For example, the following observation exists in data set MAILING.MASTER.
1001 ERICSON, JANE 111 CLANCEY COURT CHAPEL HILL NC 27514
The following corresponding observation exists in MAILING.TRANS.
1001 27516
Updating combines the two observations and creates the following observation:
1001 ERICSON, JANE 111 CLANCEY COURT CHAPEL HILL NC 27516
Merging combines the two observations and creates this observation:
1001 27516
How the UPDATE and MERGE Statements Process Multiple Observations in a BY Group Differently |
SAS does not write an updated observation to the new data set until it has applied all the transactions in a BY group. When merging data sets, SAS writes one new observation for each observation in the data set with the largest number of observations in the BY group. For example, consider this observation from MAILING.MASTER:
1002 DIX, MARTIN 4 SHEPHERD ST. NORWICH VT 05055
and the corresponding observations from MAILING.TRANS:
1002 DIX-ROSEN, MARTIN 1002 R.R. 2, BOX 1850 HANOVER NH 03755
The UPDATE statement applies both transactions and combines these observations into a single one:
1002 DIX-ROSEN, MARTIN R.R. 2, BOX 1850 HANOVER NH 03755
The MERGE statement, on the other hand, first merges the observation from MAILING.MASTER with the first observation in the corresponding BY group in MAILING.TRANS. All values of variables from the observation in MAILING.TRANS are used, even if they are missing. Then SAS writes the observation to the new data set:
1002 DIX-ROSEN, MARTIN
Next, SAS looks for other observations in the same BY group in each data set. Because more observations are in the BY group in MAILING.TRANS, all the values in the program data vector are retained. SAS merges them with the second observation in the BY group from MAILING.TRANS and writes the result to the new data set:
1002 R.R. 2, BOX 1850 HANOVER NH 03755
Therefore, merging creates two observations for the new data set, whereas updating creates only one.
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.