Updating SAS Data Sets |
In this example, the circulation department of a magazine maintains a mailing list that contains tens of thousands of names. Each issue of the magazine contains a form for readers to fill out when they change their names or addresses. To simplify the maintenance job, the form requests that readers send only new information. New subscribers can start a subscription by completing the entire form. When a form is received, a data entry operator enters the information on the form into a raw data file. The mailing list is updated once per month from the raw data file.
The mailing list includes these variables for each subscriber:
The following program creates and displays the first part of this data set. The raw data are already sorted by SubscriberId.
options pagesize=60 linesize=80 pageno=1 nodate; data mail_list; input SubscriberId 1-8 Name $ 9-27 StreetAddress $ 28-47 City $ 48-62 StateProv $ 63-64 PostalCode $ 67-73 Country $ ; datalines; 1001 Ericson, Jane 111 Clancey Court Chapel Hill NC 27514 USA 1002 Dix, Martin 4 Shepherd St. Vancouver BC V6C 3E8 Canada 1003 Gabrielli, Theresa Via Pisanelli, 25 Roma 00196 Italy 1004 Clayton, Aria 14 Bridge St. San Francisco CA 94124 USA 1005 Archuleta, Ruby Box 108 Milagro NM 87429 USA 1006 Misiewicz, Jeremy 43-C Lakeview Apts. Madison WI 53704 USA 1007 Ahmadi, Hafez 52 Rue Marston Paris 75019 France 1008 Jacobson, Becky 1 Lincoln St. Tallahassee FL 32312 USA 1009 An, Ing 95 Willow Dr. Toronto ON M5J 2T3 Canada 1010 Slater, Emily 1009 Cherry St. York PA 17407 USA ...more data lines... ; proc print data=mail_list (obs=10); title 'Magazine Master Mailing List'; run;
The following output shows the results:
Magazine Master Mailing List 1 S S t u r b e P s e S o c t t s r A a t C i d t a o b d e l u e N r C P C n O r a e i r o t b I m s t o d r s d e s y v e y 1 1001 Ericson, Jane 111 Clancey Court Chapel Hill NC 27514 USA 2 1002 Dix, Martin 4 Shepherd St. Vancouver BC V6C 3E8 Canada 3 1003 Gabrielli, Theresa Via Pisanelli, 25 Roma 00196 Italy 4 1004 Clayton, Aria 14 Bridge St. San Francisco CA 94124 USA 5 1005 Archuleta, Ruby Box 108 Milagro NM 87429 USA 6 1006 Misiewicz, Jeremy 43-C Lakeview Apts. Madison WI 53704 USA 7 1007 Ahmadi, Hafez 52 Rue Marston Paris 75019 France 8 1008 Jacobson, Becky 1 Lincoln St. Tallahassee FL 32312 USA 9 1009 An, Ing 95 Willow Dr. Toronto ON M5J 2T3 Canada 10 1010 Slater, Emily 1009 Cherry St. York PA 17407 USA
This month the information that follows is received for updating the mailing list:
Jeremy Misiewicz moved to a new street address. His city, state, and postal code remain the same.
Martin Dix-Rosen, shortly after changing his name, moved from Vancouver, British Columbia, to Seattle, Washington.
Two new subscribers joined the list. They are given SubscriberID numbers 1011 and 1012.
Each change is entered into the raw data file as soon as it is received. In each case, only the customer's SubscriberId and the new information are entered. The raw data file looks like this:
1002 Dix-Rosen, Martin 1001 27516 1006 932 Webster St. 1009 2540 Pleasant St. Calgary AB T2P 4H2 1011 Mitchell, Wayne 28 Morningside Dr. New York NY 10017 USA 1002 P.O. Box 1850 Seattle WA 98101 USA 1012 Stavros, Gloria 212 Northampton Rd. South Hadley MA 01075 USA
The data is in fixed columns, matching the INPUT statement that created MAIL_LIST.
First, you must transform the raw data into a SAS data set and sort that data set by SubscriberId so that you can use it to update the master list.
data mail_trans; infile 'your-input-file' missover; input SubscriberId 1-8 Name $ 9-27 StreetAddress $ 28-47 City $ 48-62 StateProv $ 63-64 PostalCode $ 67-73 Country $ 75-80; run; proc sort data=mail_trans; by SubscriberId; run;
proc print data=mail_trans; title 'Magazine Mailing List Changes'; title2 '(for current month)'; run;
Note the MISSOVER option in the INFILE statement. The MISSOVER option prevents the INPUT statement from going to a new line to search for values for variables which have not received values; instead, any variables that have not received values are set to missing. For example, when the first record is read, the end of the record is encountered before any value has been assigned to the Country variable; instead of going to the next record to search for a value for Country, the Country variable is assigned a missing value. For more information about the MISSOVER option, see Starting with Raw Data: Beyond the Basics.
The following output shows the sorted data set MAIL_TRANS:
Magazine Mailing List Changes 1 (for current month) S S t u r b e P s e S o c t t s r A a t C i d t a o b d e l u e N r C P C n O r a e i r o t b I m s t o d r s d e s y v e y 1 1001 27516 2 1002 Dix-Rosen, Martin 3 1002 P.O. Box 1850 Seattle WA 98101 USA 4 1006 932 Webster St. 5 1009 2540 Pleasant St. Calgary AB T2P 4H2 6 1011 Mitchell, Wayne 28 Morningside Dr. New York NY 10017 USA 7 1012 Stavros, Gloria 212 Northampton Rd. South Hadley MA 01075 USA
Now that the new data are in a sorted SAS data set, the following program updates the mailing list.
data mail_newlist; update mail_list mail_trans; by SubscriberId; run; proc print data=mail_newlist; title 'Magazine Mailing List'; title2 '(updated for current month)'; run;
The following output shows the resulting data set MAIL_NEWLIST:
Magazine Mailing List 1 (updated for current month) S S t u r b e P s e S o c t t s r A a t C i d t a o b d e l u e N r C P C n O r a e i r o t b I m s t o d r s d e s y v e y 1 1001 Ericson, Jane 111 Clancey Court Chapel Hill NC 27516 USA 2 1002 Dix-Rosen, Martin P.O. Box 1850 Seattle WA 98101 USA 3 1003 Gabrielli, Theresa Via Pisanelli, 25 Roma 00196 Italy 4 1004 Clayton, Aria 14 Bridge St. San Francisco CA 94124 USA 5 1005 Archuleta, Ruby Box 108 Milagro NM 87429 USA 6 1006 Misiewicz, Jeremy 932 Webster St. Madison WI 53704 USA 7 1007 Ahmadi, Hafez 52 Rue Marston Paris 75019 France 8 1008 Jacobson, Becky 1 Lincoln St. Tallahassee FL 32312 USA 9 1009 An, Ing 2540 Pleasant St. Calgary AB T2P 4H2 Canada 10 1010 Slater, Emily 1009 Cherry St. York PA 17407 USA 11 1011 Mitchell, Wayne 28 Morningside Dr. New York NY 10017 USA 12 1012 Stavros, Gloria 212 Northampton Rd. South Hadley MA 01075 USA
The data for subscriber 1002, who has two update transactions, is used below to show what happens when you update an observation in the master data set with corresponding observations from the transaction data set.
Before executing the DATA step, SAS reads the descriptor portion of each data set named in the UPDATE statement and, by default, creates a program data vector that contains all the variables from all data sets. As the following figure illustrates, SAS sets the value of each variable to missing. (Use the DROP= or KEEP= data set option to exclude one or more variables.)
Program Data Vector before Execution of the DATA Step
Next, SAS reads the first observation from the master data set and copies it into the program data vector, as the following figure illustrates.
Program Data Vector after Reading the First Observation from the Master Data Set
SAS applies the first transaction by copying all nonmissing values (the value of Name) from the first observation in this BY group (ID=1002) into the program data vector, as the following figure illustrates.
Program Data Vector after Applying the First Transaction
After completing this transaction, SAS looks for another observation in the same BY group in the transaction data set. If it finds a second observation with the same value for ID, then it applies the second transaction too (new values for StreetAddress, City, StateProv, PostalCode, and Country). Now the observation contains the new values from both transactions, as the following figure illustrates.
Program Data Vector after Applying the Second Transaction
After completing the second transaction, SAS looks for a third observation in the same BY group. Because no such observation exists, it writes the observation in its current form to the new data set and sets the values in the program data vector to missing.
As the DATA step iterates, the UPDATE statement continues processing observations in this way until it reaches the end of the master and transaction data sets. The two observations in the transaction data set that describe new subscribers (and therefore have no corresponding observation in the master data set) become observations in the new data set.
Remember that if there are duplicate observations in the master data set, all matching observations in the transaction data set are applied only to the first of the duplicate observations in the master data set.
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.