Updating SAS Data Sets

Updating a Data Set

In this example, the circulation department of a magazine maintains a mailing list that contains tens of thousands of names. Each issue of the magazine contains a form for readers to fill out when they change their names or addresses. To simplify the maintenance job, the form requests that readers send only new information. New subscribers can start a subscription by completing the entire form. When a form is received, a data entry operator enters the information on the form into a raw data file. The mailing list is updated once per month from the raw data file.

The mailing list includes these variables for each subscriber:

SubscriberId	is a unique number assigned to the subscriber at the time the subscription begins. A subscriber's SubscriberId never changes.
Name	is the subscriber's name. The last name appears first, followed by a comma and the first name.
StreetAddress	is the subscriber's street address.
City	is the subscriber's city.
StateProv	is the subscriber's state or province. This variable is missing for addresses outside the United States and Canada.
PostalCode	is the subscriber's postal code (zip code for addresses in the United States).
Country	is the subscriber's country.

The following program creates and displays the first part of this data set. The raw data are already sorted by SubscriberId.

options pagesize=60 linesize=80 pageno=1 nodate;

data mail_list;
   input SubscriberId 1-8 Name $ 9-27 StreetAddress $ 28-47 City $ 48-62
         StateProv $ 63-64 PostalCode $ 67-73 Country $ ;
   datalines;
1001    Ericson, Jane      111 Clancey Court   Chapel Hill    NC  27514   USA
1002    Dix, Martin        4 Shepherd St.      Vancouver      BC  V6C 3E8 Canada
1003    Gabrielli, Theresa Via Pisanelli, 25   Roma               00196   Italy
1004    Clayton, Aria      14 Bridge St.       San Francisco  CA  94124   USA
1005    Archuleta, Ruby    Box 108             Milagro        NM  87429   USA
1006    Misiewicz, Jeremy  43-C Lakeview Apts. Madison        WI  53704   USA
1007    Ahmadi, Hafez      52 Rue Marston      Paris              75019   France
1008    Jacobson, Becky    1 Lincoln St.       Tallahassee    FL  32312   USA
1009    An, Ing            95 Willow Dr.       Toronto        ON  M5J 2T3 Canada
1010    Slater, Emily      1009 Cherry St.     York           PA  17407   USA

...more data lines...
;

proc print data=mail_list (obs=10);
   title 'Magazine Master Mailing List';
run;

The following output shows the results:

The MAIL_LIST Data Set

                          Magazine Master Mailing List                         1

                            S
      S                     t
      u                     r
      b                     e                                       P
      s                     e                                 S     o
      c                     t                                 t     s
      r                     A                                 a     t    C
      i                     d                                 t     a    o
      b                     d                                 e     l    u
      e  N                  r                   C             P     C    n
  O   r  a                  e                   i             r     o    t
  b   I  m                  s                   t             o     d    r
  s   d  e                  s                   y             v     e    y

  1 1001 Ericson, Jane      111 Clancey Court   Chapel Hill   NC 27514   USA   
  2 1002 Dix, Martin        4 Shepherd St.      Vancouver     BC V6C 3E8 Canada
  3 1003 Gabrielli, Theresa Via Pisanelli, 25   Roma             00196   Italy 
  4 1004 Clayton, Aria      14 Bridge St.       San Francisco CA 94124   USA   
  5 1005 Archuleta, Ruby    Box 108             Milagro       NM 87429   USA   
  6 1006 Misiewicz, Jeremy  43-C Lakeview Apts. Madison       WI 53704   USA   
  7 1007 Ahmadi, Hafez      52 Rue Marston      Paris            75019   France
  8 1008 Jacobson, Becky    1 Lincoln St.       Tallahassee   FL 32312   USA   
  9 1009 An, Ing            95 Willow Dr.       Toronto       ON M5J 2T3 Canada
 10 1010 Slater, Emily      1009 Cherry St.     York          PA 17407   USA

This month the information that follows is received for updating the mailing list:

Martin Dix changed his name to Martin Dix-Rosen.
Jane Ericson's postal code changed.
Jeremy Misiewicz moved to a new street address. His city, state, and postal code remain the same.
Ing An moved from Toronto, Ontario, to Calgary, Alberta.
Martin Dix-Rosen, shortly after changing his name, moved from Vancouver, British Columbia, to Seattle, Washington.
Two new subscribers joined the list. They are given SubscriberID numbers 1011 and 1012.

Each change is entered into the raw data file as soon as it is received. In each case, only the customer's SubscriberId and the new information are entered. The raw data file looks like this:

1002    Dix-Rosen, Martin
1001                                                              27516
1006                       932 Webster St.
1009                       2540 Pleasant St.   Calgary        AB  T2P 4H2
1011    Mitchell, Wayne    28 Morningside Dr.  New York       NY  10017   USA
1002                       P.O. Box 1850       Seattle        WA  98101   USA
1012    Stavros, Gloria    212 Northampton Rd. South Hadley   MA  01075   USA

The data is in fixed columns, matching the INPUT statement that created MAIL_LIST.

First, you must transform the raw data into a SAS data set and sort that data set by SubscriberId so that you can use it to update the master list.

data mail_trans;
   infile 'your-input-file' missover;
   input SubscriberId 1-8 Name $ 9-27 StreetAddress $ 28-47 City $ 48-62
         StateProv $ 63-64 PostalCode $ 67-73 Country $ 75-80;
run;

proc sort data=mail_trans;
   by SubscriberId;
run;

proc print data=mail_trans;
   title 'Magazine Mailing List Changes';
   title2 '(for current month)';
run;

Note the MISSOVER option in the INFILE statement. The MISSOVER option prevents the INPUT statement from going to a new line to search for values for variables which have not received values; instead, any variables that have not received values are set to missing. For example, when the first record is read, the end of the record is encountered before any value has been assigned to the Country variable; instead of going to the next record to search for a value for Country, the Country variable is assigned a missing value. For more information about the MISSOVER option, see Starting with Raw Data: Beyond the Basics.

The following output shows the sorted data set MAIL_TRANS:

The MAIL_TRANS Data Set

                         Magazine Mailing List Changes                         1
                              (for current month)

                                     S
     S                               t
     u                               r
     b                               e                                P
     s                               e                         S      o
     c                               t                         t      s
     r                               A                         a      t      C
     i                               d                         t      a      o
     b                               d                         e      l      u
     e           N                   r           C             P      C      n
O    r           a                   e           i             r      o      t
b    I           m                   s           t             o      d      r
s    d           e                   s           y             v      e      y

1  1001                                                            27516       
2  1002  Dix-Rosen, Martin                                                     
3  1002                     P.O. Box 1850        Seattle       WA  98101    USA
4  1006                     932 Webster St.                                    
5  1009                     2540 Pleasant St.    Calgary       AB  T2P 4H2     
6  1011  Mitchell, Wayne    28 Morningside Dr.   New York      NY  10017    USA
7  1012  Stavros, Gloria    212 Northampton Rd.  South Hadley  MA  01075    USA

Now that the new data are in a sorted SAS data set, the following program updates the mailing list.

data mail_newlist;
   update mail_list mail_trans;
   by SubscriberId;
run;

proc print data=mail_newlist;
   title 'Magazine Mailing List';
   title2 '(updated for current month)';
run;

The following output shows the resulting data set MAIL_NEWLIST:

Updating a Data Set

                             Magazine Mailing List                             1
                          (updated for current month)

                            S
      S                     t
      u                     r
      b                     e                                       P
      s                     e                                 S     o
      c                     t                                 t     s
      r                     A                                 a     t    C
      i                     d                                 t     a    o
      b                     d                                 e     l    u
      e  N                  r                   C             P     C    n
  O   r  a                  e                   i             r     o    t
  b   I  m                  s                   t             o     d    r
  s   d  e                  s                   y             v     e    y

  1 1001 Ericson, Jane      111 Clancey Court   Chapel Hill   NC 27516   USA   
  2 1002 Dix-Rosen, Martin  P.O. Box 1850       Seattle       WA 98101   USA   
  3 1003 Gabrielli, Theresa Via Pisanelli, 25   Roma             00196   Italy 
  4 1004 Clayton, Aria      14 Bridge St.       San Francisco CA 94124   USA   
  5 1005 Archuleta, Ruby    Box 108             Milagro       NM 87429   USA   
  6 1006 Misiewicz, Jeremy  932 Webster St.     Madison       WI 53704   USA   
  7 1007 Ahmadi, Hafez      52 Rue Marston      Paris            75019   France
  8 1008 Jacobson, Becky    1 Lincoln St.       Tallahassee   FL 32312   USA   
  9 1009 An, Ing            2540 Pleasant St.   Calgary       AB T2P 4H2 Canada
 10 1010 Slater, Emily      1009 Cherry St.     York          PA 17407   USA   
 11 1011 Mitchell, Wayne    28 Morningside Dr.  New York      NY 10017   USA   
 12 1012 Stavros, Gloria    212 Northampton Rd. South Hadley  MA 01075   USA

The data for subscriber 1002, who has two update transactions, is used below to show what happens when you update an observation in the master data set with corresponding observations from the transaction data set.

Before executing the DATA step, SAS reads the descriptor portion of each data set named in the UPDATE statement and, by default, creates a program data vector that contains all the variables from all data sets. As the following figure illustrates, SAS sets the value of each variable to missing. (Use the DROP= or KEEP= data set option to exclude one or more variables.)

Program Data Vector before Execution of the DATA Step
Next, SAS reads the first observation from the master data set and copies it into the program data vector, as the following figure illustrates.

Program Data Vector after Reading the First Observation from the Master Data Set
SAS applies the first transaction by copying all nonmissing values (the value of Name) from the first observation in this BY group (ID=1002) into the program data vector, as the following figure illustrates.

Program Data Vector after Applying the First Transaction
After completing this transaction, SAS looks for another observation in the same BY group in the transaction data set. If it finds a second observation with the same value for ID, then it applies the second transaction too (new values for StreetAddress, City, StateProv, PostalCode, and Country). Now the observation contains the new values from both transactions, as the following figure illustrates.

Program Data Vector after Applying the Second Transaction
After completing the second transaction, SAS looks for a third observation in the same BY group. Because no such observation exists, it writes the observation in its current form to the new data set and sets the values in the program data vector to missing.

As the DATA step iterates, the UPDATE statement continues processing observations in this way until it reaches the end of the master and transaction data sets. The two observations in the transaction data set that describe new subscribers (and therefore have no corresponding observation in the master data set) become observations in the new data set.

Remember that if there are duplicate observations in the master data set, all matching observations in the transaction data set are applied only to the first of the duplicate observations in the master data set.

Top of Page