Understanding DATA Step Processing

Adding Information to a SAS Data Set


Understanding the Assignment Statement

One of the most common reasons for using program statements in the DATA step is to produce new information from the original information or to change the information read by the INPUT or SET/MERGE/MODIFY/UPDATE statement. How do you add information to observations with a DATA step?

The basic method of adding information to a SAS data set is to create a new variable in a DATA step with an assignment statement. An assignment statement has the form:

variable=expression;

The variable receives the new information; the expression creates the new information. You specify the calculation necessary to produce the information and write the calculation as the expression. When the expression contains a character constant, you must enclose the constant in quotation marks. SAS evaluates the expression and stores the result in the variable that you name. Remember that even if you need to add the information to only one or two observations out of many, SAS creates the variable for all observations. Every observation in a SAS data set has a value for every variable; where no value is assigned, SAS stores a missing value.
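
As a minimal sketch of both kinds of expression, the following self-contained DATA step reads two observations from in-stream data and creates one numeric and one character variable with assignment statements. The data set name TOURS_DEMO, the variables Surcharge and Currency, and the 5% rate are hypothetical and are chosen only for illustration:

   /* illustration only: TOURS_DEMO, Surcharge, and Currency are hypothetical names */
data tours_demo;
   input Country $ AirCost;
   Surcharge = AirCost * 0.05;   /* numeric expression */
   Currency = 'USD';             /* character constant in quotation marks */
   datalines;
France 793
Spain 805
;
run;

proc print data=tours_demo;
run;

Both new variables appear in every observation of TOURS_DEMO, even though each assignment statement is written only once.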


Making Uniform Changes to Data by Creating a Variable

Sometimes you want to make a particular change to every observation. For example, at Tradewinds Travel the airfare for every tour must be increased by $10 because of a new tax. One way to do this is to write an assignment statement that creates a new variable to hold the new airfare:

NewAirCost = AirCost + 10;

This statement directs SAS to read the value of AirCost, add 10 to it, and assign the result to the new variable, NewAirCost.

When this assignment statement is included in a DATA step, the DATA step looks like this:

options pagesize=60 linesize=80 pageno=1 nodate;
data newair;
   set mylib.internationaltours;
   NewAirCost = AirCost + 10;
run;

proc print data=newair;
   var Country AirCost NewAirCost;
   title 'Increasing the Air Fare by $10 for All Tours';
run;

Note:   In this example, the VAR statement in the PROC PRINT step determines which variables are displayed in the output.

The following output shows the resulting SAS data set, NEWAIR:

Adding Information to All Observations by Using a New Variable

                  Increasing the Air Fare by $10 for All Tours                 1

                                                    New [1]
                                            Air     Air
                         Obs    Country    Cost    Cost

                          1     France      793     803
                          2     Spain       805     815
                          3     India         .       . [2]
                          4     Peru        722     732

Notice the following in this data set:

[1] Because SAS carries out each statement in the DATA step for every observation, NewAirCost is calculated during each iteration of the DATA step.

[2] The observation for India contains a missing value for AirCost; SAS therefore assigns a missing value to NewAirCost for that observation.

Every observation in the SAS data set has a value, which can be a missing value, for every variable.
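
To try the missing-value behavior without access to MYLIB.INTERNATIONALTOURS, you can use a sketch like the following. It rebuilds only the Country and AirCost values shown in the output above from in-stream data; the data set name AIRDEMO is hypothetical:

   /* illustration only: AIRDEMO is a hypothetical data set built from the values shown above */
data airdemo;
   input Country $ AirCost;
   NewAirCost = AirCost + 10;   /* a missing AirCost produces a missing NewAirCost */
   datalines;
France 793
Spain 805
India .
Peru 722
;
run;

proc print data=airdemo;
   var Country AirCost NewAirCost;
run;

When arithmetic is performed on a missing value, SAS also writes a note to the log reporting that missing values were generated, so the log is a good place to confirm how many observations were affected.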


Adding Information to Some Observations but Not Others

Often you need to add information to some observations but not to others. For example, some tour operators award bonus points to travel agencies for scheduling particular tours. Two companies, Hispania and Mundial, are offering bonus points this year.

IF-THEN/ELSE statements can cause assignment statements to be carried out only when a condition is met. In the following DATA step, the IF statements check the value of the variable Vendor. If the value is either Hispania or Mundial, information about the bonus points is added to those observations.

options pagesize=60 linesize=80 pageno=1 nodate;
data bonus;
   set mylib.internationaltours;
   if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
   else if Vendor = 'Mundial' then BonusPoints = 'Yes';
run;

proc print data=bonus;
   var Country Vendor BonusPoints;
   title1 'Adding Information to Observations for';
   title2 'Vendors Who Award Bonus Points';
run;

The following output displays the results:

Specifying Values for Specific Observations by Using a New Variable

                     Adding Information to Observations for                    1
                         Vendors Who Award Bonus Points

                  Obs    Country    Vendor      BonusPoints

                   1     France     Major            [1]
                   2     Spain      Hispania    For 10+ people [2]
                   3     India      Royal            [1]
                   4     Peru       Mundial     Yes

The new variable BonusPoints has the following information:

[1] In the two observations that are not assigned a value for BonusPoints, SAS assigns a missing value, represented by a blank in this case, to indicate the absence of a character value.

[2] The first assignment to BonusPoints that SAS encounters in the DATA step assigns a value that contains 14 characters; therefore, SAS sets aside 14 bytes of storage in each observation for BonusPoints, regardless of the length of the value for that observation.
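
Because the length of a character variable is fixed by the first statement that refers to it, the order of the assignment statements matters. If the shorter value 'Yes' were assigned first, BonusPoints would be 3 bytes long and the longer value would be truncated to 'For'. One way to avoid depending on statement order is to declare the length with a LENGTH statement. The following is a minimal sketch, assuming in-stream data that reproduces only the Vendor values shown above; the data set name BONUS_LEN is hypothetical:

   /* illustration only: BONUS_LEN is a hypothetical data set */
data bonus_len;
   length Vendor $ 8 BonusPoints $ 14;   /* declare lengths before any assignment */
   input Vendor $;
   /* the shorter value is assigned first, but the declared length still applies */
   if Vendor = 'Mundial' then BonusPoints = 'Yes';
   else if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
   datalines;
Major
Hispania
Royal
Mundial
;
run;

proc print data=bonus_len;
run;

With the LENGTH statement in place, the IF-THEN/ELSE statements can appear in any order without truncating the longer value.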


Making Uniform Changes to Data Without Creating Variables

Sometimes you want to change the value of existing variables without adding new variables. For example, in an earlier DATA step a new variable, NewAirCost, was created to contain the value of the airfare plus the new $10 tax:

NewAirCost = AirCost + 10;

Alternatively, you can change the value of an existing variable rather than create a new variable. Continuing the example, AirCost is changed as follows:

AirCost = AirCost + 10;

SAS processes this statement just as it does other assignment statements. It evaluates the expression on the right side of the equal sign and assigns the result to the variable on the left side of the equal sign. The fact that the same variable appears on the right and left sides of the equal sign does not matter. SAS evaluates the expression on the right side of the equal sign before looking at the variable on the left side.

The following program contains the new assignment statement:

options pagesize=60 linesize=80 pageno=1 nodate;
data newair2;
   set mylib.internationaltours;
   AirCost = AirCost + 10;
run;

proc print data=newair2;
   var Country AirCost;
   title 'Adding Tax to the Air Cost Without Adding a New Variable';
run;

The following output displays the results:

Changing the Information in a Variable

            Adding Tax to the Air Cost Without Adding a New Variable           1

                                                Air
                             Obs    Country    Cost

                              1     France      803
                              2     Spain       815
                              3     India         .
                              4     Peru        732

When you change the kind of information that a variable contains, you change the meaning of that variable. In this case, you are changing the meaning of AirCost from airfare without tax to airfare with tax. If you remember the current meaning and if you know that you do not need the original information, then changing a variable's values is useful. However, for many programmers, having separate variables is easier than recalling one variable whose definition changes.


Using Variables Efficiently

Variables that contain information that applies to only one or two observations use more storage space than necessary. When possible, create fewer variables that apply to more observations in the data set, and allow the different values in different observations to supply the information.

For example, the Major company offers discounts, not bonus points, for groups of 30 or more people. An inefficient program would create separate variables for bonus points and discounts, as follows:

   /* inefficient use of variables */
options pagesize=60 linesize=80 pageno=1 nodate;
data tourinfo;
   set mylib.internationaltours;
   if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
   else if Vendor = 'Mundial' then BonusPoints = 'Yes';
   else if Vendor = 'Major' then Discount = 'For 30+ people';
run;

proc print data=tourinfo;
   var Country Vendor BonusPoints Discount;
   title 'Information About Vendors';
run;

The following output displays the results:

Inefficient: Using Variables That Scatter Information Across Multiple Variables

                           Information About Vendors                           1

         Obs    Country    Vendor      BonusPoints          Discount

          1     France     Major                         For 30+ people
          2     Spain      Hispania    For 10+ people                  
          3     India      Royal                                       
          4     Peru       Mundial     Yes                             

As you can see, storage space is used inefficiently. BonusPoints is missing in two of the four observations, and Discount is missing in three.

With a little planning, you can make the SAS data set much more efficient. In the following DATA step, the variable Remarks contains information about bonus points, discounts, and any other special features of any tour.

   /* efficient use of variables */
options pagesize=60 linesize=80 pageno=1 nodate;
data newinfo;
   set mylib.internationaltours;
   if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
   else if Vendor = 'Mundial' then Remarks = 'Bonus points';
   else if Vendor = 'Major' then Remarks = 'Discount: 30+ people';
run; 
  
proc print data=newinfo;
   var Country Vendor Remarks;
   title 'Information About Vendors';
run;

The following output displays a more efficient use of variables:

Efficient: Using Variables to Contain Maximum Information

                           Information About Vendors                           1

               Obs    Country    Vendor      Remarks

                1     France     Major       Discount: 30+ people
                2     Spain      Hispania    Bonus for 10+ people
                3     India      Royal                           
                4     Peru       Mundial     Bonus points        

Remarks has a missing value in only one of the four observations and contains all the information that BonusPoints and Discount held in the inefficient example. Using variables efficiently can save storage space in your SAS data set.
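
One way to check the storage difference yourself is to compare the observation length that PROC CONTENTS reports for each layout. The following sketch is self-contained and assumes in-stream data that reproduces only the Country and Vendor values shown above; the data set names VENDORS, TOURINFO_DEMO, and NEWINFO_DEMO are hypothetical:

   /* illustration only: VENDORS, TOURINFO_DEMO, and NEWINFO_DEMO are hypothetical names */
data vendors;
   input Country $ Vendor $;
   datalines;
France Major
Spain Hispania
India Royal
Peru Mundial
;
run;

   /* two sparsely filled variables */
data tourinfo_demo;
   set vendors;
   if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
   else if Vendor = 'Mundial' then BonusPoints = 'Yes';
   else if Vendor = 'Major' then Discount = 'For 30+ people';
run;

   /* one variable that carries all the remarks */
data newinfo_demo;
   set vendors;
   if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
   else if Vendor = 'Mundial' then Remarks = 'Bonus points';
   else if Vendor = 'Major' then Remarks = 'Discount: 30+ people';
run;

proc contents data=tourinfo_demo;
run;

proc contents data=newinfo_demo;
run;

Compare the Observation Length values in the two PROC CONTENTS reports. Because each character variable reserves its full length in every observation, the single Remarks variable should need less space per observation than BonusPoints and Discount together.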
