Understanding DATA Step Processing
Understanding the Assignment Statement
One of the most common reasons for using program statements in the DATA step is to produce new information from the original information or to change the information read by the INPUT or SET/MERGE/MODIFY/UPDATE statement. How do you add information to observations with a DATA step?
The basic method of adding information to a SAS data set is to create a new variable in a DATA step with an assignment statement. An assignment statement has the form:
variable=expression;
The variable receives the new information; the expression creates the new information. You specify the calculation necessary to produce the information and write the calculation as the expression. When the expression contains character data, you must enclose the data in quotation marks. SAS evaluates the expression and stores the result in the variable that you name. Remember that even if you assign a value in only one or two observations out of many, SAS creates the variable in every observation. Every observation in a SAS data set has a value for every variable; where no value is assigned, SAS stores a missing value.
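As a minimal sketch of the general form, the following DATA step contains one numeric and one character assignment statement. The data set name DEMO, the variable names, and the inline data are hypothetical and are used only for illustration.

```sas
/* Hypothetical example: data set, variables, and values are illustrative only. */
data demo;
   input Nights LandCost;
   CostPerNight = LandCost / Nights;   /* numeric expression               */
   Category = 'Standard';              /* character data in quotation marks */
   datalines;
8 575
10 510
;
run;

proc print data=demo;
run;
```

Both CostPerNight and Category are created in every observation of DEMO, not only in the observations where you might consider them relevant.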
Making Uniform Changes to Data by Creating a Variable
Sometimes you want to make a particular change to every observation. For example, at Tradewinds Travel the airfare must be increased for every tour by $10 because of a new tax. One way to do this is to write an assignment statement that creates a new variable that calculates the new airfare:
NewAirCost = AirCost+10;
This statement directs SAS to read the value of AirCost, add 10 to it, and assign the result to the new variable, NewAirCost.
When this assignment statement is included in a DATA step, the DATA step looks like this:
options pagesize=60 linesize=80 pageno=1 nodate;

data newair;
   set mylib.internationaltours;
   NewAirCost = AirCost + 10;

proc print data=newair;
   var Country AirCost NewAirCost;
   title 'Increasing the Air Fare by $10 for All Tours';
run;
Note: In this example, the VAR statement in the PROC PRINT step determines which variables are displayed in the output.
The following output shows the resulting SAS data set, NEWAIR:
Adding Information to All Observations by Using a New Variable
           Increasing the Air Fare by $10 for All Tours            1

                                       New
                              Air      Air
             Obs    Country   Cost     Cost

              1     France     793      803
              2     Spain      805      815
              3     India        .        .
              4     Peru       722      732
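Every observation in the SAS data set has a value for every variable. Because AirCost is missing for India, the result of AirCost + 10 is also missing, so NewAirCost is missing in that observation.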
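The following self-contained sketch reproduces this behavior without the MYLIB library. The data set TOURS is a hypothetical stand-in for mylib.internationaltours that contains only the two variables and four values shown in the output above.

```sas
/* TOURS is a hypothetical stand-in for mylib.internationaltours. */
data tours;
   input Country $ AirCost;
   datalines;
France 793
Spain 805
India .
Peru 722
;
run;

data newair;
   set tours;
   /* When AirCost is missing, the result of the addition is missing too. */
   NewAirCost = AirCost + 10;
run;

proc print data=newair;
   var Country AirCost NewAirCost;
run;
```

When an arithmetic operation is performed on a missing value, SAS also writes a note to the log stating that missing values were generated, which is a useful check after running a step like this one.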
Adding Information to Some Observations but Not Others
Often you need to add information to some observations but not to others. For example, some tour operators award bonus points to travel agencies for scheduling particular tours. Two companies, Hispania and Mundial, are offering bonus points this year.
IF-THEN/ELSE statements can cause assignment statements to be carried out only when a condition is met. In the following DATA step, the IF statements check the value of the variable Vendor. If the value is either Hispania or Mundial, information about the bonus points is added to those observations.
options pagesize=60 linesize=80 pageno=1 nodate;

data bonus;
   set mylib.internationaltours;
   if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
   else if Vendor = 'Mundial' then BonusPoints = 'Yes';
run;

proc print data=bonus;
   var Country Vendor BonusPoints;
   title1 'Adding Information to Observations for';
   title2 'Vendors Who Award Bonus Points';
run;
The following output displays the results:
Specifying Values for Specific Observations by Using a New Variable
              Adding Information to Observations for               1
                  Vendors Who Award Bonus Points

          Obs    Country    Vendor      BonusPoints

           1     France     Major
           2     Spain      Hispania    For 10+ people
           3     India      Royal
           4     Peru       Mundial     Yes
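The new variable BonusPoints has a value only in the observations for Hispania and Mundial. In the other observations (France and India), neither condition is met, so SAS assigns a missing value, which is displayed as a blank for a character variable. Because the first assignment to BonusPoints in the DATA step uses the 14-character value 'For 10+ people', SAS sets aside 14 bytes for BonusPoints in every observation, regardless of the length of the value stored there.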
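A character variable that is created by an assignment statement takes its length from the first value assigned to it in the program. If the shorter value 'Yes' were assigned first, BonusPoints would have a length of 3 and the longer value would be truncated. The sketch below, which is not part of the original example, shows one way to guard against that by declaring the length with a LENGTH statement. The data set TOURS is a hypothetical stand-in for mylib.internationaltours, and the order of the IF statements is reversed deliberately.

```sas
/* TOURS is a hypothetical stand-in for mylib.internationaltours. */
data tours;
   input Country $ Vendor $;
   datalines;
France Major
Spain Hispania
India Royal
Peru Mundial
;
run;

data bonus;
   set tours;
   /* Declare the length before the first assignment so that   */
   /* the shorter value 'Yes' does not fix the length at 3.     */
   length BonusPoints $ 14;
   if Vendor = 'Mundial' then BonusPoints = 'Yes';
   else if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
run;

proc print data=bonus;
   var Country Vendor BonusPoints;
run;
```

Declaring the length explicitly makes the result independent of the order in which the assignment statements appear.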
Making Uniform Changes to Data Without Creating Variables
Sometimes you want to change the value of existing variables without adding new variables. For example, in one DATA step a new variable, NewAirCost, was created to contain the value of the airfare plus the new $10 tax:
NewAirCost = AirCost + 10;
You can also decide to change the value of an existing variable rather than create a new variable. Following the example, AirCost is changed as follows:
AirCost = AirCost + 10;
SAS processes this statement just as it does other assignment statements. It evaluates the expression on the right side of the equal sign and assigns the result to the variable on the left side of the equal sign. The fact that the same variable appears on the right and left sides of the equal sign does not matter. SAS evaluates the expression on the right side of the equal sign before looking at the variable on the left side.
The following program contains the new assignment statement:
options pagesize=60 linesize=80 pageno=1 nodate;

data newair2;
   set mylib.internationaltours;
   AirCost = AirCost + 10;

proc print data=newair2;
   var Country AirCost;
   title 'Adding Tax to the Air Cost Without Adding a New Variable';
run;
The following output displays the results:
Changing the Information in a Variable
     Adding Tax to the Air Cost Without Adding a New Variable      1

                                    Air
                  Obs    Country    Cost

                   1     France      803
                   2     Spain       815
                   3     India         .
                   4     Peru        732
When you change the kind of information that a variable contains, you change the meaning of that variable. In this case, you are changing the meaning of AirCost from airfare without tax to airfare with tax. If you remember the current meaning and if you know that you do not need the original information, then changing a variable's values is useful. However, for many programmers, having separate variables is easier than recalling one variable whose definition changes.
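If you do overwrite a variable, one possible way to keep its changed meaning visible is to attach a descriptive label. The following sketch is an illustration rather than part of the original example: the LABEL statement and the label text are additions, and TOURS is a hypothetical stand-in for mylib.internationaltours.

```sas
/* TOURS is a hypothetical stand-in for mylib.internationaltours. */
data tours;
   input Country $ AirCost;
   datalines;
France 793
Spain 805
India .
Peru 722
;
run;

data newair2;
   set tours;
   AirCost = AirCost + 10;
   /* Record the new meaning of AirCost in the data set itself. */
   label AirCost = 'Airfare including $10 tax';
run;

/* The LABEL option tells PROC PRINT to use labels as column headings. */
proc print data=newair2 label;
   var Country AirCost;
run;
```

The label is stored with the data set, so later programs and listings can show what the values now represent.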
Using Variables Efficiently
Variables that contain information that applies to only one or two observations use more storage space than necessary. When possible, create fewer variables that apply to more observations in the data set, and allow the different values in different observations to supply the information.
For example, the Major company offers discounts, not bonus points, for groups of 30 or more people. An inefficient program would create separate variables for bonus points and discounts, as follows:
/* inefficient use of variables */
options pagesize=60 linesize=80 pageno=1 nodate;

data tourinfo;
   set mylib.internationaltours;
   if Vendor = 'Hispania' then BonusPoints = 'For 10+ people';
   else if Vendor = 'Mundial' then BonusPoints = 'Yes';
   else if Vendor = 'Major' then Discount = 'For 30+ people';
run;

proc print data=tourinfo;
   var Country Vendor BonusPoints Discount;
   title 'Information About Vendors';
run;
The following output displays the results:
Inefficient: Using Variables That Scatter Information Across Multiple Variables
                     Information About Vendors                     1

    Obs    Country    Vendor      BonusPoints       Discount

     1     France     Major                         For 30+ people
     2     Spain      Hispania    For 10+ people
     3     India      Royal
     4     Peru       Mundial     Yes
As you can see, storage space is used inefficiently. Both BonusPoints and Discount have a significant number of missing values.
With a little planning, you can make the SAS data set much more efficient. In the following DATA step, the variable Remarks contains information about bonus points, discounts, and any other special features of any tour.
/* efficient use of variables */
options pagesize=60 linesize=80 pageno=1 nodate;

data newinfo;
   set mylib.internationaltours;
   if Vendor = 'Hispania' then Remarks = 'Bonus for 10+ people';
   else if Vendor = 'Mundial' then Remarks = 'Bonus points';
   else if Vendor = 'Major' then Remarks = 'Discount: 30+ people';
run;

proc print data=newinfo;
   var Country Vendor Remarks;
   title 'Information About Vendors';
run;
The following output displays a more efficient use of variables:
Efficient: Using Variables to Contain Maximum Information
                     Information About Vendors                     1

          Obs    Country    Vendor      Remarks

           1     France     Major       Discount: 30+ people
           2     Spain      Hispania    Bonus for 10+ people
           3     India      Royal
           4     Peru       Mundial     Bonus points
Remarks has fewer missing values and contains all the information that BonusPoints and Discount held in the inefficient example. Using one variable in place of two sparsely filled variables saves storage space in the SAS data set.
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.