Introduction to DATA Step Processing
Overview of the DATA Step
The DATA step consists of a group of SAS statements that begins with a DATA statement. The DATA statement begins the process of building a SAS data set and names the data set. The statements that make up the DATA step are compiled, and the syntax is checked. If the syntax is correct, then the statements are executed. In its simplest form, the DATA step is a loop with an automatic output and return action. The following figure illustrates the flow of action in a typical DATA step.
Flow of Action in a Typical DATA Step
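The automatic output and return actions can be made visible by writing them explicitly. The following is a minimal sketch and not part of the original example: the data set name EXPLICIT_ACTIONS and the variables X and Y are hypothetical. Because the step contains an OUTPUT statement, SAS no longer writes the observation automatically at the end of the step, so the explicit OUTPUT and RETURN statements reproduce what the default behavior would otherwise do.

```sas
data explicit_actions;   /* hypothetical data set name */
   input X;
   Y = X * 2;
   output;               /* write the current observation; replaces the automatic output */
   return;               /* go back to the top of the DATA step for the next iteration */
datalines;
1
2
3
;
```

Removing the OUTPUT and RETURN statements should leave the resulting data set unchanged, because SAS then performs both actions by default.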
During the Compile Phase
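When you submit a DATA step for execution, SAS checks the syntax of the SAS statements and compiles them; that is, it automatically translates the statements into machine code. SAS further processes the code and creates the following three items:
input buffer: a logical area in memory into which SAS reads each record of raw data
program data vector: a logical area in memory where SAS builds the data set, one observation at a time
descriptor information: information that SAS creates and maintains about each data set, including data set attributes and variable attributes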
During the Execution Phase
All executable statements in the DATA step are executed once for each iteration. If your input file contains raw data, then SAS reads a record into the input buffer. SAS then reads the values in the input buffer and assigns the values to the appropriate variables in the program data vector. SAS also calculates values for variables created by program statements, and writes these values to the program data vector. When the program reaches the end of the DATA step, three actions occur by default that make using the SAS language different from using most other programming languages:
SAS writes the current observation from the program data vector to the data set.
SAS returns to the top of the DATA step.
Variables in the program data vector are reset to missing values.
Note: The following exceptions apply:
Variables that you specify in a RETAIN statement are not reset to missing values.
The automatic variables _N_ and _ERROR_ are not reset to missing.
If there is another record to read, then the program executes again. SAS builds the second observation, and continues until there are no more records to read. The data set is then closed, and SAS goes on to the next DATA or PROC step.
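The RETAIN exception noted above can be seen in a short sketch. The data set name RETAIN_DEMO and the variables Amount, KeptFirst, and LostFirst are hypothetical and are not part of the original example. Both KeptFirst and LostFirst are assigned only during the first iteration; only KeptFirst is named in a RETAIN statement.

```sas
data retain_demo;                 /* hypothetical data set name */
   retain KeptFirst;              /* exempt from the automatic reset to missing */
   input Amount;
   if _N_ = 1 then do;
      KeptFirst = Amount;         /* keeps this value in later iterations */
      LostFirst = Amount;         /* not retained: reset to missing at the top of each iteration */
   end;
datalines;
10
20
30
;

proc print data=retain_demo;
run;
```

In the printed data set, KeptFirst should be 10 in all three observations, while LostFirst should be 10 in the first observation and missing in the second and third.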
Example of a DATA Step
The following simple DATA step produces a SAS data set from the data collected for a health and fitness club. As discussed earlier, the input data contains each participant's identification number, name, team name, and weight at the beginning and end of a 16-week weight program:
data weight_club;                                                 /* 1 */
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;   /* 2 */
   Loss = StartWeight - EndWeight;                                /* 3 */
   /* 4 */
datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
1219 Alan Nance          red    210 192
1246 Ravi Sinha          yellow 194 177
1078 Ashley McKnight     red    127 118
1221 Jim Brown           yellow 220 .
1095 Susan Stewart       blue   135 127
1157 Rosa Gomez          green  155 141
1331 Jason Schock        blue   187 172
1067 Kanoko Nagasaka     green  135 122
1251 Richard Rose        blue   181 166
1333 Li-Hwa Lee          green  141 129
1192 Charlene Armstrong  yellow 152 139
1352 Bette Long          green  156 137
1262 Yao Chen            blue   196 180
1087 Kim Sikorski        red    148 135
1124 Adrienne Fink       green  156 142
1197 Lynne Overby        red    138 125
1133 John VanMeter       blue   180 167
1036 Becky Redding       green  135 123
1057 Margie Vanhoy       yellow 146 132
1328 Hisashi Ito         red    155 142
1243 Deanna Hicks        blue   134 122
1177 Holly Choate        red    141 130
1259 Raoul Sanchez       green  189 172
1017 Jennifer Brooks     blue   138 127
1099 Asha Garg           yellow 148 132
1329 Larry Goss          yellow 188 174
;
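The following list corresponds to the numbered items in the preceding program:
1 The DATA statement begins the DATA step and names the data set that is being created, WEIGHT_CLUB.
2 The INPUT statement creates five variables, tells SAS how to read their values from the input buffer, and assigns the values to variables in the program data vector.
3 The assignment statement calculates the weight that each participant lost and assigns the result to a new variable, Loss.
4 The DATALINES statement marks the beginning of the input data. The single semicolon after the last data line marks the end of the input data and of the DATA step.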
When you submit a DATA step for execution, SAS automatically compiles the DATA step and then executes it. At compile time, SAS creates the input buffer, program data vector, and descriptor information for the data set WEIGHT_CLUB. As the following figure shows, the program data vector contains the variables that are named in the INPUT statement, as well as the variable Loss. The values of the _N_ and the _ERROR_ variables are automatically generated for every DATA step. The _N_ automatic variable represents the number of times that the DATA step has iterated. The _ERROR_ automatic variable acts like a binary switch whose value is 0 if no errors exist in the DATA step, or 1 if one or more errors exist. These automatic variables are not written to the output data set.
All variable values, except _N_ and _ERROR_, are initially set to missing. Note that missing numeric values are represented by a period, and missing character values are represented by a blank.
Variable Values Initially Set to Missing
The syntax is correct, so the DATA step executes. As the following figure illustrates, the INPUT statement causes SAS to read the first record of raw data into the input buffer. Then, according to the instructions in the INPUT statement, SAS reads the data values in the input buffer and assigns them to variables in the program data vector.
Values Assigned to Variables by the INPUT Statement
When SAS assigns values to all variables that are listed in the INPUT statement, SAS executes the next statement in the program:
Loss = StartWeight - EndWeight;
This assignment statement calculates the value for the variable Loss and writes that value to the program data vector, as the following figure shows.
Value Computed and Assigned to the Variable Loss
SAS has now reached the end of the DATA step, and the program automatically does the following:
writes the first observation to the data set
loops back to the top of the DATA step to begin the next iteration
increments the _N_ automatic variable by 1
resets the _ERROR_ automatic variable to 0
except for _N_ and _ERROR_, sets variable values in the program data vector to missing values, as the following figure shows
Values Set to Missing
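The behavior of _N_ and _ERROR_ across iterations can be observed by copying them into ordinary variables, because the automatic variables themselves are not written to the output data set. The following is a sketch under stated assumptions: the data set name COUNTER_DEMO and the variables Amount, Iteration, and ErrorFlag are hypothetical, and the second data line deliberately contains a nonnumeric value to trigger an input error.

```sas
data counter_demo;          /* hypothetical data set name */
   input Amount;
   Iteration = _N_;         /* copy of _N_; the automatic variable itself is not written */
   ErrorFlag = _ERROR_;     /* copy of _ERROR_ as it stands after the INPUT statement */
datalines;
10
abc
30
;

proc print data=counter_demo;
run;
```

Iteration should read 1, 2, and 3. ErrorFlag should be 1 only in the second observation, where Amount is set to missing because of the invalid data, and 0 again in the third observation, which shows that _ERROR_ is reset at the start of each iteration.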
Execution continues. The INPUT statement looks for another record to read. If there are no more records, then SAS closes the data set and the system goes on to the next DATA or PROC step. In this example, however, more records exist and the INPUT statement reads the second record into the input buffer, as the following figure shows.
Second Record Is Read into the Input Buffer
The following figure shows that SAS assigned values to the variables in the program data vector and calculated the value for the variable Loss, building the second observation just as it did the first one.
Results of Second Iteration of the DATA Step
This entire process continues until SAS detects the end of the file. The DATA step iterates as many times as there are records to read. Then SAS closes the data set WEIGHT_CLUB, and SAS looks for the beginning of the next DATA or PROC step.
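The iterations described above can be traced in the SAS log with PUT _ALL_, which writes every variable in the program data vector, including _N_ and _ERROR_. The following is a minimal sketch rather than part of the original example. It uses only the first two records of the input data, and it uses the data set name _NULL_ so that the step runs without creating a data set.

```sas
data _null_;                /* run the step without creating a data set */
   put 'Top of iteration:';
   put _all_;               /* values before the INPUT statement executes */
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   Loss = StartWeight - EndWeight;
   put 'End of iteration:';
   put _all_;               /* values after INPUT and the assignment statement */
datalines;
1023 David Shaw          red    189 165
1049 Amelia Serrano      yellow 145 124
;
```

With two records, the "Top of iteration:" lines should appear three times and the "End of iteration:" lines twice. The third iteration begins, but the step ends when the INPUT statement finds no more records to read.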
Now that SAS has transformed the collected raw data into a SAS data set, the data set can be processed by a SAS procedure. The following output, produced with the PRINT procedure, shows the data set that has just been created.
proc print data=weight_club;
   title 'Fitness Center Weight Club';
run;
PROC PRINT Output of the WEIGHT_CLUB Data Set
                    Fitness Center Weight Club                    1

        Id                                    Start     End
Obs   Number   Name                 Team     Weight   Weight   Loss

  1    1023    David Shaw           red         189      165     24
  2    1049    Amelia Serrano       yellow      145      124     21
  3    1219    Alan Nance           red         210      192     18
  4    1246    Ravi Sinha           yellow      194      177     17
  5    1078    Ashley McKnight      red         127      118      9
  6    1221    Jim Brown            yellow      220        .      .
  7    1095    Susan Stewart        blue        135      127      8
  8    1157    Rosa Gomez           green       155      141     14
  9    1331    Jason Schock         blue        187      172     15
 10    1067    Kanoko Nagasaka      green       135      122     13
 11    1251    Richard Rose         blue        181      166     15
 12    1333    Li-Hwa Lee           green       141      129     12
 13    1192    Charlene Armstrong   yellow      152      139     13
 14    1352    Bette Long           green       156      137     19
 15    1262    Yao Chen             blue        196      180     16
 16    1087    Kim Sikorski         red         148      135     13
 17    1124    Adrienne Fink        green       156      142     14
 18    1197    Lynne Overby         red         138      125     13
 19    1133    John VanMeter        blue        180      167     13
 20    1036    Becky Redding        green       135      123     12
 21    1057    Margie Vanhoy        yellow      146      132     14
 22    1328    Hisashi Ito          red         155      142     13
 23    1243    Deanna Hicks         blue        134      122     12
 24    1177    Holly Choate         red         141      130     11
 25    1259    Raoul Sanchez        green       189      172     17
 26    1017    Jennifer Brooks      blue        138      127     11
 27    1099    Asha Garg            yellow      148      132     16
 28    1329    Larry Goss           yellow      188      174     14
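The descriptor information that SAS created for WEIGHT_CLUB at compile time can also be inspected. As a brief sketch, assuming the DATA step above has already run in the current session, the CONTENTS procedure reports the data set's attributes along with the name, type, and length of each variable:

```sas
proc contents data=weight_club;   /* show descriptor information for WEIGHT_CLUB */
run;
```

The variable list should include IdNumber, Name, Team, StartWeight, EndWeight, and Loss, but not _N_ or _ERROR_, because the automatic variables are not written to the data set.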
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.