Introduction to DATA Step Processing

How the DATA Step Works: A Basic Introduction


Overview of the DATA Step

The DATA step consists of a group of SAS statements that begins with a DATA statement. The DATA statement begins the process of building a SAS data set and names the data set. The statements that make up the DATA step are compiled, and the syntax is checked. If the syntax is correct, then the statements are executed. In its simplest form, the DATA step is a loop with an automatic output and return action. The following figure illustrates the flow of action in a typical DATA step.

[Figure: Flow of Action in a Typical DATA Step]


During the Compile Phase

When you submit a DATA step for execution, SAS checks the syntax of the SAS statements and compiles them, that is, automatically translates the statements into machine code. SAS further processes the code, and creates the following three items:

input buffer

is a logical area in memory into which SAS reads each record of data from a raw data file when the program executes. (When SAS reads from a SAS data set, however, the data is written directly to the program data vector.)

program data vector

is a logical area of memory where SAS builds a data set, one observation at a time. When a program executes, SAS reads data values from the input buffer or creates them by executing SAS language statements. SAS assigns the values to the appropriate variables in the program data vector. From here, SAS writes the values to a SAS data set as a single observation.

The program data vector also contains two automatic variables, _N_ and _ERROR_. The _N_ variable counts the number of times the DATA step begins to iterate. The _ERROR_ variable signals the occurrence of an error caused by the data during execution. These automatic variables are not written to the output data set.

descriptor information

is information about each SAS data set, including data set attributes and variable attributes. SAS creates and maintains the descriptor information.
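The automatic variables and the descriptor information can be inspected directly. The following is a minimal sketch, assuming that the SASHELP.CLASS sample data set is available in your installation (it is not part of this section's example). The first step writes _N_ and _ERROR_ to the SAS log for three observations; the second step displays the descriptor information.

```sas
/* Assumption: SASHELP.CLASS is a sample data set in your installation. */
/* DATA _NULL_ runs the step without creating an output data set.       */
data _null_;
   set sashelp.class(obs=3);      /* read only the first three observations */
   put _N_= _ERROR_= Name=;       /* write the values to the SAS log        */
run;

/* Display the data set attributes and the variable attributes. */
proc contents data=sashelp.class;
run;
```

Because _N_ and _ERROR_ exist only in the program data vector, they do not appear in the variable list that PROC CONTENTS reports.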


During the Execution Phase

All executable statements in the DATA step are executed once for each iteration. If your input file contains raw data, then SAS reads a record into the input buffer. SAS then reads the values in the input buffer and assigns the values to the appropriate variables in the program data vector. SAS also calculates values for variables created by program statements, and writes these values to the program data vector. When the program reaches the end of the DATA step, three actions occur by default that make using the SAS language different from using most other programming languages:

  1. SAS writes the current observation from the program data vector to the data set.

  2. The program loops back to the top of the DATA step.

  3. Variables in the program data vector are reset to missing values.

    Note:   The following exceptions apply:

    • Variables that you specify in a RETAIN statement are not reset to missing values.

    • The automatic variables _N_ and _ERROR_ are not reset to missing.

    For information about the RETAIN statement, see Using a Value in a Later Observation.

If there is another record to read, then the program executes again. SAS builds the second observation, and continues until there are no more records to read. The data set is then closed, and SAS goes on to the next DATA or PROC step.
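The reset-to-missing behavior and the RETAIN exception can be seen in a small experiment. The following is a sketch only; the data set name RESET_DEMO and the variables x, plain_total, and kept_total are hypothetical names chosen for illustration.

```sas
data reset_demo;
   input x;
   retain kept_total 0;                  /* retained: keeps its value across iterations  */
   /* Show the values at the top of each iteration, before they are reassigned. */
   put 'Top of iteration: ' _N_= plain_total= kept_total=;
   plain_total = sum(plain_total, x);    /* plain_total is reset to missing each time    */
   kept_total  = kept_total + x;         /* kept_total accumulates a running total       */
   datalines;
1
2
3
;
```

In the log, plain_total is missing at the top of every iteration, so the stored values are 1, 2, and 3. The retained variable kept_total carries forward and ends up as 1, 3, and 6.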


Example of a DATA Step


The DATA Step

The following simple DATA step produces a SAS data set from the data collected for a health and fitness club. As discussed earlier, the input data contains each participant's identification number, name, team name, and weight at the beginning and end of a 16-week weight program:

data weight_club;                                                /* [1] */
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;  /* [2] */
   Loss = StartWeight - EndWeight;                               /* [3] */

   /* [4] */
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red    210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red    127 118
1221 Jim Brown          yellow 220   .
1095 Susan Stewart      blue   135 127
1157 Rosa Gomez         green  155 141
1331 Jason Schock       blue   187 172
1067 Kanoko Nagasaka    green  135 122
1251 Richard Rose       blue   181 166
1333 Li-Hwa Lee         green  141 129
1192 Charlene Armstrong yellow 152 139
1352 Bette Long         green  156 137
1262 Yao Chen           blue   196 180
1087 Kim Sikorski       red    148 135
1124 Adrienne Fink      green  156 142
1197 Lynne Overby       red    138 125
1133 John VanMeter      blue   180 167
1036 Becky Redding      green  135 123
1057 Margie Vanhoy      yellow 146 132
1328 Hisashi Ito        red    155 142
1243 Deanna Hicks       blue   134 122
1177 Holly Choate       red    141 130
1259 Raoul Sanchez      green  189 172
1017 Jennifer Brooks    blue   138 127
1099 Asha Garg          yellow 148 132
1329 Larry Goss         yellow 188 174
;


The Statements

The following list corresponds to the numbered items in the preceding program:

[1] The DATA statement begins the DATA step and names the data set that is being created.

[2] The INPUT statement creates five variables, indicates how SAS reads the values from the input buffer, and assigns the values to variables in the program data vector.

[3] The assignment statement creates an additional variable called Loss, calculates the value of Loss during each iteration of the DATA step, and writes the value to the program data vector.

[4] The DATALINES statement marks the beginning of the input data. The single semicolon marks the end of the input data and the DATA step.

Note:   A DATA step that does not contain a DATALINES statement must end with a RUN statement.
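As an illustration of the note above, the following sketch reads from an existing SAS data set instead of in-stream data, so it ends with a RUN statement. It assumes that WEIGHT_CLUB has already been created by the preceding program; the output data set name LOST_15_PLUS and the threshold of 15 are hypothetical.

```sas
/* Assumption: WEIGHT_CLUB exists (created by the preceding DATA step). */
/* LOST_15_PLUS is a hypothetical name used only for this sketch.       */
data lost_15_plus;
   set weight_club;     /* read observations from a SAS data set         */
   if Loss >= 15;       /* subsetting IF: keep losses of 15 or more      */
run;                    /* no DATALINES statement, so RUN ends the step  */
```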


The Process

When you submit a DATA step for execution, SAS automatically compiles the DATA step and then executes it. At compile time, SAS creates the input buffer, program data vector, and descriptor information for the data set WEIGHT_CLUB. As the following figure shows, the program data vector contains the variables that are named in the INPUT statement, as well as the variable Loss. The values of the _N_ and the _ERROR_ variables are automatically generated for every DATA step. The _N_ automatic variable represents the number of times that the DATA step has begun to iterate, so its value is 1 during the first iteration. The _ERROR_ automatic variable acts like a binary switch whose value is 0 if no errors exist in the DATA step, or 1 if one or more errors exist. These automatic variables are not written to the output data set.

All variable values, except _N_ and _ERROR_, are initially set to missing. Note that missing numeric values are represented by a period, and missing character values are represented by a blank.

[Figure: Variable Values Initially Set to Missing]

The syntax is correct, so the DATA step executes. As the following figure illustrates, the INPUT statement causes SAS to read the first record of raw data into the input buffer. Then, according to the instructions in the INPUT statement, SAS reads the data values in the input buffer and assigns them to variables in the program data vector.

[Figure: Values Assigned to Variables by the INPUT Statement]

After SAS assigns values to all variables that are listed in the INPUT statement, SAS executes the next statement in the program:

Loss = StartWeight - EndWeight;

This assignment statement calculates the value for the variable Loss and writes that value to the program data vector, as the following figure shows.

[Figure: Value Computed and Assigned to the Variable Loss]

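SAS has now reached the end of the DATA step, and the program automatically does the following:

  • writes the first observation to the data set

  • loops back to the top of the DATA step to begin the next iteration

  • increments the _N_ automatic variable by 1

  • resets the _ERROR_ automatic variable to 0

  • sets the values of all other variables in the program data vector to missing, as the following figure shows

[Figure: Values Set to Missing]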

Execution continues. The INPUT statement looks for another record to read. If there are no more records, then SAS closes the data set and the system goes on to the next DATA or PROC step. In this example, however, more records exist and the INPUT statement reads the second record into the input buffer, as the following figure shows.

[Figure: Second Record Is Read into the Input Buffer]

The following figure shows that SAS assigned values to the variables in the program data vector and calculated the value for the variable Loss, building the second observation just as it did the first one.

[Figure: Results of Second Iteration of the DATA Step]

This entire process continues until SAS detects the end of the file. The DATA step iterates as many times as there are records to read. Then SAS closes the data set WEIGHT_CLUB, and SAS looks for the beginning of the next DATA or PROC step.
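The walkthrough above can be reproduced in the SAS log. The following sketch is a cut-down variant of the example program, not the original: it uses DATA _NULL_ so that no data set is created, reads only the first two records, and adds PUT _ALL_ statements to print the program data vector before and after each record is processed.

```sas
data _null_;
   put 'Before INPUT: ' _all_;     /* contents of the program data vector at the top */
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   Loss = StartWeight - EndWeight;
   put 'After Loss:   ' _all_;     /* contents after the record has been processed   */
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
;
```

Each "Before INPUT" line shows the variables set to missing, and each "After Loss" line shows the completed observation. The "Before INPUT" line appears a third time with _N_=3, because the DATA step begins a third iteration before the INPUT statement detects the end of the data and stops the step.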

Now that SAS has transformed the collected raw data into a SAS data set, the data set can be processed by a SAS procedure. The following output, produced with the PRINT procedure, shows the data set that has just been created.

proc print data=weight_club;
   title 'Fitness Center Weight Club';
run;

PROC PRINT Output of the WEIGHT_CLUB Data Set

                           Fitness Center Weight Club                          1

            Id                                       Start      End
   Obs    Number    Name                  Team      Weight    Weight    Loss

     1     1023     David Shaw            red         189       165      24 
     2     1049     Amelia Serrano        yellow      145       124      21 
     3     1219     Alan Nance            red         210       192      18 
     4     1246     Ravi Sinha            yellow      194       177      17 
     5     1078     Ashley McKnight       red         127       118       9 
     6     1221     Jim Brown             yellow      220         .       . 
     7     1095     Susan Stewart         blue        135       127       8 
     8     1157     Rosa Gomez            green       155       141      14 
     9     1331     Jason Schock          blue        187       172      15 
    10     1067     Kanoko Nagasaka       green       135       122      13 
    11     1251     Richard Rose          blue        181       166      15 
    12     1333     Li-Hwa Lee            green       141       129      12 
    13     1192     Charlene Armstrong    yellow      152       139      13 
    14     1352     Bette Long            green       156       137      19 
    15     1262     Yao Chen              blue        196       180      16 
    16     1087     Kim Sikorski          red         148       135      13 
    17     1124     Adrienne Fink         green       156       142      14 
    18     1197     Lynne Overby          red         138       125      13 
    19     1133     John VanMeter         blue        180       167      13 
    20     1036     Becky Redding         green       135       123      12 
    21     1057     Margie Vanhoy         yellow      146       132      14 
    22     1328     Hisashi Ito           red         155       142      13 
    23     1243     Deanna Hicks          blue        134       122      12 
    24     1177     Holly Choate          red         141       130      11 
    25     1259     Raoul Sanchez         green       189       172      17 
    26     1017     Jennifer Brooks       blue        138       127      11 
    27     1099     Asha Garg             yellow      148       132      16 
    28     1329     Larry Goss            yellow      188       174      14 
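Observation 6 in the output above has a missing value for Loss because EndWeight is missing for that participant, and arithmetic on a missing value produces a missing result. The following minimal sketch isolates that calculation with the values from observation 6; it uses DATA _NULL_ so that no data set is created.

```sas
data _null_;
   StartWeight = 220;
   EndWeight   = .;                   /* missing numeric value                     */
   Loss = StartWeight - EndWeight;    /* arithmetic on a missing value is missing  */
   put Loss=;                         /* writes Loss=. to the SAS log              */
run;
```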
