Starting with Raw Data: Beyond the Basics
How the Data Records Are Structured
An earlier example (see Reading Character Data That Contains Embedded Blanks) shows data in which all the values for an observation are contained in a single record of raw data:
1023 David Shaw         red      189 165
This INPUT statement reads all the data values arranged across a single record:
input IdNumber 1-4 Name $ 6-23 Team $ StartWeight EndWeight;
Now, consider the opposite situation: when information for a single observation is not contained in a single record of raw data but is scattered across several records. For example, the health and fitness club data could be constructed in such a way that the information about a single member is spread across several records instead of in a single record:
1023 David Shaw
red
189 165
Method 1: Using Multiple Input Statements
Multiple INPUT statements, one for each record, can read each record into a single observation, as in this example:
   input IdNumber 1-4 Name $ 6-23;
   input Team $ 1-6;
   input StartWeight 1-3 EndWeight 5-7;
To understand how to use multiple INPUT statements, consider what happens as a DATA step executes. Remember that one record is read into the input buffer automatically as each INPUT statement is encountered during each iteration. SAS reads the data values from the input buffer and writes them to the program data vector as variable values. At the end of the DATA step, all the variable values in the program data vector are written automatically as a single observation.
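As a minimal sketch of this sequence, the three INPUT statements shown earlier can be wrapped in a complete DATA step. The data set name club_all is a hypothetical name chosen for illustration, and only the first two members are included to keep the listing short.

```sas
/* club_all is a hypothetical data set name used only for this sketch. */
data club_all;
   input IdNumber 1-4 Name $ 6-23;          /* reads record 1 of the group */
   input Team $ 1-6;                        /* reads record 2 of the group */
   input StartWeight 1-3 EndWeight 5-7;     /* reads record 3 of the group */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_all;
   title 'Weight Club Members';
run;
```

Each iteration of this DATA step consumes three records, one per INPUT statement, and writes one observation when the bottom of the step is reached.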
This example uses multiple INPUT statements in a DATA step to read only selected data fields and create a data set containing only the variables IdNumber, StartWeight, and EndWeight.
data club2;
   input IdNumber 1-4;                     /* 1 */
   input;                                  /* 2 */
   input StartWeight 1-3 EndWeight 5-7;    /* 3 */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;

proc print data=club2;
   title 'Weight Club Members';
run;
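The following list corresponds to the numbered items in the preceding program:
1. The first INPUT statement reads the first record into the input buffer and assigns a value only to IdNumber, from columns 1-4.
2. The second INPUT statement has no arguments (a null INPUT statement). It reads the second record into the input buffer but does not assign a value to any variable.
3. The third INPUT statement reads the third record into the input buffer and assigns values to StartWeight and EndWeight.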
The following output shows the resulting data set:
Data Set Created with Multiple INPUT Statements
                  Weight Club Members                  1

                  Id       Start     End
         Obs    Number    Weight    Weight

          1      1023       189       165
          2      1049       145       124
          3      1219       210       192
          4      1246       194       177
          5      1078       127       118
          6      1221       220         .
Method 2: Using the / Line-Pointer Control
Writing a separate INPUT statement for each record is not the only way to create a single observation. You can write a single INPUT statement and use the slash (/) line-pointer control. The slash line-pointer control forces a new record into the input buffer and positions the pointer at the beginning of that record.
This example uses only one INPUT statement to read multiple records:
data club2;
   input IdNumber 1-4 / / StartWeight 1-3 EndWeight 5-7;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;

proc print data=club2;
   title 'Weight Club Members';
run;
The / line-pointer control appears exactly where a new INPUT statement begins in the previous example (see Method 1: Using Multiple Input Statements). The sequence of events in the input buffer and the program data vector as this DATA step executes is identical to the previous example in method 1. The / is the signal to read a new record into the input buffer, which happens automatically when the DATA step encounters a new INPUT statement. The preceding example shows two slashes (/ /), indicating that SAS skips a record. SAS reads the first record, skips the second record, and reads the third record.
The following output shows the resulting data set:
Data Set Created with the / Line-Pointer Control
                  Weight Club Members                  1

                  Id       Start     End
         Obs    Number    Weight    Weight

          1      1023       189       165
          2      1049       145       124
          3      1219       210       192
          4      1246       194       177
          5      1078       127       118
          6      1221       220         .
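The / line-pointer control can also separate variable specifications when every record contributes values, rather than being doubled to skip a record. The following is a minimal sketch of that variation, assuming the same three-record layout. The data set name club_slash is hypothetical, and only two members are shown.

```sas
/* club_slash is a hypothetical data set name used only for this sketch. */
data club_slash;
   input IdNumber 1-4 Name $ 6-23            /* record 1 */
         / Team $ 1-6                        /* slash loads record 2 */
         / StartWeight 1-3 EndWeight 5-7;    /* slash loads record 3 */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_slash;
   title 'Weight Club Members';
run;
```

Because each slash replaces the contents of the input buffer with the next record, the variables must still be listed in the order in which their records appear.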
Reading Variables from Multiple Records in Any Order
You can also read multiple records to create a single observation by pointing to a specific record in a set of input records with the #n line-pointer control. As you saw in the last section, the advantage of using the / line-pointer control over multiple INPUT statements is that it requires fewer statements. However, using the #n line-pointer control enables you to read the variables in any order, no matter which record contains the data values. It is also useful if you want to skip data lines.
This example uses one INPUT statement to read multiple data lines in a different order:
data club2;
   input #2 Team $ 1-6
         #1 Name $ 6-23 IdNumber 1-4
         #3 StartWeight 1-3 EndWeight 5-7;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220 .
;

proc print data=club2;
   title 'Weight Club Members';
run;
The following output shows the resulting data set:
Data Set Created with the #n Line-Pointer Control
                        Weight Club Members                        1

                                          Id      Start     End
 Obs    Team      Name                  Number    Weight    Weight

  1     red       David Shaw             1023       189       165
  2     yellow    Amelia Serrano         1049       145       124
  3     red       Alan Nance             1219       210       192
  4     yellow    Ravi Sinha             1246       194       177
  5     red       Ashley McKnight        1078       127       118
  6     yellow    Jim Brown              1221       220         .
The order of the observations is the same as in the raw records (shown in the section Reading Variables from Multiple Records in Any Order). However, the order of the variables in the data set differs from their order in the raw input records, because variables are added to the data set in the order in which they appear in the INPUT statement.
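The #n line-pointer control can also be used to skip data lines, as mentioned at the start of this section. The following is a minimal sketch, assuming the same three-record layout: it points only at records 1 and 3, so the second record of each group is never read into a variable. The data set name club_skip is hypothetical, and only two members are shown.

```sas
/* club_skip is a hypothetical data set name used only for this sketch. */
data club_skip;
   input #1 IdNumber 1-4                       /* value from record 1 */
         #3 StartWeight 1-3 EndWeight 5-7;     /* record 2 is skipped */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_skip;
   title 'Weight Club Members';
run;
```

The resulting data set contains only IdNumber, StartWeight, and EndWeight, the same variables that the earlier examples produce with a null INPUT statement or a doubled slash.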
Understanding How the #n Line-Pointer Control Affects DATA Step Execution
To understand the importance of the #n line-pointer control, remember the sequence of events in the DATA steps that demonstrate the / line-pointer control and multiple INPUT statements. Each record is read into the input buffer sequentially. The data is read, and then a / or a new INPUT statement causes the program to read the next record into the input buffer. It is impossible for the program to read a value from the first record after a value from the second record is read because the data in the first record is no longer available in the input buffer.
To solve this problem, use the #n line-pointer control. The #n line-pointer control signals the program to create a multiple-line input buffer so that all the data for a single observation is available while the observation is being built in the program data vector. The #n line-pointer control also identifies the record in which data for each variable appears. To use the #n line-pointer control, the raw data must have the same number of records for each observation; for example, it cannot have three records for one observation and two for the next.
When the program compiles and builds the input buffer, it looks at the INPUT statement and creates an input buffer with as many lines as are necessary to contain the number of records it needs to read for a single observation. In this example, the highest number of records specified is three, so the input buffer is built to contain three records at one time. The following figures demonstrate the flow of the DATA step in this example.
This figure shows that the values are set to missing in the program data vector and that the INPUT statement reads the first three records into the input buffer.
Three Records Are Read into the Input Buffer as a Single Observation
The INPUT statement for this example is as follows:
   input #2 Team $ 1-6
         #1 Name $ 6-23 IdNumber 1-4
         #3 StartWeight 1-3 EndWeight 5-7;
The first variable is preceded by #2 to indicate that the value in the second record is assigned to the variable Team. The following figure shows that the pointer advances to the second line in the input buffer, reads the value, and writes it to the program data vector.
Reading from the Second Record First
The following figure shows that the pointer then moves to the sixth column in the first record, reads a value, and assigns it to the variable Name in the program data vector. It then moves to the first column to read the ID number, and assigns it to the variable IdNumber.
Reading from the First Record
The following figure shows that the process continues with the pointer moving to the third record in the first observation. Values are read and assigned to StartWeight and EndWeight, the last variable that is listed.
Reading from the Third Record
When the bottom of the DATA step is reached, variable values in the program data vector are written as an observation to the data set. The DATA step returns to the top, and values in the program data vector are set to missing. The INPUT statement executes again. The final figure shows that the next three records are read into the input buffer, ready to create the second observation.
Reading the Next Three Records into the Input Buffer
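Because the number of lines in the input buffer is taken from the highest #n in the INPUT statement, a DATA step that needs values only from the first two records of each three-record group must still account for the third record. One way to do this is to end the INPUT statement with a #n pointer that names the last record in the group. The following is a minimal sketch under that assumption. The data set name club_team is hypothetical, and only two members are shown.

```sas
/* club_team is a hypothetical data set name used only for this sketch. */
data club_team;
   input #1 IdNumber 1-4     /* value from record 1 */
         #2 Team $ 1-6       /* value from record 2 */
         #3;                 /* no variables; marks each group as 3 records */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_team;
   title 'Weight Club Members';
run;
```

Without the trailing #3, the highest record number would be 2, and SAS would treat every two records as one observation, pairing values from the wrong lines.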
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.