Starting with Raw Data: Beyond the Basics

Reading Multiple Records to Create a Single Observation


How the Data Records Are Structured

An earlier example (see Reading Character Data That Contains Embedded Blanks) shows data in which all the values for an observation are contained in a single record of raw data:

1023 David Shaw       red 189 165

This INPUT statement reads all the data values arranged across a single record:

input IdNumber 1-4 Name $ 6-23 Team $ StartWeight EndWeight;

Now, consider the opposite situation: the information for a single observation is not contained in a single record of raw data but is spread across several records. For example, the health and fitness club data could be arranged so that the information about a single member occupies three records instead of one:

1023 David Shaw
red
189 165


Method 1: Using Multiple Input Statements

Multiple INPUT statements, one for each record, can read each record into a single observation, as in this example:

input IdNumber 1-4 Name $ 6-23;    
input Team $ 1-6;
input StartWeight 1-3 EndWeight 5-7;

To understand how to use multiple INPUT statements, consider what happens as a DATA step executes. Remember that one record is read into the input buffer automatically as each INPUT statement is encountered during each iteration. SAS reads the data values from the input buffer and writes them to the program data vector as variable values. At the end of the DATA step, all the variable values in the program data vector are written automatically as a single observation.
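As a minimal sketch of that sequence, the three INPUT statements shown above can be placed in a complete DATA step so that each iteration consumes three records and writes one observation. The data set name club_all is hypothetical and is used here only for illustration; the records are the first two members from the club data.

```sas
/* Sketch: one INPUT statement per raw record.            */
/* club_all is a hypothetical data set name.              */
data club_all;
   input IdNumber 1-4 Name $ 6-23;         /* record 1 */
   input Team $ 1-6;                       /* record 2 */
   input StartWeight 1-3 EndWeight 5-7;    /* record 3 */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_all;
   title 'Weight Club Members';
run;
```

With six records and three INPUT statements per iteration, this step should produce two observations, each containing all five variables.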

This example uses multiple INPUT statements in a DATA step to read only selected data fields and create a data set containing only the variables IdNumber, StartWeight, and EndWeight.

data club2;
   input IdNumber 1-4;                     /* [1] */
   input;                                  /* [2] */
   input StartWeight 1-3 EndWeight 5-7;    /* [3] */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124 
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220  .
;

proc print data=club2;
   title 'Weight Club Members';
run;

The following list corresponds to the numbered items in the preceding program:

[1] The first INPUT statement reads only one data field in the first record and assigns a value to the variable IdNumber.

[2] The second INPUT statement, without arguments, is a null INPUT statement that reads the second record into the input buffer. However, it does not assign a value to a variable.

[3] The third INPUT statement reads the third record into the input buffer and assigns values to the variables StartWeight and EndWeight.

The following output shows the resulting data set:

Data Set Created with Multiple INPUT Statements

                              Weight Club Members                              1

                                Id       Start      End
                       Obs    Number    Weight    Weight

                        1      1023       189       165 
                        2      1049       145       124 
                        3      1219       210       192 
                        4      1246       194       177 
                        5      1078       127       118 
                        6      1221       220         . 
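One way to watch each INPUT statement bring a new record into the input buffer is to write the buffer to the SAS log after each statement. The following sketch assumes the automatic variable _INFILE_, which refers to the current contents of the input buffer; the literal text in the PUT statements is for illustration only. DATA _NULL_ is used so that no data set is created.

```sas
/* Sketch: write the input buffer to the log after each INPUT statement. */
data _null_;
   input IdNumber 1-4;
   put 'After first INPUT:  ' _infile_;    /* buffer holds record 1 */
   input;
   put 'After second INPUT: ' _infile_;    /* buffer holds record 2 */
   input StartWeight 1-3 EndWeight 5-7;
   put 'After third INPUT:  ' _infile_;    /* buffer holds record 3 */
   datalines;
1023 David Shaw
red
189 165
;
```

The log lines should show a different record after each INPUT statement, including the null INPUT statement, which loads the second record even though it assigns no values.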

Method 2: Using the / Line-Pointer Control

Writing a separate INPUT statement for each record is not the only way to create a single observation. You can write a single INPUT statement and use the slash (/) line-pointer control. The slash line-pointer control forces a new record into the input buffer and positions the pointer at the beginning of that record.

This example uses only one INPUT statement to read multiple records:

data club2;
   input IdNumber 1-4 / / StartWeight 1-3 EndWeight 5-7;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220   . 
;

proc print data=club2;
   title 'Weight Club Members';
run;

The / line-pointer control appears exactly where a new INPUT statement begins in the previous example (see Method 1: Using Multiple Input Statements). The sequence of events in the input buffer and the program data vector as this DATA step executes is identical to that of the Method 1 example. The / signals SAS to read a new record into the input buffer, just as encountering a new INPUT statement does. The preceding example uses two slashes (/ /), so SAS reads the first record, advances past the second record without reading any values from it, and reads the third record.

The following output shows the resulting data set:

Data Set Created with the / Line-Pointer Control

                              Weight Club Members                              1

                                Id       Start      End
                       Obs    Number    Weight    Weight

                        1      1023       189       165 
                        2      1049       145       124 
                        3      1219       210       192 
                        4      1246       194       177 
                        5      1078       127       118 
                        6      1221       220         . 
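The / line-pointer control can also be used once per record boundary when values are needed from every record. The following sketch reads all five variables with a single INPUT statement; the data set name club_slash is hypothetical, and the records are the first two members from the club data.

```sas
/* Sketch: each / advances to the next record of the same observation. */
/* club_slash is a hypothetical data set name.                         */
data club_slash;
   input IdNumber 1-4 Name $ 6-23          /* record 1 */
       / Team $ 1-6                        /* record 2 */
       / StartWeight 1-3 EndWeight 5-7;    /* record 3 */
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_slash;
   title 'Weight Club Members';
run;
```

Because records are still read strictly in sequence, the variables must be listed in the order in which their records appear.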

Reading Variables from Multiple Records in Any Order

You can also read multiple records to create a single observation by pointing to a specific record in a set of input records with the #n line-pointer control. As you saw in the last section, the advantage of using the / line-pointer control over multiple INPUT statements is that it requires fewer statements. However, using the #n line-pointer control enables you to read the variables in any order, no matter which record contains the data values. It is also useful if you want to skip data lines.

This example uses one INPUT statement to read multiple data lines in a different order:

data club2;
   input #2 Team $ 1-6 #1 Name $ 6-23 IdNumber 1-4           
         #3 StartWeight 1-3 EndWeight 5-7;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
1219 Alan Nance
red
210 192
1246 Ravi Sinha
yellow
194 177
1078 Ashley McKnight
red
127 118
1221 Jim Brown
yellow
220   . 
;

proc print data=club2;
   title 'Weight Club Members';
run;

The following output shows the resulting data set:

Data Set Created with the #n Line-Pointer Control

                              Weight Club Members                              1

                                               Id       Start      End
         Obs    Team      Name               Number    Weight    Weight

          1     red       David Shaw          1023       189       165 
          2     yellow    Amelia Serrano      1049       145       124 
          3     red       Alan Nance          1219       210       192 
          4     yellow    Ravi Sinha          1246       194       177 
          5     red       Ashley McKnight     1078       127       118 
          6     yellow    Jim Brown           1221       220         . 

The order of the observations is the same as in the raw records (shown in the section Reading Variables from Multiple Records in Any Order). However, the order of the variables in the data set differs from their order in the raw input data records. This occurs because the order in which variables appear in the INPUT statement determines their order in the resulting data set.
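The #n line-pointer control can also skip a record simply by never pointing to it. The following sketch reads only the first and third records of each three-record group; the data set name club_skip is hypothetical, and the records are the first two members from the club data.

```sas
/* Sketch: point only to records 1 and 3; record 2 is never read from. */
/* club_skip is a hypothetical data set name.                          */
data club_skip;
   input #1 IdNumber 1-4
         #3 StartWeight 1-3 EndWeight 5-7;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;

proc print data=club_skip;
   title 'Weight Club Members';
run;
```

Because the highest record number in the INPUT statement is 3, each observation still consumes three records, so the result should contain the same variables as the Method 1 and Method 2 examples.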


Understanding How the #n Line-Pointer Control Affects DATA Step Execution

To understand the importance of the #n line-pointer control, remember the sequence of events in the DATA steps that demonstrate the / line-pointer control and multiple INPUT statements. Each record is read into the input buffer sequentially. The data is read, and then a / or a new INPUT statement causes the program to read the next record into the input buffer. It is impossible for the program to read a value from the first record after a value from the second record is read because the data in the first record is no longer available in the input buffer.

To solve this problem, use the #n line-pointer control. The #n line-pointer control signals the program to create a multiple-line input buffer so that all the data for a single observation is available while the observation is being built in the program data vector. The #n line-pointer control also identifies the record in which data for each variable appears. To use the #n line-pointer control, the raw data must have the same number of records for each observation; for example, it cannot have three records for one observation and two for the next.

When the program compiles and builds the input buffer, it looks at the INPUT statement and creates an input buffer with as many lines as are necessary to contain the number of records it needs to read for a single observation. In this example, the highest number of records specified is three, so the input buffer is built to contain three records at one time. The following figures demonstrate the flow of the DATA step in this example.

This figure shows that the values are set to missing in the program data vector and that the INPUT statement reads the first three records into the input buffer.

[Figure: Three Records Are Read into the Input Buffer as a Single Observation]

The INPUT statement for this example is as follows:

input #2 Team $ 1-6 
      #1 Name $ 6-23 IdNumber 1-4 
      #3 StartWeight 1-3 EndWeight 5-7;

The first variable is preceded by #2 to indicate that the value in the second record is assigned to the variable Team. The following figure shows that the pointer advances to the second line in the input buffer, reads the value, and writes it to the program data vector.

[Figure: Reading from the Second Record First]

The following figure shows that the pointer then moves to the sixth column in the first record, reads a value, and assigns it to the variable Name in the program data vector. It then moves to the first column to read the ID number, and assigns it to the variable IdNumber.

[Figure: Reading from the First Record]

The following figure shows that the process continues with the pointer moving to the third record of the first observation. Values are read and assigned to StartWeight and then to EndWeight, the last variable that is listed.

[Figure: Reading from the Third Record]

When the bottom of the DATA step is reached, variable values in the program data vector are written as an observation to the data set. The DATA step returns to the top, and values in the program data vector are set to missing. The INPUT statement executes again. The final figure shows that the next three records are read into the input buffer, ready to create the second observation.

[Figure: Reading the Next Three Records into the Input Buffer]
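To trace this cycle in the SAS log, a PUT statement can be added after the INPUT statement. The following sketch assumes the automatic variable _N_, which counts DATA step iterations; the data set name club_trace is hypothetical, and the records are the first two members from the club data.

```sas
/* Sketch: write one log line per DATA step iteration.  */
/* club_trace is a hypothetical data set name.          */
data club_trace;
   input #2 Team $ 1-6
         #1 Name $ 6-23 IdNumber 1-4
         #3 StartWeight 1-3 EndWeight 5-7;
   /* Runs once per observation, after all three records are read */
   put _n_= IdNumber= Team= StartWeight= EndWeight=;
   datalines;
1023 David Shaw
red
189 165
1049 Amelia Serrano
yellow
145 124
;
```

Each log line corresponds to one group of three records, which shows that the multiple-line input buffer is refilled once per iteration rather than once per record.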
