Introduction to DATA Step Processing
Overview of Creating a SAS Data Set
You supply SAS with specific information for reading raw data so that you can create a SAS data set from the raw data. You can use the data set for further processing, data analysis, or report writing. To process raw data in a DATA step, you must tell SAS how to read the data, define the variables, and indicate the location of the raw data. The sections that follow cover each of these tasks.
Telling SAS How to Read the Data: Styles of Input
SAS provides many tools for reading raw data into a SAS data set. These tools include three basic input styles as well as various format modifiers and pointer controls.
List input is used when each field in the raw data is separated by at least one space and does not contain embedded spaces. The INPUT statement simply contains a list of the variable names. List input, however, places numerous restrictions on your data. These restrictions are discussed in detail in Starting with Raw Data: The Basics. The following example shows list input. Note that there is at least one blank space between each data value.
data scores;
   input Name $ Test_1 Test_2 Test_3;
   datalines;
Bill 187 97 103
Carlos 156 76 74
Monique 99 102 129
;
Column input enables you to read the same data if it is located in fixed columns:
data scores;
   input Name $ 1-7 Test_1 9-11 Test_2 13-15 Test_3 17-19;
   datalines;
Bill    187  97 103
Carlos  156  76  74
Monique  99 102 129
;
Formatted input enables you to supply special instructions in the INPUT statement for reading data. For example, to read numeric data that contains special symbols, you need to supply SAS with special instructions so that it can read the data correctly. These instructions, called informats, are discussed in more detail in Starting with Raw Data: The Basics. In the INPUT statement, you can specify an informat to be used to read a data value, as in the example that follows:
data total_sales;
   input Date mmddyy10. +2 Amount comma5.;
   datalines;
09/05/2000  1,382
10/19/2000  1,235
11/30/2000  2,391
;
In this example, the MMDDYY10. informat for the variable Date tells SAS to interpret the raw data as a month, day, and year, ignoring the slashes. The COMMA5. informat for the variable Amount tells SAS to interpret the raw data as a number, ignoring the comma. The +2 is a pointer control that tells SAS where to look for the next item. For more information about pointer controls, see Starting with Raw Data: The Basics.
SAS also enables you to mix these styles of input as required by the way values are arranged in the data records. Starting with Raw Data: The Basics discusses in detail input styles (including their rules and restrictions), as well as additional data-reading tools.
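As an illustration of mixing styles, the following sketch reads one record layout with all three. The data set name SALES_MIXED, the variables Region and Units, and the data values are hypothetical and chosen only for this example. It assumes that Region occupies columns 1-5, the date starts in column 7, and the amount starts in column 18.

```sas
data sales_mixed;
   /* Region: column input (columns 1-5).                   */
   /* Date and Amount: formatted input with informats.      */
   /* Units: list input (next nonblank value on the line).  */
   /* @7 and +1 are pointer controls that position the      */
   /* pointer at the start of the Date and Amount fields.   */
   input Region $ 1-5 @7 Date mmddyy10. +1 Amount comma5. Units;
   datalines;
East  09/05/2000 1,382 12
West  10/19/2000 1,235 9
North 11/30/2000 2,391 15
;
```

Column input and formatted input depend on the exact column positions, so the data lines must stay aligned as shown. Only Units, which is read with list input, tolerates extra blanks before its value.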
Reading Dates with Two-Digit and Four-Digit Year Values
In the previous example, the year values in the dates in the raw data had four digits:
09/05/2000 10/19/2000 11/30/2000
However, SAS is also capable of reading two-digit year values (for example, 09/05/99). In that case, use the MMDDYY8. informat for the variable Date, because each value is eight characters wide.
How does SAS know to which century a two-digit year belongs? SAS uses the value of the YEARCUTOFF= SAS system option. In Version 7 and later of SAS, the default value of the YEARCUTOFF= option is 1920. This means that two-digit years from 00 to 19 are assumed to be in the twenty-first century, that is, 2000 to 2019. Two-digit years from 20 to 99 are assumed to be in the twentieth century, that is, 1920 to 1999.
Note: The YEARCUTOFF= option and the default setting may be different at your site.
To avoid confusion, you should use four-digit year values in your raw data wherever possible. For more information, see the Dates, Times, and Intervals section of SAS Language Reference: Concepts.
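The following sketch shows the effect of the cutoff on two-digit years. It sets YEARCUTOFF=1920 explicitly rather than relying on the site default. The data set name and the data values are hypothetical. The FORMAT statement, which is covered later, is used here only so that the printed dates show four-digit years.

```sas
/* Set the cutoff explicitly; the site default may differ. */
options yearcutoff=1920;

data two_digit_years;
   input Date mmddyy8.;
   /* Display the stored dates with four-digit years. */
   format Date mmddyy10.;
   datalines;
09/05/99
10/19/19
11/30/20
;

/* With YEARCUTOFF=1920, the years are read as 1999, 2019, and 1920. */
proc print data=two_digit_years;
run;
```

Changing the option value, for example to 1950, would change how the same raw data is interpreted, which is why four-digit years are the safer choice.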
Defining Variables in SAS
So far you have seen that the INPUT statement instructs SAS on how to read raw data lines. At the same time that the INPUT statement provides instructions for reading data, it defines the variables for the data set that come from the raw data. By assuming default values for variable attributes, the INPUT statement does much of the work for you. Later in this documentation, you will learn other statements that enable you to define variables and assign attributes to variables, but this section and Starting with Raw Data: The Basics concentrate on the use of the INPUT statement.
SAS variables can have these attributes: name, type, length, informat, format, label, position in observation, and index type.
See the SAS Variables section of SAS Language Reference: Concepts for more information about variable attributes.
In an INPUT statement, you must supply each variable name. Unless you also supply an informat, the type is assumed to be numeric, and its length is assumed to be eight bytes. The following INPUT statement creates four numeric variables, each with a length of eight bytes, without requiring you to specify either type or length. The table summarizes this information.
input IdNumber Test_1 Test_2 Test_3;
Variable name | Type | Length |
---|---|---|
IdNumber | numeric | 8 |
Test_1 | numeric | 8 |
Test_2 | numeric | 8 |
Test_3 | numeric | 8 |
The values of numeric variables can contain only numbers. To store values that contain alphabetic or special characters, you must create a character variable. By following a variable name in an INPUT statement with a dollar sign ($), you create a character variable. The default length of a character variable is also eight bytes. The following statement creates a data set that contains one character variable and four numeric variables, all with a default length of eight bytes. The table summarizes this information.
input IdNumber Name $ Test_1 Test_2 Test_3;
Variable name | Type | Length |
---|---|---|
IdNumber | numeric | 8 |
Name | character | 8 |
Test_1 | numeric | 8 |
Test_2 | numeric | 8 |
Test_3 | numeric | 8 |
In addition to specifying the types of variables in the INPUT statement, you can also specify the lengths of character variables. Character variables can be up to 32,767 bytes in length. To specify the length of a character variable in an INPUT statement, you need to supply an informat or use column numbers. For example, following a variable name in the INPUT statement with the informat $20., or with column specifications such as 1-20, creates a character variable that is 20 bytes long.
Note that the length of numeric variables is not affected by informats or column specifications in an INPUT statement. See SAS Language Reference: Concepts for more information about numeric variables and lengths.
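One way to confirm these lengths is to read a small data set and inspect its variable attributes. The sketch below is an illustration only. The data set name MEMBERS and the data values are hypothetical, and PROC CONTENTS is used simply to list the type and length of each variable.

```sas
data members;
   /* Column numbers 1-20 give Name a length of 20 bytes.     */
   /* Columns 22-24 do not change the numeric length, so      */
   /* Test_1 keeps the default length of 8 bytes.             */
   input Name $ 1-20 Test_1 22-24;
   datalines;
Amelia Serrano       145
Ravi Sinha           194
;

/* List the name, type, and length of each variable. */
proc contents data=members;
run;
```

Replacing the column specification for Name with the informat $20. should give the same length of 20 bytes, as described above.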
Two other variable attributes, format and label, affect how variable values and names are represented when they are printed or displayed. These attributes are assigned with different statements that you will learn about later.
Indicating the Location of Your Data
To create a SAS data set, you can read data from one of four locations:
raw data in the data (job) stream, that is, following a DATALINES statement
raw data in a file that you specify with an INFILE statement
data from an existing SAS data set
data in a database management system (DBMS) file
You can place data directly in the job stream with the programming statements that make up the DATA step. The DATALINES statement tells SAS that raw data follows. The single semicolon that follows the last line of data marks the end of the data. The DATALINES statement and data lines must occur last in the DATA step statements:
data weight_club;
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   Loss = StartWeight - EndWeight;
   datalines;
1023 David Shaw         red 189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red 210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red 127 118
;
If your raw data is already stored in a file, then you do not have to bring that file into the data stream. Use an INFILE statement to specify the file containing the raw data. (See Using External Files in Your SAS Job for details about INFILE, FILE, and FILENAME statements.) The statements in the code that follows demonstrate the same example, this time showing that the raw data is stored in an external file:
data weight_club;
   infile 'your-input-file';
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
   Loss = StartWeight - EndWeight;
run;
You can also use data that is already stored in a SAS data set as input to a new data set. To read data from an existing SAS data set, you must specify the existing data set's name in one of these statements: SET, MERGE, MODIFY, or UPDATE.
For example, the statements that follow create a new SAS data set named RED that adds the variable LossPercent:
data red;
   set weight_club;
   LossPercent = Loss / StartWeight * 100;
run;
The SET statement indicates that the input data is already in the structure of a SAS data set and gives the name of the SAS data set to be read. In this example, the SET statement tells SAS to read the WEIGHT_CLUB data set in the WORK library.
If you have data that is stored in another vendor's database management system (DBMS) files, then you can use SAS/ACCESS software to bring this data into a SAS data set. SAS/ACCESS software enables you to assign a libref to a library containing the DBMS file. In this example, a libref is declared, and points to a library containing Oracle data. SAS reads data from an Oracle file into a SAS data set:
libname dblib oracle user=scott password=tiger path='hrdept_002';

data employees;
   set dblib.employees;
run;
See SAS/ACCESS for Relational Databases: Reference for more information about using SAS/ACCESS software to access DBMS files.
Using External Files in Your SAS Job
Your SAS programs often need to read raw data from a file, or write data or reports to a file that is not a SAS data set. To use a file that is not a SAS data set in a SAS program, you need to tell SAS where to find it. You can do the following:
Identify the file directly in the INFILE, FILE, or other SAS statement that uses the file.
Set up a fileref for the file by using the FILENAME statement, and then use the fileref in the INFILE, FILE, or other SAS statement.
Use operating environment commands to set up a fileref, and then use the fileref in the INFILE, FILE, or other SAS statement.
The first two methods are described here. The third method depends on the operating environment that you use.
Operating Environment Information: For more information, refer to the SAS documentation for your operating environment.
Identifying an External File Directly
The simplest method for referring to an external file is to use the name of the file in the INFILE, FILE, or other SAS statement that needs to refer to the file. For example, if your raw data is stored in a file in your operating environment, and you want to read the data using a SAS DATA step, you can tell SAS where to find the raw data by putting the name of the file in the INFILE statement:
data temp;
   infile 'your-input-file';
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;
The INFILE statement for this example may appear as follows for various operating environments:
Operating environment | INFILE statement example |
---|---|
z/OS | infile 'fitness.weight.rawdata(club1)'; |
CMS | infile 'club1 weight a'; |
OpenVMS | infile '[fitness.weight.rawdata]club1.dat'; |
UNIX | infile '/usr/local/fitness/club1.dat'; |
Windows | infile 'c:\fitness\club1.dat'; |
Operating Environment Information: For more information, refer to the SAS documentation for your operating environment.
Referencing an External File with a Fileref
An alternate method for referencing an external file is to use the FILENAME statement to set up a fileref for a file. The fileref functions as a shorthand way of referring to an external file. You then use the fileref in later SAS statements that reference the file, such as the FILE or INFILE statement. The advantage of this method is that if the program contains many references to the same external file and the external filename changes, then the program needs to be modified in only one place, rather than in every place where the file is referenced.
Here is the form of the FILENAME statement:
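FILENAME fileref 'your-input-or-output-file';
The fileref must be a valid SAS name, that is, it must begin with a letter or an underscore, contain only letters, numbers, or underscores, and have no more than eight characters.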
Operating Environment Information: Additional restrictions may apply under some operating environments. For more information, refer to the SAS documentation for your operating environment.
For example, you can reference the raw data that is stored in a file in your operating environment by first using the FILENAME statement to specify the name of the file and its fileref, and then using the INFILE statement with the same fileref to reference the file.
filename fitclub 'your-input-file';

data temp;
   infile fitclub;
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;
In this example, the INFILE statement stays the same for all operating environments. The FILENAME statement, however, can appear differently in different operating environments, as the following table shows:
Operating environment | FILENAME statement example |
---|---|
z/OS | filename fitclub 'fitness.weight.rawdata(club1)'; |
CMS | filename fitclub 'club1 weight a'; |
OpenVMS | filename fitclub '[fitness.weight.rawdata]club1.dat'; |
UNIX | filename fitclub '/usr/local/fitness/club1.dat'; |
Windows | filename fitclub 'c:\fitness\club1.dat'; |
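A fileref can identify an output file as well as an input file. The following sketch is one possible illustration, not a form taken from this section. The fileref OUTRPT and the placeholder 'your-output-file' are hypothetical, and the PUT statement, which writes lines to the file named in the FILE statement, is used here only to show the fileref in an output context.

```sas
/* Hypothetical fileref and placeholder file name. */
filename outrpt 'your-output-file';

/* _NULL_ runs the DATA step without creating a SAS data set. */
data _null_;
   input IdNumber 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
   Loss = StartWeight - EndWeight;
   /* Direct PUT output to the file that OUTRPT refers to. */
   file outrpt;
   /* Write one line per record, using column output. */
   put IdNumber 1-4 Name $ 6-23 Loss 25-27;
   datalines;
1023 David Shaw        189 165
1049 Amelia Serrano    145 124
;
```

As with the INFILE statement, only the FILENAME statement would need to change if the output file were moved or renamed.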
If you need to use several files or members from the same directory, partitioned data set (PDS), or MACLIB, then you can use the FILENAME statement to create a fileref that identifies the name of the directory, PDS, or MACLIB. Then you can use the fileref in the INFILE statement and enclose the name of the file, PDS member, or MACLIB member in parentheses immediately after the fileref, as in this example:
filename fitclub 'directory-or-PDS-or-MACLIB';

data temp;
   infile fitclub(club1);
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;

data temp2;
   infile fitclub(club2);
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;
In this case, the INFILE statements stay the same for all operating environments. The FILENAME statement, however, can appear differently for different operating environments, as the following table shows:
Operating environment | FILENAME statement example |
---|---|
z/OS | filename fitclub 'fitness.weight.rawdata'; |
CMS | filename fitclub 'use1 maclib'; (table note 1) |
OpenVMS | filename fitclub '[fitness.weight.rawdata]'; |
UNIX | filename fitclub '/usr/local/fitness'; |
Windows | filename fitclub 'c:\fitness'; |
TABLE NOTE 1: Under CMS, the external file must be a CMS MACLIB, a CMS TXTLIB, or a z/OS PDS.
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.