
Introduction to DATA Step Processing

Supplying Information to Create a SAS Data Set


Overview of Creating a SAS Data Set

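You supply SAS with specific information for reading raw data so that you can create a SAS data set from the raw data. You can use the data set for further processing, data analysis, or report writing. To process raw data in a DATA step, you must tell SAS how to read the data, define the variables, and indicate the location of the raw data. The sections that follow cover each of these tasks in turn.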


Telling SAS How to Read the Data: Styles of Input

SAS provides many tools for reading raw data into a SAS data set. These tools include three basic input styles as well as various format modifiers and pointer controls.

List input is used when each field in the raw data is separated by at least one space and does not contain embedded spaces. The INPUT statement simply contains a list of the variable names. List input, however, places numerous restrictions on your data. These restrictions are discussed in detail in Starting with Raw Data: The Basics. The following example shows list input. Note that there is at least one blank space between each data value.

data scores;
   input Name $ Test_1 Test_2 Test_3;
   datalines;
Bill 187 97 103
Carlos 156 76 74
Monique 99 102 129
;

Column input enables you to read the same data if it is located in fixed columns:

data scores;
   input Name $ 1-7 Test_1 9-11 Test_2 13-15 Test_3 17-19;
   datalines;
Bill    187  97 103
Carlos  156  76  74
Monique  99 102 129
;

Formatted input enables you to supply special instructions in the INPUT statement for reading data. For example, to read numeric data that contains special symbols, you need to supply SAS with special instructions so that it can read the data correctly. These instructions, called informats, are discussed in more detail in Starting with Raw Data: The Basics. In the INPUT statement, you can specify an informat to be used to read a data value, as in the example that follows:

data total_sales;
   input Date mmddyy10. +2 Amount comma5.;
   datalines;
09/05/2000  1,382
10/19/2000  1,235
11/30/2000  2,391
;

In this example, the MMDDYY10. informat for the variable Date tells SAS to interpret the raw data as a month, day, and year, ignoring the slashes. The COMMA5. informat for the variable Amount tells SAS to interpret the raw data as a number, ignoring the comma. The +2 is a pointer control that tells SAS where to look for the next item. For more information about pointer controls, see Starting with Raw Data: The Basics.
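As a minimal sketch of how a pointer control positions the read, the same records could be read with an absolute column pointer in place of the relative +2. The @13 pointer is added here for comparison and is not part of the original example. It assumes that the Amount field always begins in column 13, as it does in these data lines. The data set name total_sales_abs is hypothetical.

```sas
data total_sales_abs;
   /* Date occupies columns 1-10, so the pointer rests at column 11.  */
   /* +2 would advance it to column 13; @13 names that column         */
   /* directly instead of counting from the current position.         */
   input Date mmddyy10. @13 Amount comma5.;
   datalines;
09/05/2000  1,382
10/19/2000  1,235
11/30/2000  2,391
;
```

A relative pointer such as +2 keeps working if the whole record shifts, whereas an absolute pointer such as @13 is easier to check against a record layout.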

SAS also enables you to mix these styles of input as required by the way values are arranged in the data records. Starting with Raw Data: The Basics discusses in detail input styles (including their rules and restrictions), as well as additional data-reading tools.
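The following sketch illustrates one way that the three styles might be combined in a single INPUT statement. The data set name club_sales, the variable names, and the record layout are all assumptions made up for this illustration.

```sas
data club_sales;
   /* Column input:    IdNumber in columns 1-4, Name in columns 6-12  */
   /* Pointer control: +1 skips the blank in column 13                */
   /* Formatted input: SaleDate read from columns 14-23               */
   /* List input:      Amount is the next nonblank value              */
   input IdNumber 1-4 Name $ 6-12 +1 SaleDate mmddyy10. Amount;
   datalines;
1023 Bill    09/05/2000 1382
1049 Carlos  10/19/2000 1235
1219 Monique 11/30/2000 2391
;
```

Column and formatted input suit the fields that sit in fixed positions, and list input suits the last field because it is separated from the date by a blank and contains no embedded blanks.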


Reading Dates with Two-Digit and Four-Digit Year Values

In the previous example, the year values in the dates in the raw data had four digits:

09/05/2000 
10/19/2000 
11/30/2000 

However, SAS can also read two-digit year values (for example, 09/05/99). In that case, use the MMDDYY8. informat for the variable Date.

How does SAS know to which century a two-digit year belongs? SAS uses the value of the YEARCUTOFF= SAS system option. In Version 7 and later of SAS, the default value of the YEARCUTOFF= option is 1920. This means that two-digit years from 00 to 19 are assumed to be in the twenty-first century, that is, 2000 to 2019. Two-digit years from 20 to 99 are assumed to be in the twentieth century, that is, 1920 to 1999.

Note: The YEARCUTOFF= option and the default setting may be different at your site.

To avoid confusion, you should use four-digit year values in your raw data wherever possible. For more information, see the Dates, Times, and Intervals section of SAS Language Reference: Concepts.
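As a rough illustration of the rule described above, the following sketch sets YEARCUTOFF= explicitly so that the result does not depend on the site default. The data set name two_digit_years and the sample dates are assumptions chosen for this illustration. The FORMAT statement is used only so that the printed dates show four-digit years.

```sas
/* Set the 100-year window explicitly: 1920 through 2019. */
options yearcutoff=1920;

data two_digit_years;
   input Date mmddyy8.;
   format Date mmddyy10.;   /* display the full four-digit year */
   datalines;
09/05/99
10/19/05
;

proc print data=two_digit_years;
run;
```

With this setting, the year 99 falls in the range 20 to 99 and is read as 1999, and the year 05 falls in the range 00 to 19 and is read as 2005.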


Defining Variables in SAS

So far you have seen that the INPUT statement instructs SAS on how to read raw data lines. At the same time that the INPUT statement provides instructions for reading data, it defines the variables for the data set that come from the raw data. By assuming default values for variable attributes, the INPUT statement does much of the work for you. Later in this documentation, you will learn other statements that enable you to define variables and assign attributes to variables, but this section and Starting with Raw Data: The Basics concentrate on the use of the INPUT statement.

SAS variables can have these attributes: name, type, length, informat, format, label, position in observation, and index type.

See the SAS Variables section of SAS Language Reference: Concepts for more information about variable attributes.

In an INPUT statement, you must supply each variable name. Unless you also supply an informat, the type is assumed to be numeric, and its length is assumed to be eight bytes. The following INPUT statement creates four numeric variables, each with a length of eight bytes, without requiring you to specify either type or length. The table summarizes this information.

input IdNumber Test_1 Test_2 Test_3;

Variable name   Type      Length
IdNumber        numeric   8
Test_1          numeric   8
Test_2          numeric   8
Test_3          numeric   8

The values of numeric variables can contain only numbers. To store values that contain alphabetic or special characters, you must create a character variable. By following a variable name in an INPUT statement with a dollar sign ($), you create a character variable. The default length of a character variable is also eight bytes. The following statement creates a data set that contains one character variable and four numeric variables, all with a default length of eight bytes. The table summarizes this information.

input IdNumber Name $ Test_1 Test_2 Test_3;

Variable name   Type        Length
IdNumber        numeric     8
Name            character   8
Test_1          numeric     8
Test_2          numeric     8
Test_3          numeric     8

In addition to specifying the types of variables in the INPUT statement, you can also specify the lengths of character variables. Character variables can be up to 32,767 bytes in length. To specify the length of a character variable in an INPUT statement, you need to supply an informat or use column numbers. For example, following a variable name in the INPUT statement with the informat $20., or with column specifications such as 1-20, creates a character variable that is 20 bytes long.

Note that the length of numeric variables is not affected by informats or column specifications in an INPUT statement. See SAS Language Reference: Concepts for more information about numeric variables and lengths.
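The following sketch shows one way to confirm these lengths. The data set name members, the variable names, and the data values are assumptions made up for this illustration. PROC CONTENTS is used here only to list the attributes of each variable.

```sas
data members;
   /* $20. reads columns 1-20 and gives Name a length of 20 bytes. */
   /* Age has no informat, so it is numeric with a length of 8.    */
   input Name $20. Age;
   datalines;
Amelia Serrano       34
Ravi Sinha           41
;

/* List each variable with its type and length. */
proc contents data=members;
run;
```

Because the $20. informat reads a fixed width, embedded blanks such as the one in a first and last name are kept as part of the value.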

Two other variable attributes, format and label, affect how variable values and names are represented when they are printed or displayed. These attributes are assigned with different statements that you will learn about later.


Indicating the Location of Your Data


Data Locations

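To create a SAS data set, you can read data from one of four locations: raw data in the job stream (that is, following a DATALINES statement), raw data in an external file, data in an existing SAS data set, or data in a file that is managed by a database management system (DBMS).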


Raw Data in the Job Stream

You can place data directly in the job stream with the programming statements that make up the DATA step. The DATALINES statement tells SAS that raw data follows. The single semicolon that follows the last line of data marks the end of the data. The DATALINES statement and data lines must occur last in the DATA step statements:

data weight_club;
   input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
   Loss = StartWeight - EndWeight;
   datalines;
1023 David Shaw         red    189 165
1049 Amelia Serrano     yellow 145 124
1219 Alan Nance         red    210 192
1246 Ravi Sinha         yellow 194 177
1078 Ashley McKnight    red    127 118
;


Data in an External File

If your raw data is already stored in a file, then you do not have to bring that file into the job stream. Use an INFILE statement to specify the file containing the raw data. (See Using External Files in Your SAS Job for details about INFILE, FILE, and FILENAME statements.) The statements in the code that follows demonstrate the same example, this time showing that the raw data is stored in an external file:

data weight_club;
   infile 'your-input-file';
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 
         EndWeight 28-30;
   Loss=StartWeight-EndWeight;
run;


Data in a SAS Data Set

You can also use data that is already stored in a SAS data set as input to a new data set. To read data from an existing SAS data set, you must specify the existing data set's name in one of these statements: SET, MERGE, MODIFY, or UPDATE.

For example, the statements that follow create a new SAS data set named RED that adds the variable LossPercent:

data red;
   set weight_club;
   LossPercent = Loss / StartWeight * 100;
run;

The SET statement indicates that the input data is already in the structure of a SAS data set and gives the name of the SAS data set to be read. In this example, the SET statement tells SAS to read the WEIGHT_CLUB data set in the WORK library.
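As a small sketch of that last point, the one-level name can be written out as a two-level name. The sketch assumes that no USER library has been assigned, so that one-level names refer to the WORK library. The shortened WEIGHT_CLUB data set below is a stand-in created only to make the sketch self-contained.

```sas
/* A reduced stand-in for the WEIGHT_CLUB data set (illustrative values). */
data weight_club;
   input IdNumber 1-4 StartWeight 6-8 EndWeight 10-12;
   Loss = StartWeight - EndWeight;
   datalines;
1023 189 165
1049 145 124
;

data red;
   /* work.weight_club names the same data set as weight_club. */
   set work.weight_club;
   LossPercent = Loss / StartWeight * 100;
run;
```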


Data in a DBMS File

If you have data that is stored in another vendor's database management system (DBMS) files, then you can use SAS/ACCESS software to bring this data into a SAS data set. SAS/ACCESS software enables you to assign a libref to a library containing the DBMS file. In the following example, a LIBNAME statement declares a libref that points to a library containing Oracle data. SAS then reads data from an Oracle file into a SAS data set:

libname dblib oracle user=scott password=tiger path='hrdept_002';
data employees;
   set dblib.employees;
run;

See SAS/ACCESS for Relational Databases: Reference for more information about using SAS/ACCESS software to access DBMS files.


Using External Files in Your SAS Job

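Your SAS programs often need to read raw data from a file, or write data or reports to a file that is not a SAS data set. To use a file that is not a SAS data set in a SAS program, you need to tell SAS where to find it. You can do any of the following: identify the file directly in the INFILE, FILE, or other SAS statement that uses the file; set up a fileref for the file with the FILENAME statement, and then use the fileref in the INFILE, FILE, or other SAS statement; or use operating environment commands to set up a fileref, and then use the fileref in the INFILE, FILE, or other SAS statement.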

The first two methods are described here. The third method depends on the operating environment that you use.

Operating Environment Information: For more information, refer to the SAS documentation for your operating environment.


Identifying an External File Directly

The simplest method for referring to an external file is to use the name of the file in the INFILE, FILE, or other SAS statement that needs to refer to the file. For example, if your raw data is stored in a file in your operating environment, and you want to read the data using a SAS DATA step, you can tell SAS where to find the raw data by putting the name of the file in the INFILE statement:

data temp;
   infile 'your-input-file';
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 
         EndWeight 28-30;
run;

The INFILE statement for this example may appear as follows for various operating environments:

Example INFILE Statements for Various Operating Environments

Operating environment   INFILE statement example
z/OS                    infile 'fitness.weight.rawdata(club1)';
CMS                     infile 'club1 weight a';
OpenVMS                 infile '[fitness.weight.rawdata]club1.dat';
UNIX                    infile '/usr/local/fitness/club1.dat';
Windows                 infile 'c:\fitness\club1.dat';

Operating Environment Information: For more information, refer to the SAS documentation for your operating environment.


Referencing an External File with a Fileref

An alternate method for referencing an external file is to use the FILENAME statement to set up a fileref for a file. The fileref functions as a shorthand way of referring to an external file. You then use the fileref in later SAS statements that reference the file, such as the FILE or INFILE statement. The advantage of this method is that if the program contains many references to the same external file and the external filename changes, then the program needs to be modified in only one place, rather than in every place where the file is referenced.

Here is the form of the FILENAME statement:

FILENAME fileref 'your-input-or-output-file';

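The fileref must be a valid SAS name, that is, it must begin with a letter or an underscore, contain only letters, numbers, or underscores, and have no more than eight characters.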

Operating Environment Information: Additional restrictions may apply under some operating environments. For more information, refer to the SAS documentation for your operating environment.

For example, you can reference the raw data that is stored in a file in your operating environment by first using the FILENAME statement to specify the name of the file and its fileref, and then using the INFILE statement with the same fileref to reference the file.

filename fitclub 'your-input-file';

data temp;
   infile fitclub;
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;

In this example, the INFILE statement stays the same for all operating environments. The FILENAME statement, however, can appear differently in different operating environments, as the following table shows:

Example FILENAME Statements for Various Operating Environments

Operating environment   FILENAME statement example
z/OS                    filename fitclub 'fitness.weight.rawdata(club1)';
CMS                     filename fitclub 'club1 weight a';
OpenVMS                 filename fitclub '[fitness.weight.rawdata]club1.dat';
UNIX                    filename fitclub '/usr/local/fitness/club1.dat';
Windows                 filename fitclub 'c:\fitness\club1.dat';
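The advantage of naming the file in only one place can be sketched as follows. Here one fileref serves both a FILE statement that writes two records and an INFILE statement that reads them back. The file name is a placeholder, and the use of DATA _NULL_ with PUT statements to create the file is an assumption made so that the sketch is self-contained.

```sas
/* The external file is named once, here. */
filename fitclub 'your-input-or-output-file';

/* Write two fixed-column records to the file through the fileref. */
data _null_;
   file fitclub;
   put '1023 David Shaw        189 165';
   put '1049 Amelia Serrano    145 124';
run;

/* Read the same file back through the same fileref. */
data temp;
   infile fitclub;
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;
```

If the file is later moved or renamed, only the FILENAME statement needs to change. Neither DATA step has to be edited.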

If you need to use several files or members from the same directory, partitioned data set (PDS), or MACLIB, then you can use the FILENAME statement to create a fileref that identifies the name of the directory, PDS, or MACLIB. Then you can use the fileref in the INFILE statement and enclose the name of the file, PDS member, or MACLIB member in parentheses immediately after the fileref, as in this example:

filename fitclub 'directory-or-PDS-or-MACLIB';

data temp;
   infile fitclub(club1);
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;

data temp2;
   infile fitclub(club2);
   input IdNumber $ 1-4 Name $ 6-23 StartWeight 24-26 EndWeight 28-30;
run;

In this case, the INFILE statements stay the same for all operating environments. The FILENAME statement, however, can appear differently for different operating environments, as the following table shows:

Referencing Directories, PDSs, and MACLIBs in Various Operating Environments

Operating environment   FILENAME statement example
z/OS                    filename fitclub 'fitness.weight.rawdata';
CMS                     filename fitclub 'use1 maclib';  (table note 1)
OpenVMS                 filename fitclub '[fitness.weight.rawdata]';
UNIX                    filename fitclub '/usr/local/fitness';
Windows                 filename fitclub 'c:\fitness';

TABLE NOTE 1: Under CMS, the external file must be a CMS MACLIB, a CMS TXTLIB, or a z/OS PDS.
