|
Reading Delimited Text Files Into SAS Since the release of Version 6, some new options are available that many users are not aware of. This paper is intended to explain some of the new options, as well as demonstrate the use of older options that might be helpful. When reading text files with the DATA step, two statements are used: the INFILE statement and the INPUT statement. The INFILE statement is used to specify the physical file being read, as well as any options pertaining to the file. You can use a FILENAME statement in conjunction with an INFILE statement, as such: FILENAME myfile "C:\sasfiles\testfile.csv"; DATA A; INFILE MYFILE; or you can specify the fully qualified path to the file on the INFILE statement, as such: INFILE "C:\ sasfiles\testfile.txt"; The set of options explained below are used on the INFILE statement: DELIMITER= (DLM=) - This option allows you to tell SAS what character is used as a delimiter in a file. If this option is not specified, SAS assumes the delimiter is a space. Some common delimiters are comma, vertical pipe, semi-colon, and tab. The syntax for the option would be as follows: DLM=’char’ where char is equal to the character used as the delimiter in the text file. Since there is no key available for the tab character, you need to use the hexadecimal representation for tab. On ASCII based platforms, this is defined as ‘09’x, and the representation is ‘05’x on EBCDIC platforms. The syntax would be as follows: DLM=’09’x DSD - It has three functions when reading delimited files. The first function is to strip off any quotes that surround values in the text file. The second function deals with missing values. When SAS encounters consecutive delimiters in a file, the default action is to treat the delimiters as one unit. If a file has consecutive delimiters, it’s usually because there are missing values between them. DSD tells SAS to treat consecutive delimiters separately; therefore, a value that is missing between consecutive delimiters will be read as a missing value when DSD is specified. The third function assumes the delimiter is a comma. If DSD is specified and the delimiter is a comma, the DLM= option is not necessary. If another delimiter is used, the DLM= option must be used as well. This option became available in SAS 6.07 and is documented in "SAS Technical Report P-222". In Version 7 and beyond, DSD is documented in "SAS Language Reference: Dictionary". TRUNCOVER- By default, if SAS reads past the end of a record and has not satisfied the INPUT statement, it continues to read data from the next record. This action is called FLOWOVER. MISSOVER is an option that overrides the default action of FLOWOVER, and causes a variable to be set to missing if the INPUT statement is unable to read the entire value. TRUNCOVER is similar to MISSOVER, since it also overrides the default action of FLOWOVER. However, instead of setting a variable’s value to missing, TRUNCOVER writes whatever characters it is able to read to the appropriate variable, so you know what the input record contained. This option became available in SAS 6.07 and is documented in "SAS Technical Report P-222". In Version 7 and beyond, TRUNCOVER is documented in "SAS Language Reference: Dictionary". LRECL= - This option should be used when the length of the records in a file are greater than 256 bytes (on ASCII platforms). The input buffer is 256 bytes by default, and records that are greater than this length will be truncated when they are read. Setting the LRECL= option to a greater length assures that the input buffer is long enough to read the whole record. FIRSTOBS= - This option indicates that you want to start reading the input file at the record number specified, rather than beginning with the first record. This option is helpful when reading files that contain a header record, since a user can specify FIRSTOBS=2 and skip the header record entirely. The following is what a typical INFILE statement may look like: INFILE "C:\sasfiles\testfile.csv" DLM=’09’x DSD LRECL=1000 TRUNCOVER FIRSTOBS=2; The INPUT statement is used to list the variables you want to read from the file. For ex: DATA A; INFILE "C:\ sasfiles\testfile.csv"; INPUT VAR1 VAR2 VAR3; RUN; INPUTTING DATA: If your file contains numeric variables or character variables with a length of 8 bytes or less, then you don’t have to worry about using informats to read in the data (although you may still want to use them if you are reading data such as date or time values, etc.). You would just need to denote any character variables with a dollar sign ($). For ex. DATA A; INFILE ‘C:\sas\example1.csv’; INPUT FNAME $ LNAME $ AGE; RUN; In the example above, the variables FNAME and LNAME are character, so they need a $ after each variable name. The variable AGE is numeric, so nothing else is needed. All three variables will have a length of 8 bytes, since no informats were used. When reading delimited text files, SAS uses the delimiter to determine where one field ends and another begins. Because of this, list input is required. As in the previous example, list input simply lists the variables. However, if your character variables are longer than 8 bytes or if you have values such as dates, times, packed decimal, etc., you need to use an informat and specify a width. This is called formatted input. Using formatted input to read a delimited file can cause problems, because SAS ignores the delimiter and reads the number of bytes specified by the informat. This problem can be corrected by using the colon (:) format modifier, which specifies modified list input. The format modifier tells SAS to read up to the maximum number of bytes specified in the informat, OR until it reaches a delimiter. The delimiter is not ignored. For ex. DATA A; INFILE ‘C:\sas\example2.csv’ DLM=’,’ TRUNCOVER; INPUT FNAME :$12. LNAME :$20. AGE DATE :MMDDYY10.; RUN; To sum it all up, here is an example of a program that reads in a typical comma delimited file. DATA A; INFILE ‘C:\sas\example3.csv’ DLM=’,’ DSD LRECL=1000 TRUNCOVER; INPUT FNAME :$12. LNAME :$20. AGE DATE :MMDDYY10.; RUN; OR: FILENAME MYFILE ‘C:\sas\example3.csv’; DATA A; INFILE MYFILE DLM=’,’ DSD LRECL=1000 TRUNCOVER; INPUT FNAME :$12. LNAME :$20. AGE DATE :MMDDYY10.; RUN; TROUBLESHOOTING: If you are having trouble reading a delimited file and aren’t sure which of the options above you might need, the SAS log may hold the clues to solving the problem.
Problem 1 The maximum record length was 256.
One or more lines were truncated.
By default, the size of the input buffer on PC’s is 256 bytes, and records that are greater than this length will be truncated when they are read into SAS.
If you see these notes in the log, increasing the input buffer with the LRECL option will correct the problem.
Problem 2 NOTE: LOST CARD.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
If you see either one or both of these notes in the log, adding the TRUNCOVER or MISSOVER option to the INFILE statement will correct the problem.
While the second note will always be written to the log in the above scenario, the first note will only be written to the log when SAS has read the last record in the file and tries to read another record to satisfy the INPUT statement.
Hopefully these tips will make reading delimited text files a lot clearer!
|