SAS Institute. The Power to Know

Learning Center

Send e-mail
Call us


google graphic

Full Text Search
SAS Publishing Titles

Related Links

Stay Informed

I Should Have Thought of That!


tip graphic

SAS Publishing's web column, "Tipped Off" may give you that feeling from time to time. "Tipped Off" brings you a variety of proven tips and techniques from your favorite Books By Users Press authors.

Take a look at all the tips that have already been featured - BBU Tip Archive


Reading Messy Raw Data 
By Lora D. Delwiche and Susan J. Slaughter

  Sometimes you need to read data that just don't line up in nice columns or have predictable lengths. When you have these types of messy files, ordinary list, column, or formatted input simply aren't enough. You need more tools in your bag-tools like the @'character' column pointer and the colon modifier.

The @'character' column pointer
You can use the @n column pointer to move to a particular column before reading data. However, sometimes you don't know which column a variable starts in, but you do know that it always comes after a particular character or word. For these types of situations, you can use the @'character' column pointer. For example, suppose you have a data file that has information about dog ownership. Nothing in the file lines up, but you know that the breed of the dog always follows the word Breed:.

Given this line of raw data,
My dog Kiah Breed: Shepherd Vet Bills: $325

You could read the dog's breed using the following INPUT statement:
INPUT @'Breed:' DogBreed $;

The colon modifier
The above INPUT statement will work just fine as long as the dog's breed is 8 characters or less (the default length for a character variable). So if the dog is a Shepherd you're fine, but if the dog is a Rottweiler, all you will get is Rottweil. You can assign the variable an informat, such as $20., in the INPUT statement, but then SAS will read 20 columns every time, even when the value of DogBreed is shorter. Then the value may include unwanted characters which appear after the dog's breed in the data line. If you only want SAS to read until it encounters a delimiter (a space is the default delimiter), you can use a colon modifier with the informat. To use a colon modifier, simply put a colon (:) before the informat (:$20. instead of $20.).

For example, given this line of raw data,
My dog Sam Breed: Rottweiler Vet Bills: $478

the following table shows the results you would get using different INPUT statements:
Statements Value of variable DogBreed
INPUT @'Breed:' DogBreed $; Rottweil
INPUT @'Breed:' DogBreed $20.; Rottweiler Vet Bill
INPUT @'Breed:' DogBreed :$20.; Rottweiler

Example
Web logs are a good example of messy data. The following data lines are part of a web log for a dog care business website. The data lines start with the IP address of the computer accessing the web page followed by other information including the date the file was accessed and the filename.

130.192.70.235 - - [08/Jun/2004:23:51:32 -0700] "GET /rover.jpg HTTP/1.1" 200 66820
128.32.236.8 - - [08/Jun/2004:23:51:40 -0700] "GET /grooming.html HTTP/1.0" 200 8471
128.32.236.8 - - [08/Jun/2004:23:51:40 -0700] "GET /Icons/brush.gif HTTP/1.0" 200 89
128.32.236.8 - - [08/Jun/2004:23:51:40 -0700] "GET /H_poodle.gif HTTP/1.0" 200 1852
118.171.121.37 - - [08/Jun/2004:23:56:46 -0700] "GET /bath.gif HTTP/1.0" 200 14079
128.123.121.37 - - [09/Jun/2004:00:57:49 -0700] "GET /lobo.gif HTTP/1.0" 200 18312
128.123.121.37 - - [09/Jun/2004:00:57:49 -0700] "GET /statemnt.htm HTTP/1.0" 200 238
128.75.226.8 - - [09/Jun/2004:01:59:40 -0700] "GET /Icons/leash.gif HTTP/1.0" 200 98

We are interested in the date the files were accessed and the filename. You can see that because the IP address is not always the same number of characters in length, the date does not line up in the same column all the time. Also, not only does the filename not line up in columns, but the length of the filename is highly variable. Here is a SAS program that can read this file:

DATA weblogs;
INFILE 'c:\MyWebLogs\dogweblogs.txt';
INPUT @'[' AccessDate DATE11. @'GET' File :$20.;
PROC PRINT;
TITLE 'Dog Care Web Logs';
FORMAT AccessDate MMDDYY8.;
RUN;

This INPUT statement uses @'[' to position the column pointer to read the date, then uses @'GET' to position the column pointer to read the filename. Because the filename is more than 8 characters long, but not always the same number of characters, an informat with a colon modifier :$20. is used to read the filename. Here are the results of this program:

                              Dog Care Web Logs                          1

                         Obs    AccessDate    File

                          1     06/08/04      /rover.jpg
                          2     06/08/04      /grooming.html
                          3     06/08/04      /Icons/brush.gif
                          4     06/08/04      /H_poodle.gif
                          5     06/08/04      /bath.gif
                          6     06/08/04      /lobo.gif
                          7     06/09/04      /statemnt.htm
                          8     06/09/04      /Icons/leash.gif
						 

In this case, SAS knew that it had reached the end of the filename when it reached the default delimiter-a space. This method also works for files with other delimiters such as commas or tabs. To specify other delimiters, add the DLM= and DSD options to your INFILE statement.


About the Authors
Lora D. Delwiche and Susan J. Slaughter are best known as the authors of The Little SAS Book, which is published by SAS. They have presented many papers at local, regional, and national SAS conferences.

Their book is available from the online bookstore.


Keep up with our current and new tips at http://support.sas.com/publishing/tip.html.

For other tips, please visit our Tip Archive