
SAS Publishing's web column, "Tipped Off" may give you that feeling from time to time. "Tipped Off" brings you a variety of proven tips and techniques from your favorite Books By Users Press authors.
Take a look at all the tips that have already been featured - BBU Tip Archive
Reading Messy Raw Data
By Lora D. Delwiche and Susan J. Slaughter
Sometimes you need to read data that just don't line up in nice columns or have predictable lengths. When you have these types of messy files, ordinary list, column, or formatted input simply aren't enough. You need more tools in your bag-tools like the @'character' column pointer and the colon modifier.
The @'character' column pointer
You can use the @n column pointer to move to a particular column before reading data.
However, sometimes you don't know which column a variable starts in, but you do know that it always comes after a
particular character or word. For these types of situations, you can use the @'character' column pointer. For example,
suppose you have a data file that has information about dog ownership. Nothing in the file lines up, but you know that
the breed of the dog always follows the word Breed:.
The colon modifier
The above INPUT statement will work just fine as long as the dog's breed is 8 characters or less (the default length for
a character variable). So if the dog is a Shepherd you're fine, but if the dog is a Rottweiler, all you will get is Rottweil.
You can assign the variable an informat, such as $20., in the INPUT statement, but then SAS will read 20 columns every time,
even when the value of DogBreed is shorter. Then the value may include unwanted characters which appear after the dog's breed
in the data line. If you only want SAS to read until it encounters a delimiter (a space is the default delimiter), you can
use a colon modifier with the informat. To use a colon modifier, simply put a colon (:) before the informat (:$20. instead of $20.).
| Statements | Value of variable DogBreed |
| INPUT @'Breed:' DogBreed $; | Rottweil |
| INPUT @'Breed:' DogBreed $20.; | Rottweiler Vet Bill |
| INPUT @'Breed:' DogBreed :$20.; | Rottweiler |
Example
Web logs are a good example of messy data. The following data lines are part of a web log for a dog care business website.
The data lines start with the IP address of the computer accessing the web page followed by other information including the
date the file was accessed and the filename.
130.192.70.235 - - [08/Jun/2004:23:51:32 -0700] "GET /rover.jpg HTTP/1.1" 200 66820 128.32.236.8 - - [08/Jun/2004:23:51:40 -0700] "GET /grooming.html HTTP/1.0" 200 8471 128.32.236.8 - - [08/Jun/2004:23:51:40 -0700] "GET /Icons/brush.gif HTTP/1.0" 200 89 128.32.236.8 - - [08/Jun/2004:23:51:40 -0700] "GET /H_poodle.gif HTTP/1.0" 200 1852 118.171.121.37 - - [08/Jun/2004:23:56:46 -0700] "GET /bath.gif HTTP/1.0" 200 14079 128.123.121.37 - - [09/Jun/2004:00:57:49 -0700] "GET /lobo.gif HTTP/1.0" 200 18312 128.123.121.37 - - [09/Jun/2004:00:57:49 -0700] "GET /statemnt.htm HTTP/1.0" 200 238 128.75.226.8 - - [09/Jun/2004:01:59:40 -0700] "GET /Icons/leash.gif HTTP/1.0" 200 98
We are interested in the date the files were accessed and the filename. You can see that because the IP address is not always the same number of characters in length, the date does not line up in the same column all the time. Also, not only does the filename not line up in columns, but the length of the filename is highly variable. Here is a SAS program that can read this file:
This INPUT statement uses @'[' to position the column pointer to read the date, then uses @'GET' to position the column pointer to read the filename. Because the filename is more than 8 characters long, but not always the same number of characters, an informat with a colon modifier :$20. is used to read the filename. Here are the results of this program:
Dog Care Web Logs 1
Obs AccessDate File
1 06/08/04 /rover.jpg
2 06/08/04 /grooming.html
3 06/08/04 /Icons/brush.gif
4 06/08/04 /H_poodle.gif
5 06/08/04 /bath.gif
6 06/08/04 /lobo.gif
7 06/09/04 /statemnt.htm
8 06/09/04 /Icons/leash.gif
|
In this case, SAS knew that it had reached the end of the filename when it reached the default delimiter-a space. This method also works for files with other delimiters such as commas or tabs. To specify other delimiters, add the DLM= and DSD options to your INFILE statement.
About the Authors
Lora D. Delwiche and Susan J. Slaughter are best known as the authors of The Little SAS Book,
which is published by SAS. They have presented many papers at local, regional, and national SAS conferences.
Their book is available from the online bookstore.
Keep up with our current and new tips at http://support.sas.com/publishing/tip.html.
For other tips, please visit our Tip Archive