A macro, %SASGREP, has been written to manage the process of searching a directory structure for text strings contained in files. It has capabilities similar to the UNIX GREP (Global Regular Expression Print) utility from which it derives its name. It uses the Perl regular expression (regex) enhancements to the SAS DATA step available in SAS®9. The implementation discussed below uses the DOS DIR command as implemented in Windows XP, but with minor changes, the strategy employed can be used on other Windows operating systems, or on UNIX-based systems.
A two-step approach is used in %SASGREP. In the first step, a list of files is produced by using the pipe device type on a FILENAME statement to extract the contents of a directory path. The path, filename, and descriptive information are stored in a SAS dataset. In the second step, the files in each directory are opened and read into a SAS character variable, line-by-line. Each line is parsed and scanned using the Perl regex functions PRXPARSE and PRXMATCH. If there is a match found in the line scanned, the entire line and descriptive information are stored in a SAS dataset. Based on %SASGREP control flags, the matched lines and descriptive information are printed to the list file.
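The matching step can be tried in isolation before running the macro against real files. The following is a minimal sketch, not the macro's own code: it reads a few in-line records instead of external files, and the dataset name `matches` and the sample lines are made up for illustration.

```sas
/* Minimal sketch: compile a regex once, then test every input line against it. */
data matches ;
   length line $80 ;
   retain prx ;
   if _n_ = 1 then prx = prxparse( '/pipe/i' ) ;  /* compile on the first iteration only */
   infile datalines truncover ;
   input line $char80. ;
   lineno + 1 ;                                   /* running line counter */
   if prxmatch( prx, line ) > 0 then output ;     /* keep only the lines that match */
   drop prx ;
   datalines ;
use a Windows PIPE to run dir
nothing to see here
parse pipe output as a file
;
run ;
```

With these three sample records, `matches` should hold the first and third lines, with `lineno` values 1 and 3.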
The macro declaration is given below.
%macro SASGREP( PATH /* Windows path to directory to search */
, PRX /* Perl regex used to match strings */
, CASE= /* [optional] case-sensitivity flag */
, LS=120 /* [optional] line size for printing results */
, OUT= /* [optional] name of dataset containing results */
, PRINT= /* [optional] switches for printing */
, REPORT=Y /* [optional] switch to create results of search */
, SUBDIR=Y /* [optional] switch to search subdirectories */
) ;
The entire macro is listed in full after the article text.
If CASE=I, then a case-insensitive search will be performed. In this case (sorry), it will not matter whether the line in the file is UPPER-CASE, lower-case, or Mixed-Case. The line of characters read from the file will be matched against the regex. When CASE=I, you will get the largest set of matches. An alternative way of performing a case-insensitive search is to use the regex construction /pattern to be matched/i. Appending the "i" to the regex will produce case-insensitive searching. If you use this method for case-insensitive searching, do not use CASE=I, because the macro itself appends an "i" to the regex when CASE=I is specified.
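The effect of the "i" modifier can be checked directly with the PRX functions. This is a small sketch using a made-up test string; the variable names are illustrative only.

```sas
/* Sketch: the same pattern compiled with and without the "i" modifier. */
data _null_ ;
   prx_cs = prxparse( '/sasgrep/'  ) ;   /* case-sensitive   */
   prx_ci = prxparse( '/sasgrep/i' ) ;   /* case-insensitive */
   text   = 'Run SASGREP now' ;
   cs = prxmatch( prx_cs, text ) ;       /* 0: no lower-case "sasgrep" in text   */
   ci = prxmatch( prx_ci, text ) ;       /* 5: starting position of "SASGREP"    */
   put cs= ci= ;
run ;
```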
The line size parameter, LS, can be set to contain the search results for convenient formatting. For example, given 1" left and right margins for a standard 8 ½ x 11 sheet of paper, LS=97 produced the results shown below.
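The PRINT flag controls printing of descriptive information, as specified below.
Flag | Descriptive information printed |
D | date the file was last written |
N | line number of the matched line, relative to the start of the file |
P | directory path containing the file |
S | size of the file in bytes |
T | time the file was last written |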
If you wanted to print all of the descriptive information, you would specify PRINT=DNPST. The filename and line containing the text string matched are always printed.
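The way the PRINT letters and LS interact can be reproduced in open code. The sketch below copies the flag decoding and the width arithmetic from the macro listing and applies them to PRINT=DNPST with LS=97; it only writes the computed width to the log and does not run a search.

```sas
/* Sketch: decode PRINT letters into 0/1 switches, then compute the width left for the matched line. */
%let PRINT  = %upcase( dnpst ) ;
%let LS     = 97 ;
%let P_DATE = %eval( %index( &PRINT, D ) > 0 ) ;
%let P_NUM  = %eval( %index( &PRINT, N ) > 0 ) ;
%let P_PATH = %eval( %index( &PRINT, P ) > 0 ) ;
%let P_SIZE = %eval( %index( &PRINT, S ) > 0 ) ;
%let P_TIME = %eval( %index( &PRINT, T ) > 0 ) ;
/* each optional column costs its field width plus one separating space */
%let LINE_SIZE = %eval( &LS - 17 - 1 - 1 - 17*&P_PATH - 7*&P_DATE - 6*&P_TIME - 8*&P_SIZE - 5*&P_NUM ) ;
%put LINE_SIZE=&LINE_SIZE ;
```

With all five switches on, 35 columns remain for the matched line, which is why long lines are wrapped in the report.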
The macro will take varying amounts of time to execute, depending on how many files lie under the path. If you choose c:\ or some other inclusive set of files, you must be prepared to wait a while for processing to be completed. If your path does not contain many subdirectories or if you specify SUBDIR=N (no recursive processing of subdirectories), then execution time will probably be short.
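To gauge how large a search will be before running it, the directory listing can be inspected on its own. This is a sketch under some assumptions: the path `c:\temp` is hypothetical, the session runs on Windows, and operating-system commands are permitted (the XCMD system option is in effect).

```sas
/* Sketch: echo the output of the DOS DIR command to the SAS log through a pipe. */
%let PATH   = c:\temp ;   /* hypothetical directory to list */
%let SUBDIR = ;           /* set to /s to include subdirectories */
filename DIRLIST pipe "dir /-c &SUBDIR ""&PATH""" ;
data _null_ ;
   infile DIRLIST truncover ;
   input ;                /* read each record into the _infile_ buffer */
   put _infile_ ;         /* write the record to the log */
run ;
filename DIRLIST clear ;
```

Setting SUBDIR to /s shows how much longer the listing becomes when subdirectories are included.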
The regex will probably, in the majority of cases, be a simple string, e.g., /simple search string/, but may be any regex suitable for parsing by PRXPARSE. Regular expressions are explained in detail in the Functions and CALL Routines section of the Dictionary of Language Elements, which is contained in the SAS Language Dictionary. Additional information may be obtained by searching the World Wide Web for "Perl regular expression".
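A regex with metacharacters can be tested against sample text before it is passed to the macro. The sketch below uses the telephone-number pattern from the usage examples in the macro's header comment; the two test strings are made up.

```sas
/* Sketch: test a telephone-number regex of the form (nnn)nnn-nnnn or (nnn) nnn-nnnn. */
data _null_ ;
   length text $40 ;
   prx = prxparse( '/\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d/' ) ;
   do text = 'Call (555) 234-5678 today', 'Call 555-234-5678 today' ;
      pos = prxmatch( prx, text ) ;   /* 6 for the first string, 0 for the second */
      put text= pos= ;
   end ;
run ;
```

The second string fails because the pattern requires the area code to be enclosed in parentheses.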
Here is an example in which the goal of the search was to find all of the SAS programs that contained the string "pipe" in the directory containing the %SASGREP macro. A case-insensitive search was performed, and the full set of descriptive information was printed. The macro invocation is given below.
%SASGREP( c:\Home\My SAS Files\9.1\SASGREP\*.sas, /pipe/i, ls=97, print=dnpst )
A sample of the results of the search is given below.
%SASGREP Listing
Perl Regular Expression Search Using Regexp=/pipe/i

Directory | File | Date | Time | Size | Line # | Line |
c:\Home\My SAS Files\9.1\RandD\SASGREP | sasgrep.sas | 032504 | 10:03 | 11,772 | 41 | * use Windows pipe with file reference to execute 'dir' command to obtain directory contents |
 | sasgrep.sas | 032504 | 10:03 | 11,772 | 42 | * parse pipe output as if it were a file to extract file names, other info |
 | testSASGREP.sas | 081204 | 13:20 | 326 | 5 | %SASGREP( c:\# Home\My SAS Files\9.1\RandD\SASGREP\*.sas, /pipe/, case=i, ls=97, print=dnpst ) |
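When the printed listing is not needed, the matches can be kept in a dataset and summarized instead. The following sketch assumes the %SASGREP macro has already been compiled in the session; the dataset name `pipe_hits` is made up for illustration.

```sas
/* Sketch: store the matches without printing them, then count matched lines per file. */
%SASGREP( c:\Home\My SAS Files\9.1\SASGREP\*.sas, /pipe/i, out=pipe_hits, report=n ) ;

proc freq data=pipe_hits ;
   tables filename / nocum nopercent ;   /* one row per file, with its number of matched lines */
run ;
```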
My thanks to Jason Secosky of SAS, who generously agreed to review this article and whose comments improved %SASGREP.
About the Author
Ross Bettinger is a SAS Analytical Consultant. He provides support for Enterprise Miner and has been involved with data mining projects for 9 years. He has been a SAS user for 17 years. His professional interests are related to data mining, statistical analysis of data, feature selection and transformation, model building, and algorithm development.
These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.
%macro SASGREP( PATH /* Windows path to directory the files of which to search */
, PRX /* Perl regular expression used to match strings in files */
, CASE= /* [optional] case-sensitivity flag */
, LS=120 /* [optional] line size for printing results of match */
, OUT= /* [optional] name of SAS dataset containing results of match */
, PRINT= /* [optional] switches to print date, time, line # of matched lines */
, REPORT=Y /* [optional] switch to create results of search */
, SUBDIR=Y /* [optional] switch to search subdirectories of &PATH */
) ;
/* PURPOSE: create SAS version of UNIX Global Regular Expression Print (GREP) utility
*
* NOTE: %SASGREP is designed to be run on SAS installations using the Windows XP operating system
*
* NOTE: &PATH must contain valid Windows path, e.g., 'c:' or 'c:\documents and settings'
*
* NOTE: &PRX must be a Perl regular expression (PRX). in the nominal case where a simple string
* search is to be performed, &PRX may consist simply of the delimited character string sought,
* e.g., /Text to be found/, and no other PRX metacharacters.
*
* the syntax of a PRX may be found in the Functions and CALL Routines section of the
* Dictionary of Language Elements, which is contained in the SAS Language Dictionary
*
* NOTE: &CASE controls the sensitivity to case (upper-case different from lower-case)
* default: upper/lower case is important, e.g., "Case" ^= "CASE" ^= "case"
* I ::= case-insensitive search will be performed
* PRX metacharacters will be added to simple search string
*
* NOTE: &PRINT controls printing of date file last written, time file last written,
* line # in file of match, size of file
* D ::= print date file last written
* N ::= print line # of line matched relative to start of file
* P ::= print directory path containing file in which string was matched
* S ::= print size of file in bytes
* T ::= print time file last written
*
* NOTE: if &SUBDIR = Y then all subdirectories of &PATH will be searched
* otherwise, only the path named in &PATH will be searched
*
* ALGORITHM:
* use Windows pipe with file reference to execute 'dir' command to obtain directory contents
* parse pipe output as if it were a file to extract file names, other info
* output complete path name to dataset 'filenames'
* apply Perl regular expression pattern matching to files in 'filenames'
* write successful matches to dataset &DSNOUT ( = &OUT if requested )
* report successful matches ( if requested )
*
* EXAMPLES OF USE:
* *** simple string search for "SAS" ***
* %SASGREP( c:/documents and settings/user_name/My SAS Files, /SAS/ )
*
* *** simple string search for "sasgrep" ***
* %SASGREP( c:\my docs\*.txt, /sasgrep/, out=sasgrep_out, report=n, print=dnst )
*
* *** simple case-insensitive string search for "sasgrep" ***
* %SASGREP( c:\my docs\*.sas, /sasgrep/, case=i, out=sasgrep_out, report=n, print=dnst )
*
* *** simple case-insensitive string search for "Perl regular expression" ***
* %SASGREP( c:/path, /Perl regular expression/, case=i, out=search_results, report=n, print= )
*
* *** string search for telephone number in format '(nnn)nnn-nnnn' or '(nnn) nnn-nnnn' ***
* %let PRX = /\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d/ ;
* %SASGREP( c:/path, &PRX, case=i, out=search_results, report=n, print= )
*/
/* verify syntax of &PRX. if error, exit the macro */
data _null_ ;
prx = prxparse( "&PRX" ) ;
call symput( 'PRXPARSE_ERROR', put( _error_, 1. )) ;
run ;
%if &PRXPARSE_ERROR %then %goto L9999 ;
%let DELIM = ' ' ;
%let CASE = %eval( %upcase( "&CASE" ) = "I" ) ;
%if %length( &OUT ) > 0 %then %let DSNOUT = &OUT ; %else %let DSNOUT = sasgrep ;
%if %length( &PRINT ) > 0
%then %do ;
%let PRINT = %upcase( &PRINT ) ;
%let P_DATE = %eval( %index( &PRINT, D ) > 0 ) ;
%let P_NUM = %eval( %index( &PRINT, N ) > 0 ) ;
%let P_PATH = %eval( %index( &PRINT, P ) > 0 ) ;
%let P_SIZE = %eval( %index( &PRINT, S ) > 0 ) ;
%let P_TIME = %eval( %index( &PRINT, T ) > 0 ) ;
%end ;
%else
%do ; %let P_DATE = 0 ; %let P_NUM = 0 ; %let P_PATH = 0 ; %let P_SIZE = 0 ; %let P_TIME = 0 ; %end ;
%if %length( &REPORT ) > 0 %then %let REPORT = %upcase( &REPORT ) ; %else %let REPORT = N ;
%if %upcase( &SUBDIR ) = Y %then %let SUBDIR = /s ; %else %let SUBDIR = ;
%let SEARCHSAS = %eval( %index( &PATH, . ) > 0 ) ; /* flag to control parsing of filename for 'SAS' */
/*============================================================================*/
/* external storage references
/*============================================================================*/
/* run Windows "dir" DOS command as pipe to get contents of data directory
 * /-c: no thousands separators in size, /q: show file owner, /t:w: show date/time last written
 */
filename DIRLIST pipe "dir /-c /q &SUBDIR /t:w ""&PATH""" ;
/*############################################################################*/
/* begin executable code
/*############################################################################*/
/* use Windows pipe to recursively find all files in &PATH
* parse out extraneous data, including unreadable directory paths
*
* directory list structure:
* "Directory of" record precedes listing of contents of directory:
*
* Directory of \ [ \ \... ]
* mm/dd/yy hh:mm:ss [AM|PM] ['' | size ] filename.type
*
* example:
*
* Volume in drive C is WXP
* Volume Serial Number is 18C2-3BAA
*
* Directory of C:\Documents and Settings\robett\My Documents\My SAS Files\V8\Test
*
* 05/21/03 10:58 AM <DIR> CARYNT\robett .
* 05/21/03 10:58 AM <DIR> CARYNT\robett ..
* 12/24/03 10:22 AM <DIR> CARYNT\robett Codebook
* 04/23/01 02:42 PM 387 CARYNT\robett printCharMat.sas
* 10/09/03 11:35 AM 20582 CARYNT\robett test.log
* 10/28/03 08:02 AM 58682 CARYNT\robett test.lst
* 10/09/03 11:35 AM 1575 CARYNT\robett test.sas
*/
data filenames( keep= date dir_path filename size time ) ;
format date mmddyy8. time timeampm8. ;
length dir_path filename $256 temp $16 ;
retain dir_path prx ;
if _n_ = 1
then do ;
/* establish regex to parse input record for date, time, size, owner, filename
* regex matches
* (\d{2}\/\d{2}\/\d{4})\s+ 'mm/dd/ccyy' and >= 1 white space
* (\d\d:\d\d (?:AM|PM))\s+ 'hh:mm AM' or 'hh:mm PM' and >= 1 white space
* (\d+)\s+ nnnnnnn (file size) and >= 1 white space
* (\S+)\s+ >= 1 chars that are not white space (file owner) and >= 1 white space
* (\S.*) one char that is not white space followed by >= 0 chars of any kind (filename)
*/
prx = prxparse('/(\d{2}\/\d{2}\/\d{4})\s+(\d\d:\d\d (?:AM|PM))\s+(\d+)\s+(\S+)\s+(\S.*)/') ;
end ;
infile dirlist ; /* use pipe to get filenames */
input ; /* read into _infile_ buffer. faster than reading into variable */
/* parse directory record for directory path
* parse non-directory record for filename, associated information
*/
if prxmatch( prx, _infile_ )
then do;
filename = prxposn( prx, 5, _infile_ ) ;
if filename in ( '.' '..' ) then delete ;
date = input( prxposn( prx, 1, _infile_ ), mmddyy10. ) ;
time = input( prxposn( prx, 2, _infile_ ), time8. ) ;
size = input( prxposn( prx, 3, _infile_ ), best. ) ;
/* bug in DOS DIR command: if 'dir path\*.sas' is specified, SAS datasets are listed as well as SAS program files
* correct error by omitting observations containing SAS dataset names
*/
if &SEARCHSAS
then do ;
ndx = index( filename, '.' ) ;
if ndx > 0
then do;
temp = upcase( substr( filename, ndx + 1 )) ;
if temp =: 'SAS' & length( trim( temp )) > 3 then delete ;
end;
end ;
output ;
end ;
else
if upcase( scan( _infile_, 1, &DELIM )) = 'DIRECTORY'
then dir_path = left( substr( _infile_, length( "Directory of" ) + 2 )) ;
run ;
/* use path+file name to read external files, perform pattern recognition */
options nonotes ; /* turn off printing of NOTES: since each filename read from pipe is printed to log file */
data &DSNOUT ;
length file2read $256 line $32767 ;
set filenames ;
retain prx ;
file2read = catx( '\', dir_path, filename ) ;
infile dummy filevar=file2read end=lastobs length=reclen ;
lineno = 0 ; /* initialize line counter relative to file being read */
if _n_ = 1
then do ;
%if &CASE %then %let PRX = &PRX.i ; /* create regex for case-insensitive search */
prx = prxparse( "&PRX" ) ; /* initialize regular expression environment */
end ;
/* read from file2read, match regex to chars in line read from file. if match, output */
do while( not lastobs ) ;
input line $varying32767. reclen ;
lineno + 1 ;
if prxmatch( prx, line ) > 0 then output ;
end ;
drop prx ;
run ;
/*============================================================================*/
/* create report of matched lines, if requested
/* set PROC REPORT line column size adaptively according to items to be printed
/*============================================================================*/
/* subtract field width + 1 space character per optional descriptive item
* max of &LS chars/line - 1 - length( filename ) - 1 (print control char)
*/
%let LINE_SIZE = %eval( &LS - 17 - 1 - 1 - 17*&P_PATH - 7*&P_DATE - 6*&P_TIME - 8*&P_SIZE - 5 *&P_NUM ) ;
%if &REPORT = Y
%then %do ;
title1 '%SASGREP Listing' ;
title2 "Perl Regular Expression Search Using Regexp=&PRX" ;
proc report data=&DSNOUT headskip nocenter nowindows spacing=1 split='~' ;
column
%if &P_PATH %then dir_path ;
filename
%if &P_DATE %then date ;
%if &P_TIME %then time ;
%if &P_SIZE %then size ;
%if &P_NUM %then lineno ;
line
;
%if &P_PATH %then %str( define dir_path / order width=16 flow 'Directory' ; ) ;
define filename / display width=16 flow 'File' ;
%if &P_DATE %then %str( define date / display format=mmddyy6. 'Date' ; ) ;
%if &P_TIME %then %str( define time / display format=time5. 'Time' ; ) ;
%if &P_SIZE %then %str( define size / display format=comma7. 'Size' ; ) ;
%if &P_NUM %then %str( define lineno / display format=4. 'Line~#' ; ) ;
define line / display width=&LINE_SIZE flow 'Line' ;
run ;
title ;
%end ;
option notes ; /* restore printing of NOTES: messages to log file */
%L9999:
%mend SASGREP ;
Type: | Sample |
Topic: | SAS Reference ==> Macro Third Party ==> Programming ==> Perl |
Date Modified: | 2006-01-11 03:03:01 |
Date Created: | 2005-05-12 10:46:19 |
Product Family | Product | Host | SAS Release (Starting) | SAS Release (Ending) |
SAS System | Base SAS | Microsoft® Windows® for 64-Bit Itanium-based Systems | 9 TS M0 | n/a |