FOCUS AREAS

Perl regular expressions

Base SAS

Example: Parsing Delimited Data That INFILE DSD Cannot Parse

Perl Regular Expressions are supported beginning with SAS®9.


This example parses data that INFILE with DSD cannot. The data is comma separated, but some of the values contain the delimiter. If the value contains the delimiter, it is surrounded by asteriks. The quote character used by DSD cannot be changed, so DSD cannot be used in this case. A regexp can be used, as is done in this example.

The example first creates a file to parse. Then the file is read and an unsuccessful attempt to use DSD to read the file is made. The next step uses traditional SAS programming to parse the data, but the code is difficult to write and error prone. The last step uses a regexp to parse the line in a simple and convenient way.

filename temp temp;

/* Create the file that we'll parse up in following data steps.     */
/* The file format will have the author, a comma, then the quote,   */
/*  another comma, then the relevance of the quote as a number.     */
/* So that quote characters in the quote don't get double quoted,   */
/*  we know there will never be an asterisk in the quote, so place  */
/*  an asterisk on each end of the quote.                           */
data _null_;
   length author $ 16 quote $ 80 relevance 8;
   input author $16.;
   input quote $80.;
   input relevance;

   file temp dlm=",";
   put author quote relevance;
datalines;
Lily Tomlin
*The trouble with the rat race is that even if you win, you're still a rat.*
4
Will Rogers
*Our constitution protects aliens, drunks and U.S. Senators.*
2
Suzanne Necker
*Fortune does not change men, it unmasks them.*
1
Dan Quayle
*Verbosity leads to unclear, inarticulate things.*
3
;

/* Let's look at the file we need to parse. */
data _null_;
   infile temp;
   input;
   put _infile_;
run;

/* Try to use DSD to read the file, but cannot since DSD doesn't          */
/*  realize the delimiter is bounded by asterisks instead of quote chars. */
/* So, DSD will stop on the first comma in the quote.  Not good.          */
data _null_;
   length author $ 16 quote $ 80 relevance 8;
   infile temp dsd;
   input author quote relevance;
   put relevance author $16. quote;
run;

/* The easiest way now might be to parse the line up with functions. */
/* Thought and trial and error must go into keeping track of where   */
/*  particular characters are located.  This took some time to get   */
/*  right.                                                           */
data _null_;
   length author $ 16 quote $ 80 relevance 8;
   infile temp;
   input;
   pos1 = indexc(_infile_, ",");
   author = substr(_infile_, 1, pos1-1);
   pos2 = indexc(substr(_infile_,pos1+2), "*");
   quote = substr(_infile_, pos1+2, pos2-1);
   relevance = input(substr(_infile_, pos2+pos1+3), BEST12.);
   put relevance author $16. quote;
run;

/* A better solution is to use a regular expression (regexp).      */
/* The regexp can locate where particular pieces of the data       */
/*  are based on a pattern and we can later query the match        */
/*  to find where the submatches occurred.                         */
/* The regexp used is broken up below.  The best way to understand */
/*  a regexp is to break it into its parts.                        */
/*
   (.+)        Look for one or more characters
   ,           a comma
   \*          an asterisk
   (.+)        one more more chars
   \*          an asterisk
   ,           then a comma
   (\d+)       then one or more digits
*/
/* Grouping items with parens enables use to use PRXPOSN to query  */
/*  where the submatch matched in the string and its length.       */
/* This solution took less time to write than the prior one.       */
data _null_;
   retain re;

   length author $ 16 quote $ 80 relevance 8;
   infile temp;
   input;

   if _N_ = 1 then
      re = prxparse("/(.+),\*(.+)\*,(\d+)/");

   if prxmatch(re, _infile_ ) then do;
      call prxposn(re, 1, pos, len);
      author = substr(_infile_, pos, len);
      call prxposn(re, 2, pos, len);
      quote = substr(_infile_, pos, len);
      call prxposn(re, 3, pos, len);
      relevance = input(substr(_infile_, pos, len), BEST12.);

      put relevance author $16. quote;
   end;
   else do;
      putlog "ERROR: Couldn't find valid quote for line:";
      putlog "ERROR- " _INFILE_;
   end;
run;

filename temp clear;

Your Turn

The developers, testers and documentation folk that bring you Base SAS Software are very excited about the potential of these new capabilities of the SAS System. You can send electronic mail to Base.Research@sas.com with your comments.