SAS(R) 9.2 Language Reference: Dictionary

Functions and CALL Routines

Pattern Matching Using Perl Regular Expressions (PRX)


Definition of Pattern Matching

Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step, as well as to make several substitutions in a string in one step. You do this by using the PRX functions and CALL routines in the DATA step.


Definition of Perl Regular Expression (PRX) Functions and CALL Routines

The term Perl regular expression (PRX) functions and CALL routines refers to a group of functions and CALL routines that use a modified version of Perl as a pattern-matching language to parse character strings. You can do the following:

  • Search for a pattern of characters within a string.

  • Extract a substring from a string.

  • Search and replace text with other text.

  • Parse large amounts of text, such as Web logs or other text data.

Perl regular expressions comprise the character string matching category for functions and CALL routines. For a short description of these functions and CALL routines, see the Functions and CALL Routines by Category.
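The tasks listed above can be sketched in a few lines. This is a minimal illustration, not an excerpt from the reference: the variable names (TEXT, ZIP, CHANGED, POSITION) and the sample string are assumptions made for the example.

```sas
data _null_;
   length text changed $ 60 zip $ 5;
   text = "Ship to Raleigh, NC 27506 by Friday";

   /* Search: PRXMATCH returns the position of the first match, or 0. */
   position = prxmatch('/\d{5}/', text);

   /* Extract: compile the pattern, match it, then read capture buffer 1. */
   re = prxparse('/(\d{5})/');
   if prxmatch(re, text) then
      zip = prxposn(re, 1, text);

   /* Replace: the s/// form substitutes text; -1 requests all matches. */
   changed = prxchange('s/Friday/Monday/', -1, text);

   putlog position= zip= changed=;
run;
```

PRXMATCH and PRXCHANGE accept either a pattern identifier from PRXPARSE or the regular expression itself; the sketch uses both forms.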


Benefits of Using Perl Regular Expressions in the DATA Step

Using Perl regular expressions in the DATA step enhances search-and-replace options in text. You can use Perl regular expressions to perform the following tasks:

  • Validate data.

  • Replace text.

  • Extract a substring from a string.

  • Write Perl debug output to the SAS log.

You can write SAS programs that do not use regular expressions to produce the same results as you do when you use Perl regular expressions. However, the code without the regular expressions requires more function calls to handle character positions in a string and to manipulate parts of the string.

Perl regular expressions combine most, if not all, of these steps into one expression. The resulting code is less prone to error, easier to maintain, and clearer to read.
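As a rough comparison, the following sketch extracts an area code twice: once with INDEX and SUBSTR, and once with a single regular expression. The variable names and the sample value are hypothetical, and the non-PRX branch is deliberately simple (it does not check that the extracted characters are digits).

```sas
data _null_;
   length phone $ 16 area_plain area_prx $ 3;
   phone = "(919) 319-1677";

   /* Without regular expressions: locate each parenthesis, then cut. */
   open_pos  = index(phone, "(");
   close_pos = index(phone, ")");
   if open_pos > 0 and close_pos = open_pos + 4 then
      area_plain = substr(phone, open_pos + 1, 3);

   /* With a regular expression: one pattern both locates and captures */
   /* three digits that are enclosed in parentheses.                   */
   re = prxparse('/\((\d{3})\)/');
   if prxmatch(re, phone) then
      area_prx = prxposn(re, 1, phone);

   putlog area_plain= area_prx=;
run;
```

The PRX branch also verifies that the three characters are digits, which the INDEX and SUBSTR branch would need an additional function call to do.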


Using Perl Regular Expressions in the DATA Step


Perl Artistic License Compliance

Perl regular expressions are supported beginning with SAS®9.

The PRX functions use a modified version of Perl 5.6.1 to perform regular expression compilation and matching. Perl is compiled into a library for use with SAS. This library is shipped with SAS®9. The modified and original Perl 5.6.1 files are freely available in a ZIP file from http://support.sas.com/rnd/base. The ZIP file is provided to comply with the Perl Artistic License and is not required in order to use the PRX functions. Each of the modified files has a comment block at the top of the file describing how and when the file was changed. The executables were given nonstandard Perl names. The standard version of Perl can be obtained from http://www.perl.com.

Only Perl regular expressions are accessible from the PRX functions. Other parts of the Perl language are not accessible. The modified version of Perl regular expressions does not support the following items:

  • Perl variables (except the capture buffer variables $1 - $n, which are supported).

  • The regular expression options /c and /g, and the /e option with substitutions.

  • The regular expression option /o in SAS 9.0. (It is supported in SAS 9.1 and later.)

  • Named characters, which use the \N{name} syntax.

  • The metacharacters \pP, \PP, and \X.

  • Executing Perl code within a regular expression, which includes the syntax (?{code}), (??{code}), and (?p{code}).

  • Unicode pattern matching.

  • Using ?PATTERN?. ? is treated like an ordinary regular expression start and end delimiter.

  • The metacharacter \G.

  • Perl comments between a pattern and replacement text. For example: s{regexp} # perl comment {replacement} is not supported.

  • Matching backslashes with m/\\\\/. Instead use m/\\/ to match a backslash.
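Two of the items above can be illustrated briefly: the capture buffer variables $1 and $2 in replacement text, and matching a single backslash with \\. This is a sketch with made-up variable names and values. Because the /g option is not supported, the example passes -1 as the second argument of PRXCHANGE to request that all matches be replaced.

```sas
data _null_;
   length name swapped $ 40;
   name = "Johnson, Jerome";

   /* $1 and $2 refer to the first and second capture buffers. */
   swapped = prxchange('s/(\w+), (\w+)/$2 $1/', -1, name);

   /* A single backslash in the source is matched by \\ in the pattern. */
   bs_pos = prxmatch('/\\/', "C:\temp\log.txt");

   putlog swapped= bs_pos=;
run;
```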


Syntax of Perl Regular Expressions

Perl regular expressions consist of characters and special characters that are called metacharacters. When performing a match, SAS searches a source string for a substring that matches the Perl regular expression that you specify. Using metacharacters enables SAS to perform special actions when searching for a match:

  • If you use the metacharacter \d, SAS matches a digit from 0 through 9.

  • If you use /\d+/, SAS finds the digits 27506 in the string "Raleigh, NC 27506".

  • If you use /world/, SAS finds the substring "world" in the string "Hello world!".

You can see lists of PRX metacharacters in Tables of Perl Regular Expression (PRX) Metacharacters.
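A minimal sketch of these matches follows. It assumes the pattern /\d+/ (one or more digits) for the digit search; the variable names are chosen for illustration only.

```sas
data _null_;
   length source $ 20 digits $ 5;
   source = "Raleigh, NC 27506";

   /* \d+ matches a run of one or more digits. CALL PRXSUBSTR */
   /* returns the start position and length of the match.     */
   re = prxparse('/\d+/');
   call prxsubstr(re, source, pos, len);
   if pos > 0 then
      digits = substr(source, pos, len);

   /* Ordinary characters match themselves. */
   world_pos = prxmatch('/world/', "Hello world!");

   putlog pos= len= digits= world_pos=;
run;
```

The PUTLOG statement writes pos=13 len=5 digits=27506 world_pos=7 to the SAS log.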


Example 1: Validating Data

You can test for a pattern of characters within a string. For example, you can examine a string to determine whether it contains a correctly formatted telephone number. This type of test is called data validation.

The following example validates a list of phone numbers. To be valid, a phone number must have one of the following forms: (XXX) XXX-XXXX or XXX-XXX-XXXX.

data _null_;  /* [1] */
   if _N_ = 1 then
      do;
         paren = "\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d";  /* [2] */
         dash = "[2-9]\d\d-[2-9]\d\d-\d\d\d\d";  /* [3] */
         expression = "/(" || paren || ")|(" || dash || ")/";  /* [4] */
         retain re;
         re = prxparse(expression);  /* [5] */
         if missing(re) then  /* [6] */
            do;
               putlog "ERROR: Invalid expression " expression;  /* [7] */
               stop;
            end;
      end;

   length first last home business $ 16;
   input first last home business;

   if ^prxmatch(re, home) then  /* [8] */
      putlog "NOTE: Invalid home phone number for " first last home;

   if ^prxmatch(re, business) then  /* [9] */
      putlog "NOTE: Invalid business phone number for " first last business;

   datalines;
Jerome Johnson (919)319-1677 (919)846-2198
Romeo Montague 800-899-2164 360-973-6201
Imani Rashid (508)852-2146 (508)366-9821
Palinor Kent . 919-782-3199
Ruby Archuleta . .
Takei Ito 7042982145 .
Tom Joad 209/963/2764 2099-66-8474
;
run;

The following items correspond to the lines that are numbered in the DATA step that is shown above.

[1] Create a DATA step.

[2] Build a Perl regular expression to identify a phone number that matches (XXX)XXX-XXXX, and assign the variable PAREN to hold the result. Use the following syntax elements to build the Perl regular expression:

\(

matches the open parenthesis in the area code.

[2-9]

matches a digit from 2 through 9, which is the first digit in the area code.

\d

matches a digit, which is the second number in the area code.

\d

matches a digit, which is the third number in the area code.

\)

matches the closed parenthesis in the area code.

?

matches the space (which is the preceding subexpression) zero or one time. Spaces are significant in Perl regular expressions. They match a space in the text that you are searching. If a space precedes the question mark metacharacter (as it does in this case), the pattern matches either zero spaces or one space in this position in the phone number.

[3] Build a Perl regular expression to identify a phone number that matches XXX-XXX-XXXX, and assign the variable DASH to hold the result.

[4] Build a Perl regular expression that concatenates the regular expressions for (XXX)XXX-XXXX and XXX-XXX-XXXX. The concatenation enables you to search for both phone number formats from one regular expression.

The PAREN and DASH regular expressions are placed within parentheses. The bar metacharacter (|) that is located between PAREN and DASH instructs the compiler to match either pattern. The slashes around the entire pattern tell the compiler where the start and end of the regular expression are located.

[5] Pass the Perl regular expression to PRXPARSE and compile the expression. PRXPARSE returns a pattern identifier for the compiled pattern. Using this identifier with other Perl regular expression functions and CALL routines enables SAS to perform operations with the compiled Perl regular expression.

[6] Use the MISSING function to check whether the regular expression was successfully compiled.

[7] Use the PUTLOG statement to write an error message to the SAS log if the regular expression did not compile.

[8] Search for a valid home phone number. PRXMATCH uses the value from PRXPARSE along with the search text and returns the position where the regular expression was found in the search text. If there is no match for the home phone number, the PUTLOG statement writes a note to the SAS log.

[9] Search for a valid business phone number. PRXMATCH uses the value from PRXPARSE along with the search text and returns the position where the regular expression was found in the search text. If there is no match for the business phone number, the PUTLOG statement writes a note to the SAS log.

The following lines are written to the SAS log:

NOTE: Invalid home phone number for Palinor Kent  
NOTE: Invalid home phone number for Ruby Archuleta  
NOTE: Invalid business phone number for Ruby Archuleta  
NOTE: Invalid home phone number for Takei Ito 7042982145
NOTE: Invalid business phone number for Takei Ito  
NOTE: Invalid home phone number for Tom Joad 209/963/2764
NOTE: Invalid business phone number for Tom Joad 2099-66-8474
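Because the expression in this example is not anchored, PRXMATCH also reports a match when a valid number is embedded in a longer value. The following sketch shows one possible stricter variant that uses the ^ and $ metacharacters. The variable names (LOOSE_RE, STRICT_RE, LOOSE, STRICT) and the test values are assumptions made for illustration, not part of the example above.

```sas
data _null_;
   length phone $ 16;

   /* Unanchored pattern, equivalent to the expression built above. */
   loose_re = prxparse('/(\([2-9]\d\d\) ?[2-9]\d\d-\d\d\d\d)|([2-9]\d\d-[2-9]\d\d-\d\d\d\d)/');

   /* Anchored variant: ^ and $ require the whole value to match.  */
   /* \s* tolerates the blanks that pad a fixed-length variable.   */
   strict_re = prxparse('/^\s*(\([2-9]\d\d\) ?|[2-9]\d\d-)[2-9]\d\d-\d\d\d\d\s*$/');

   do phone = "(919)319-1677", "919-782-3199", "1919-782-31990";
      loose  = (prxmatch(loose_re, phone) > 0);
      strict = (prxmatch(strict_re, phone) > 0);
      putlog phone= loose= strict=;
   end;
run;
```

For the third value, the unanchored pattern finds 919-782-3199 inside the longer string, whereas the anchored pattern rejects the value.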


Example 2: Replacing Text

You can use Perl regular expressions to find specific characters within a string. You can then remove the characters or replace them with other characters. In this example, the two occurrences of the less-than character (<) are replaced by &lt; and the two occurrences of the greater-than character (>) are replaced by &gt;.

data _null_;  /* [1] */
   if _N_ = 1 then
      do;
         retain lt_re gt_re;
         lt_re = prxparse('s/</&lt;/');  /* [2] */
         gt_re = prxparse('s/>/&gt;/');  /* [3] */
         if missing(lt_re) or missing(gt_re) then  /* [4] */
            do;
               putlog "ERROR: Invalid regexp.";  /* [5] */
               stop;
            end;
      end;
   input;
   call prxchange(lt_re, -1, _infile_);  /* [6] */
   call prxchange(gt_re, -1, _infile_);  /* [7] */
   put _infile_;
   datalines4;
The bracketing construct ( ... ) creates capture buffers. To refer to
the digit'th buffer use \<digit> within the match. Outside the match
use "$" instead of "\". (The \<digit> notation works in certain
circumstances outside the match. See the warning below about \1 vs $1
for details.) Referring back to another part of the match is called a
backreference.
;;;;

The following items correspond to the numbered lines in the DATA step that is shown above.

[1] Create a DATA step.

[2] Use metacharacters to create a substitution syntax for a Perl regular expression, and compile the expression. The substitution syntax specifies that a less-than character (<) in the input is replaced by the value &lt; in the output.

[3] Use metacharacters to create a substitution syntax for a Perl regular expression, and compile the expression. The substitution syntax specifies that a greater-than character (>) in the input is replaced by the value &gt; in the output.

[4] Use the MISSING function to check whether the Perl regular expression compiled without error.

[5] Use the PUTLOG statement to write an error message to the SAS log if either regular expression did not compile.

[6] Call the PRXCHANGE routine. Pass the LT_RE pattern-id, and search for and replace all matching patterns (the value -1 requests all matches). Put the results in _INFILE_.

[7] Call the PRXCHANGE routine. Pass the GT_RE pattern-id, and search for and replace all matching patterns. Put the results in _INFILE_. The PUT statement then writes the modified record to the SAS log.

The following lines are written to the SAS log:

The bracketing construct ( ... ) creates capture buffers. To refer to
the digit'th buffer use \&lt;digit&gt; within the match. Outside the match
use "$" instead of "\". (The \&lt;digit&gt; notation works in certain
circumstances outside the match. See the warning below about \1 vs $1
for details.) Referring back to another part of the match is called a
backreference.
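The same substitutions can also be sketched with the PRXCHANGE function, which returns the changed string instead of modifying its argument in place. The variable names and the sample text below are assumptions for illustration. The patterns are enclosed in single quotation marks so that the macro processor does not try to resolve &lt and &gt as macro variable references.

```sas
data _null_;
   length text escaped $ 80;
   text = "To refer to the digit'th buffer use \<digit> within the match.";

   /* Nested calls: replace every < first, then every > in the result. */
   escaped = prxchange('s/>/&gt;/', -1,
                       prxchange('s/</&lt;/', -1, text));

   putlog escaped=;
run;
```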


Example 3: Extracting a Substring from a String

You can use Perl regular expressions to find and easily extract text from a string. In this example, the DATA step creates a subset of North Carolina business phone numbers. The program extracts the area code and checks it against a list of area codes for North Carolina.

data _null_;  /* [1] */
   if _N_ = 1 then
      do;
         paren = "\(([2-9]\d\d)\) ?[2-9]\d\d-\d\d\d\d";  /* [2] */
         dash = "([2-9]\d\d)-[2-9]\d\d-\d\d\d\d";  /* [3] */
         regexp = "/(" || paren || ")|(" || dash || ")/";  /* [4] */
         retain re;
         re = prxparse(regexp);  /* [5] */
         if missing(re) then  /* [6] */
            do;
               putlog "ERROR: Invalid regexp " regexp;  /* [7] */
               stop;
            end;

         retain areacode_re;
         areacode_re = prxparse("/828|336|704|910|919|252/");  /* [8] */
         if missing(areacode_re) then
            do;
               putlog "ERROR: Invalid area code regexp";
               stop;
            end;
      end;

   length first last home business $ 16;
   length areacode $ 3;
   input first last home business;

   if ^prxmatch(re, home) then
      putlog "NOTE: Invalid home phone number for " first last home;

   if prxmatch(re, business) then  /* [9] */
      do;
         which_format = prxparen(re);  /* [10] */
         call prxposn(re, which_format, pos, len);  /* [11] */
         areacode = substr(business, pos, len);
         if prxmatch(areacode_re, areacode) then  /* [12] */
            put "In North Carolina: " first last business;
      end;
   else
      putlog "NOTE: Invalid business phone number for " first last business;
   datalines;
Jerome Johnson (919)319-1677 (919)846-2198
Romeo Montague 800-899-2164 360-973-6201
Imani Rashid (508)852-2146 (508)366-9821
Palinor Kent 704-782-4673 704-782-3199
Ruby Archuleta 905-384-2839 905-328-3892
Takei Ito 704-298-2145 704-298-4738
Tom Joad 515-372-4829 515-389-2838
;

The following items correspond to the numbered lines in the DATA step that is shown above.

[1] Create a DATA step.

[2] Build a Perl regular expression to identify a phone number that matches (XXX)XXX-XXXX, and assign the variable PAREN to hold the result. Use the following syntax elements to build the Perl regular expression:

\(

matches the open parenthesis in the area code. The unescaped open parenthesis that follows it in the expression marks the start of the submatch (the capture buffer that holds the area code).

[2-9]

matches a digit from 2 through 9, which is the first digit in the area code.

\d

matches a digit, which is the second number in the area code.

\d

matches a digit, which is the third number in the area code.

\)

matches the closed parenthesis in the area code. The unescaped closed parenthesis that precedes it in the expression marks the end of the submatch.

?

matches the space (which is the preceding subexpression) zero or one time. Spaces are significant in Perl regular expressions. They match a space in the text that you are searching. If a space precedes the question mark metacharacter (as it does in this case), the pattern matches either zero spaces or one space in this position in the phone number.

[3] Build a Perl regular expression to identify a phone number that matches XXX-XXX-XXXX, and assign the variable DASH to hold the result.

[4] Build a Perl regular expression that concatenates the regular expressions for (XXX)XXX-XXXX and XXX-XXX-XXXX. The concatenation enables you to search for both phone number formats from one regular expression.

The PAREN and DASH regular expressions are placed within parentheses. The bar metacharacter (|) that is located between PAREN and DASH instructs the compiler to match either pattern. The slashes around the entire pattern tell the compiler where the start and end of the regular expression are located.

[5] Pass the Perl regular expression to PRXPARSE and compile the expression. PRXPARSE returns a pattern identifier for the compiled pattern. Using this identifier with other Perl regular expression functions and CALL routines enables SAS to perform operations with the compiled Perl regular expression.

[6] Use the MISSING function to check whether the Perl regular expression compiled without error.