FOCUS AREAS

Perl regular expressions

Base SAS

Perl Regular Expression Debug Support

Perl Regular Expressions are supported beginning with SAS®9.


Introduction

Information about how a regexp is compiled or the steps taken when matching a regexp against a character value could be valuable when a regexp does not do what you intended.

Perl produces regexp debugging output by using the -Dr switch on the command line or by using use re 'debug' withing a program. The data step provides this functionality with the PRXDEBUG call routine. CALL PRXDEBUG(1) turns on debug output and CALL PRXDEBUG(0) turns off debug output. All debug output is sent to the SAS log.

This paper will present an example of and describe Perl debug output. All of this material is taken from the perldebguts man page or at http://search.cpan.org/~gsar/perl-5.6.1/pod/perldebguts.pod#Debugging_regular_expressions.

Example Code

Perl debug output is turned on between the two calls to PRXDEBUG in the program below. The output produced by PRXPARSE and PRXMATCH will be discussed in the following sections.
data _null_;
   call prxdebug(1);
   putlog 'PRXPARSE:';
   re = prxparse('/[bc]d(ef*g)+h[ij]k$/');
   putlog 'PRXMATCH:';
   pos = prxmatch(re, 'abcdefg__gh__');
   call prxdebug(0);
run;

PRXPARSE Output

Compiling REx `[bc]d(ef*g)+h[ij]k$'
size 41 first at 1
rarest char g at 0
rarest char d at 0
   1: ANYOF[bc](10)
  10: EXACT <d>(12)
  12: CURLYX[0] {1,32767}(26)
  14:   OPEN1(16)
  16:     EXACT <e>(18)
  18:     STAR(21)
  19:       EXACT <f>(0)
  21:     EXACT <g>(23)
  23:   CLOSE1(25)
  25:   WHILEM[1/1](0)
  26: NOTHING(27)
  27: EXACT <h>(29)
  29: ANYOF[ij](38)
  38: EXACT <k>(40)
  40: EOL(41)
  41: END(0)
anchored `de' at 1 floating `gh' at 3..2147483647 (checking floating)
stclass `ANYOF[bc]' minlen 7

The first line shows the pre-compiled form of the regex. The second shows the size of the compiled form (in arbitrary units, usually 4-byte words) and the label id of the first node that does a match.

The last line (split into two lines above) contains optimizer information. In the example shown, the optimizer found that the match should contain a substring de at offset 1, plus substring gh at some offset between 3 and infinity. Moreover, when checking for these substrings (to abandon impossible matches quickly), Perl will check for the substring gh before checking for the substring de. The optimizer may also use the knowledge that the match starts (at the first id) with a character class, and the match cannot be shorter than 7 chars.

The fields of interest which may appear in the last line are

anchored STRING at POS
floating STRING at POS1..POS2
See above.
matching floating/anchored
Which substring to check first.
minlen
The minimal length of the match.
stclass TYPE
Type of first matching node.
noscan
Don't scan for the found substrings.
isall
Means that the optimizer info is all that the regular expression contains, and thus one does not need to enter the regex engine at all.
GPOS
Set if the pattern contains \G.
plus
Set if the pattern starts with a repeated char (as in x+y).
implicit
Set if the pattern starts with .*.
with eval
Set if the pattern contain eval-groups, such as (?{ code }) and (??{ code }).
anchored(TYPE)
If the pattern may match only at a handful of places, (with TYPE being BOL, MBOL, or GPOS. See the table below.

If a substring is known to match at end-of-line only, it may be followed by $, as in floating `k'$.

The optimizer-specific info is used to avoid entering (a slow) regex engine on strings that will not definitely match. If isall flag is set, a call to the regex engine may be avoided even when the optimizer found an appropriate place for the match.

The rest of the output contains the list of nodes of the compiled form of the regex. Each line has format

id: TYPE OPTIONAL-INFO (next-id)

Types of nodes

Here are the possible types, with short descriptions:

    # TYPE arg-description [num-args] [longjump-len] DESCRIPTION

    # Exit points
    END     no  End of program.
    SUCCEED no  Return from a subroutine, basically.

    # Anchors:
    BOL     no  Match "" at beginning of line.
    MBOL    no  Same, assuming multiline.
    SBOL    no  Same, assuming singleline.
    EOS     no  Match "" at end of string.
    EOL     no  Match "" at end of line.
    MEOL    no  Same, assuming multiline.
    SEOL    no  Same, assuming singleline.
    BOUND   no  Match "" at any word boundary
    BOUNDL  no  Match "" at any word boundary
    NBOUND  no  Match "" at any word non-boundary
    NBOUNDL no  Match "" at any word non-boundary
    GPOS    no  Matches where last m//g left off.

    # [Special] alternatives
    ANY     no  Match any one character (except newline).
    SANY    no  Match any one character.
    ANYOF   sv  Match character in (or not in) this class.
    ALNUM   no  Match any alphanumeric character
    ALNUML  no  Match any alphanumeric char in locale
    NALNUM  no  Match any non-alphanumeric character
    NALNUML no  Match any non-alphanumeric char in locale
    SPACE   no  Match any whitespace character
    SPACEL  no  Match any whitespace char in locale
    NSPACE  no  Match any non-whitespace character
    NSPACEL no  Match any non-whitespace char in locale
    DIGIT   no  Match any numeric character
    NDIGIT  no  Match any non-numeric character

    # BRANCH    The set of branches constituting a single choice are hooked
    #       together with their "next" pointers, since precedence prevents
    #       anything being concatenated to any individual branch.  The
    #       "next" pointer of the last BRANCH in a choice points to the
    #       thing following the whole choice.  This is also where the
    #       final "next" pointer of each individual branch points; each
    #       branch starts with the operand node of a BRANCH node.
    #
    BRANCH  node    Match this alternative, or the next...

    # BACK  Normal "next" pointers all implicitly point forward; BACK
    #       exists to make loop structures possible.
    # not used
    BACK    no  Match "", "next" ptr points backward.

    # Literals
    EXACT   sv  Match this string (preceded by length).
    EXACTF  sv  Match this string, folded (prec. by length).
    EXACTFL sv  Match this string, folded in locale (w/len).

    # Do nothing
    NOTHING no  Match empty string.
    # A variant of above which delimits a group, thus stops optimizations
    TAIL    no  Match empty string. Can jump here from outside.

    # STAR,PLUS '?', and complex '*' and '+', are implemented as circular
    #       BRANCH structures using BACK.  Simple cases (one character
    #       per match) are implemented with STAR and PLUS for speed
    #       and to minimize recursive plunges.
    #
    STAR    node    Match this (simple) thing 0 or more times.
    PLUS    node    Match this (simple) thing 1 or more times.

    CURLY   sv 2    Match this simple thing {n,m} times.
    CURLYN  no 2    Match next-after-this simple thing
    #           {n,m} times, set parens.
    CURLYM  no 2    Match this medium-complex thing {n,m} times.
    CURLYX  sv 2    Match this complex thing {n,m} times.

    # This terminator creates a loop structure for CURLYX
    WHILEM  no  Do curly processing and see if rest matches.

    # OPEN,CLOSE,GROUPP ...are numbered at compile time.
    OPEN    num 1   Mark this point in input as start of #n.
    CLOSE   num 1   Analogous to OPEN.

    REF     num 1   Match some already matched string
    REFF    num 1   Match already matched string, folded
    REFFL   num 1   Match already matched string, folded in loc.

    # grouping assertions
    IFMATCH off 1 2 Succeeds if the following matches.
    UNLESSM off 1 2 Fails if the following matches.
    SUSPEND off 1 1 "Independent" sub-regex.
    IFTHEN  off 1 1 Switch, should be preceded by switcher .
    GROUPP  num 1   Whether the group matched.

    # Support for long regex
    LONGJMP off 1 1 Jump far away.
    BRANCHJ off 1 1 BRANCH with long offset.

    # The heavy worker
    EVAL    evl 1   Execute some Perl code.

    # Modifiers
    MINMOD  no  Next operator is not greedy.
    LOGICAL no  Next opcode should set the flag only.

    # This is not used yet
    RENUM   off 1 1 Group with independently numbered parens.

    # This is not really a node, but an optimized away piece of a "long" node.
    # To simplify debugging output, we mark it as if it were a node
    OPTIMIZED   off Placeholder for dump.  

PRXMATCH Output

First of all, when doing a match, one may get no run-time output even if debugging is enabled. This means that the regex engine was never entered and that all of the job was therefore done by the optimizer.

Guessing start of match, REx `[bc]d(ef*g)+h[ij]k$' against `abcdefg__gh__'...
Found floating substr `gh' at offset 9...
Found anchored substr `de' at offset 3...
Starting position does not contradict /^/m...
Does not contradict STCLASS...
Guessed: match at offset 2
Matching REx `[bc]d(ef*g)+h[ij]k$' against `cdefg__gh__'
  Setting an EVAL scope, savestack=0
   2 <ab> <cdefg__gh_>    |  1:  ANYOF[bc]
   3 <abc> <defg__gh_>    | 10:  EXACT <d>
   4 <abcd> <efg__gh_>    | 12:  CURLYX[0] {1,32767}
   4 <abcd> <efg__gh_>    | 25:    WHILEM[1/1]
                              0 out of 1..32767  cc=3dcfca4
   4 <abcd> <efg__gh_>    | 14:      OPEN1
   4 <abcd> <efg__gh_>    | 16:      EXACT <e>
   5 <abcde> <fg__gh_>    | 18:      STAR
                           EXACT <f> can match 1 times out of 32767...
  Setting an EVAL scope, savestack=0
   6 <bcdef> <g__gh__>    | 21:        EXACT <g>
   7 <bcdefg> <__gh__>    | 23:        CLOSE1
   7 <bcdefg> <__gh__>    | 25:        WHILEM[1/1]
                                  1 out of 1..32767  cc=3dcfca4
  Setting an EVAL scope, savestack=9
   7 <bcdefg> <__gh__>    | 14:          OPEN1
   7 <bcdefg> <__gh__>    | 16:          EXACT <e>
                                    failed...
     restoring \1 to 4(4)..7
                                  failed, try continuation...
   7 <bcdefg> <__gh__>    | 26:          NOTHING
   7 <bcdefg> <__gh__>    | 27:          EXACT <h>
                                    failed...
                                  failed...
                                failed...
                              failed...
                            failed...
Match failed

The most significant information in the output is about the particular node of the compiled regex that is currently being tested against the target string. The format of these lines is

STRING-OFFSET <PRE-STRING> <POST-STRING> |ID: TYPE

The TYPE info is indented with respect to the backtracking level. Other incidental information appears interspersed within.


Your Turn

The developers, testers and documentation folk that bring you Base SAS Software are very excited about the potential of these new capabilities of the SAS System. You can send electronic mail to Base.Research@sas.com with your comments.