Perl regular expressions
Perl Regular Expressions are supported beginning with SAS®9.
Perl produces regexp debugging output when the -Dr switch is given on the command line or when use re 'debug' is used within a program. The DATA step provides this functionality with the PRXDEBUG call routine. CALL PRXDEBUG(1) turns debug output on and CALL PRXDEBUG(0) turns it off. All debug output is sent to the SAS log.
This paper presents an example of Perl debug output and describes it. All of this material is taken from the perldebguts man page, which is also available at http://search.cpan.org/~gsar/perl-5.6.1/pod/perldebguts.pod#Debugging_regular_expressions.
data _null_;
call prxdebug(1);
putlog 'PRXPARSE:';
re = prxparse('/[bc]d(ef*g)+h[ij]k$/');
putlog 'PRXMATCH:';
pos = prxmatch(re, 'abcdefg__gh__');
call prxdebug(0);
run;
Compiling REx `[bc]d(ef*g)+h[ij]k$'
size 41 first at 1
rarest char g at 0
rarest char d at 0
1: ANYOF[bc](10)
10: EXACT <d>(12)
12: CURLYX[0] {1,32767}(26)
14: OPEN1(16)
16: EXACT <e>(18)
18: STAR(21)
19: EXACT <f>(0)
21: EXACT <g>(23)
23: CLOSE1(25)
25: WHILEM[1/1](0)
26: NOTHING(27)
27: EXACT <h>(29)
29: ANYOF[ij](38)
38: EXACT <k>(40)
40: EOL(41)
41: END(0)
anchored `de' at 1 floating `gh' at 3..2147483647 (checking floating)
stclass `ANYOF[bc]' minlen 7
The first line shows the pre-compiled form of the regex. The second shows the size of the compiled form (in arbitrary units, usually 4-byte words) and the label id of the first node that does a match.
The last line (split into two lines above) contains optimizer
information. In the example shown, the optimizer found that the match
should contain a substring de at offset 1, plus substring gh
at some offset between 3 and infinity. Moreover, when checking for
these substrings (to abandon impossible matches quickly), Perl will check
for the substring gh before checking for the substring de. The
optimizer may also use the knowledge that the match starts (at the
first id) with a character class, and the match cannot be
shorter than 7 chars.
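The optimizer fields described above can be pulled out of a saved log with a few regular expressions. The following is a minimal Python sketch, not part of SAS or Perl; the function name parse_optimizer_info and the dictionary keys are hypothetical, and the patterns assume the field layout shown in the example output.

```python
import re

def parse_optimizer_info(text):
    """Extract optimizer fields from the last line(s) of the compile-time output.

    Assumes the layout shown in the example; fields that are absent are skipped.
    """
    info = {}
    # anchored `STRING' at POS (an optional $ may follow the closing quote)
    m = re.search(r"anchored `(.*?)'\$? at (\d+)", text)
    if m:
        info["anchored"] = (m.group(1), int(m.group(2)))
    # floating `STRING' at POS1..POS2
    m = re.search(r"floating `(.*?)'\$? at (\d+)\.\.(\d+)", text)
    if m:
        info["floating"] = (m.group(1), int(m.group(2)), int(m.group(3)))
    # which substring is checked first
    m = re.search(r"\(checking (floating|anchored)\)", text)
    if m:
        info["checked_first"] = m.group(1)
    # type of the first matching node
    m = re.search(r"stclass `(.*?)'", text)
    if m:
        info["stclass"] = m.group(1)
    # minimal length of a match
    m = re.search(r"minlen (\d+)", text)
    if m:
        info["minlen"] = int(m.group(1))
    return info

# The two optimizer lines from the example output.
log_tail = (
    "anchored `de' at 1 floating `gh' at 3..2147483647 (checking floating)\n"
    "stclass `ANYOF[bc]' minlen 7"
)
info = parse_optimizer_info(log_tail)
print(info)
```

Here 2147483647 is the value the log prints for "infinity" in the floating range.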
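The fields of interest that may appear in the last line are:
anchored STRING at POS, floating STRING at POS1..POS2: see above.
matching floating/anchored: which substring to check first.
minlen: the minimal length of the match.
stclass TYPE: the type of the first matching node.
noscan: do not scan for the found substrings.
isall: the optimizer information is all that the regex contains, so the regex engine does not need to be entered at all.
GPOS: set if the pattern contains \G.
plus: set if the pattern starts with a repeated character (as in x+y).
implicit: set if the pattern starts with .*.
with eval: set if the pattern contains eval-groups, such as (?{ code }) and (??{ code }).
anchored(TYPE): set if the pattern may match only at a handful of places, with TYPE being BOL, MBOL, or GPOS (node types documented in perldebguts).
If a substring is known to match at end-of-line only, it may be followed by $, as in floating `k'$.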
The optimizer-specific info is used to avoid entering the (slow) regex
engine on strings that definitely will not match. If the isall flag
is set, a call to the regex engine may be avoided even when the optimizer
has found an appropriate place for the match.
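The idea of rejecting hopeless strings before running the full engine can be illustrated outside Perl. The sketch below is Python and is only an analogy: it is not how Perl's optimizer is implemented, and the function name may_match is hypothetical. It hard-codes the facts reported for the example pattern (substring de, substring gh at least two characters after it, minimum length 7) as a cheap necessary condition.

```python
import re

# The example pattern from this paper, compiled with Python's re module.
PATTERN = re.compile(r"[bc]d(ef*g)+h[ij]k$")

def may_match(s):
    """Cheap necessary condition derived from the optimizer output.

    Returns False only when the full regex cannot possibly match.
    """
    if len(s) < 7:                 # minlen 7
        return False
    de_pos = s.find("de")          # anchored `de' at offset 1 of the match
    gh_pos = s.rfind("gh")         # floating `gh' at offset 3 or later
    if de_pos == -1 or gh_pos == -1:
        return False
    # gh must start at least 2 characters after de (offset 3 versus offset 1).
    return gh_pos >= de_pos + 2

def search(s):
    """Run the full regex only when the pre-check cannot rule the string out."""
    return PATTERN.search(s) if may_match(s) else None

print(may_match("xyz"))               # rejected without running the regex
print(may_match("abcdefg__gh__"))     # passes the pre-check ...
print(search("abcdefg__gh__"))        # ... but the full match still fails
print(search("bdefghik") is not None)
```

The string abcdefg__gh__ passes the pre-check yet fails the full match, which mirrors the run-time trace in this paper: the optimizer guesses a start position and the engine then reports that the match failed.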
The rest of the output contains the list of nodes of the compiled form of the regex. Each line has the format
id: TYPE OPTIONAL-INFO (next-id)
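A node line in this format can be split into its parts mechanically. The following Python sketch assumes the layout seen in the example listing, where the next-id follows the optional info directly in parentheses; the name parse_node and the dictionary keys are hypothetical.

```python
import re

# Assumed shape of one node line: "<id>: <TYPE><optional info>(<next-id>)".
NODE_LINE = re.compile(r"^\s*(\d+):\s*([A-Z]+\d*)\s*(.*?)\s*\((\d+)\)\s*$")

def parse_node(line):
    """Split one compiled-regex node line into id, type, info and next-id."""
    m = NODE_LINE.match(line)
    if m is None:
        return None  # not a node line (for example, the optimizer summary)
    node_id, node_type, info, next_id = m.groups()
    return {"id": int(node_id), "type": node_type,
            "info": info, "next": int(next_id)}

# A few node lines copied from the example listing.
listing = [
    "1: ANYOF[bc](10)",
    "12: CURLYX[0] {1,32767}(26)",
    "14: OPEN1(16)",
    "41: END(0)",
    "stclass `ANYOF[bc]' minlen 7",
]
nodes = [n for n in map(parse_node, listing) if n is not None]
for n in nodes:
    print(n)
```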
The possible node types, with short descriptions, are listed in the perldebguts man page.
Turning to run-time output: first of all, when doing a match, one may get no run-time output even if debugging is enabled. This means that the regex engine was never entered and that all of the job was therefore done by the optimizer.
Guessing start of match, REx `[bc]d(ef*g)+h[ij]k$' against `abcdefg__gh__'...
Found floating substr `gh' at offset 9...
Found anchored substr `de' at offset 3...
Starting position does not contradict /^/m...
Does not contradict STCLASS...
Guessed: match at offset 2
Matching REx `[bc]d(ef*g)+h[ij]k$' against `cdefg__gh__'
Setting an EVAL scope, savestack=0
2 <ab> <cdefg__gh_> | 1: ANYOF[bc]
3 <abc> <defg__gh_> | 10: EXACT <d>
4 <abcd> <efg__gh_> | 12: CURLYX[0] {1,32767}
4 <abcd> <efg__gh_> | 25: WHILEM[1/1]
0 out of 1..32767 cc=3dcfca4
4 <abcd> <efg__gh_> | 14: OPEN1
4 <abcd> <efg__gh_> | 16: EXACT <e>
5 <abcde> <fg__gh_> | 18: STAR
EXACT <f> can match 1 times out of 32767...
Setting an EVAL scope, savestack=0
6 <bcdef> <g__gh__> | 21: EXACT <g>
7 <bcdefg> <__gh__> | 23: CLOSE1
7 <bcdefg> <__gh__> | 25: WHILEM[1/1]
1 out of 1..32767 cc=3dcfca4
Setting an EVAL scope, savestack=9
7 <bcdefg> <__gh__> | 14: OPEN1
7 <bcdefg> <__gh__> | 16: EXACT <e>
failed...
restoring \1 to 4(4)..7
failed, try continuation...
7 <bcdefg> <__gh__> | 26: NOTHING
7 <bcdefg> <__gh__> | 27: EXACT <h>
failed...
failed...
failed...
failed...
failed...
Match failed
The most significant information in the output is about the particular node of the compiled regex that is currently being tested against the target string. The format of these lines is
STRING-OFFSET <PRE-STRING> <POST-STRING> |ID: TYPE
The TYPE info is indented with respect to the backtracking level. Other incidental information appears interspersed within.
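The node lines of a run-time trace can be separated from the incidental messages with a single pattern. This Python sketch assumes the line format given above and that the pre-string and post-string do not themselves contain the sequence "> <"; the name parse_step and the dictionary keys are hypothetical.

```python
import re

# STRING-OFFSET <PRE-STRING> <POST-STRING> |ID: TYPE
STEP_LINE = re.compile(r"^\s*(\d+)\s+<(.*?)>\s+<(.*?)>\s*\|\s*(\d+):\s*(.*)$")

def parse_step(line):
    """Parse one run-time trace line; return None for incidental messages."""
    m = STEP_LINE.match(line)
    if m is None:
        return None
    offset, pre, post, node_id, node = m.groups()
    return {"offset": int(offset), "pre": pre, "post": post,
            "id": int(node_id), "node": node.strip()}

# Lines copied from the example trace, including two incidental messages.
trace = [
    "2 <ab> <cdefg__gh_> | 1: ANYOF[bc]",
    "3 <abc> <defg__gh_> | 10: EXACT <d>",
    "4 <abcd> <efg__gh_> | 12: CURLYX[0] {1,32767}",
    "0 out of 1..32767 cc=3dcfca4",
    "failed...",
]
steps = [s for s in map(parse_step, trace) if s is not None]
visited = [s["id"] for s in steps]
print(visited)
```

Collecting the node ids in order, as in visited, gives a compact view of the path the engine took through the compiled regex.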