SAS Institute. The Power to Know

Insider's View

Perl Regular Expressions and SAS®9

In one development project at SAS, we're using several third-party test suites to exercise the new components that we're creating and to make sure that SAS conforms to standard functional specifications such as SQL:1999 and JDBC 3.0. While these tools are great for verifying how closely our software adheres to specifications, each tool reports its results separately using various formats and structures. In addition, some of these tools write result logs to multiple files, and some results are written to single massive logs that contain more detail than can reasonably be reviewed by one person. To enable us to more easily assess how well the software is performing, the results need to be summarized and collected into a single report.

The Perl regular expression (PRX) functions and CALL routines that were introduced in SAS®9 provide convenient and powerful mechanisms for weeding through all the information that's generated by test tools and for extracting the essential information from the test results that each tool delivers. The data that's culled from the test logs can then be summarized, and reports can be prepared by using other SAS tools, such as the SUMMARY, FREQ, and REPORT procedures.

I use the new PRX functions and CALL routines extensively to do this type of summary reporting in the SAS jobs that I create. (To be honest, I'm addicted to them. I use them just about every time that I need to scan a string for the occurrence of a word or a phrase, and when I need to extract a substring that might not occur at consistent, well-defined offsets within a character variable.) Perl regular expressions provide a flexible syntax for expressing character patterns. I use Perl regular expressions with a DATA step to read the test log files and to scan lines for strings that represent the results of test runs. I also use Perl regular expressions to search for individual test-case results, for summary lines that occur at points throughout the log files, and for error messages or unexpected irregularities in the test logs that might indicate test failures. Using Perl regular expressions in this way makes my SAS code simple, readable, and easy to maintain.
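To give a feel for the approach described above, here is a minimal sketch rather than code from the actual project. The log layout (lines of the form "TEST name : PASSED" or "TEST name : FAILED"), the test names, and the data set and variable names are all assumptions made up for illustration. The sketch compiles a pattern once with PRXPARSE, tests each line with PRXMATCH, and pulls out the captured pieces with PRXPOSN.

```sas
/* Hypothetical log layout: "TEST <name> : PASSED|FAILED".      */
/* The sample lines below are invented for illustration only.  */
data results;
   length test_name $32 status $6;
   retain re;

   /* Compile the pattern once and keep its identifier. */
   if _n_ = 1 then
      re = prxparse('/^TEST\s+(\w+)\s*:\s*(PASSED|FAILED)/');

   infile datalines truncover;
   input line $char80.;

   /* PRXMATCH returns the match position, or 0 when there is no match. */
   if prxmatch(re, line) then do;
      test_name = prxposn(re, 1, line);   /* first capture buffer  */
      status    = prxposn(re, 2, line);   /* second capture buffer */
      output;
   end;

   keep test_name status;
   datalines;
TEST sql_join_01 : PASSED
INFO connection opened
TEST sql_join_02 : FAILED
TEST jdbc_meta_01 : PASSED
;
run;

/* Summarize the extracted results. */
proc freq data=results;
   tables status;
run;
```

Lines that don't match the pattern (such as the INFO line) are simply skipped, so the same step can be pointed at a noisy log without extra filtering logic.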

Names of all the PRX functions and CALL routines in Base SAS begin with the prefix PRX and are documented in "SAS Language Reference: Dictionary" in SAS OnlineDoc 9.1.2. If you've done any Perl programming in the past, you might already be familiar with Perl regular expression syntax. These DATA step functions support standard Perl regular expression syntax and semantics, so it will be easy for you to start using them.

If you're not already familiar with Perl regular expression syntax, take some time to study how Perl regular expressions work. There are many resources available. One thorough reference is "Mastering Regular Expressions" by Jeffrey Friedl (http://regex.info). If you're like me, you'll find that writing regular expressions is as much an art as a science. I recommend starting with simple cases and playing with the examples that you find in the SAS documentation and in other references that you use. When I begin to work with new technology that seems particularly foreign to me (and I'll admit that Perl regular expressions can seem foreign!), I like to experiment and keep a notebook of examples that are typical of the situations that I have to deal with (for example, extracting dates and times).
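As an example of the kind of notebook entry I have in mind, here is a small sketch for pulling a date and time out of a line of text. The timestamp layout (yyyy-mm-dd hh:mm:ss), the sample line, and the variable names are assumptions chosen for illustration. Your logs may well use a different layout, in which case the pattern would need to change.

```sas
/* Assumes a timestamp of the form yyyy-mm-dd hh:mm:ss somewhere in the line. */
/* The sample line is invented for illustration only.                        */
data _null_;
   length line $80;
   retain re;

   /* Six capture buffers: year, month, day, hour, minute, second. */
   if _n_ = 1 then
      re = prxparse('/(\d{4})-(\d{2})-(\d{2})\s+(\d{2}):(\d{2}):(\d{2})/');

   line = 'Run finished at 2004-03-15 14:07:52 with 3 failures';

   if prxmatch(re, line) then do;
      /* Convert the captured text to numeric SAS date and time values. */
      run_date = mdy(input(prxposn(re, 2, line), 2.),
                     input(prxposn(re, 3, line), 2.),
                     input(prxposn(re, 1, line), 4.));
      run_time = hms(input(prxposn(re, 4, line), 2.),
                     input(prxposn(re, 5, line), 2.),
                     input(prxposn(re, 6, line), 2.));
      put run_date= date9. run_time= time8.;
   end;
run;
```

Because the pattern isn't anchored to the start of the line, it finds the timestamp wherever it happens to appear, which is exactly the situation where fixed column offsets break down.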

While these new DATA step functions and CALL routines are pretty easy to understand and use, Perl regular expression syntax is rich and full-featured and can sometimes be confusing when you first start to use it. Be patient and take it one step at a time. Once you begin to master it, I think you'll find it a powerful addition to your toolkit.

About the Author

David Shamlin joined SAS Institute in 1987 as a member of the VMS Host group, where he helped develop low-level file systems. He also did a tour of duty with the SAS I/O development team, becoming a pioneer in industry-standard data access interfaces to SAS data stores. David is now an R&D Director for Base Table Services, where he leads the development of fundamental SAS technology related to the DATA step, Base PROCs, the LIBNAME supervisor, data set and catalog I/O, SAS/SHARE, and other client/server components related to SAS data access.