SAS Institute. The Power to Know

Base SAS

Importing an XML Document with Non-Escaped Character Data

The LIBNAME statement option XMLPROCESS= determines how the XML engine processes character data that does not conform to W3C specifications. W3C specifications (section 4.6 Predefined Entities) state that for character data, certain characters must be escaped using character references or strings.

For example, the left angle bracket (<), the ampersand (&), and the apostrophe (') can be escaped by using the following character strings, respectively:

   &lt;
   &amp;
   &apos;

As another example, to allow attribute values to contain both single and double quotation marks, the apostrophe or single-quotation character (') and the double-quotation character (") can be represented as the following, respectively:

   &apos;
   &quot;
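
The same escaping can be produced programmatically before a document ever reaches the XML engine. The following is a minimal sketch in Python (not part of SAS), assuming the standard-library module xml.sax.saxutils. The helper names escape_xml_text and unescape_xml_text are hypothetical and chosen only for illustration:

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, <, and > on its own; the two quotation
# characters are supplied as extra mappings.
QUOTES = {"'": "&apos;", '"': "&quot;"}
UNQUOTES = {"&apos;": "'", "&quot;": '"'}


def escape_xml_text(text):
    """Replace the five predefined-entity characters with their escaped forms."""
    return escape(text, QUOTES)


def unescape_xml_text(text):
    """Reverse escape_xml_text() (hypothetical helper, for illustration only)."""
    return unescape(text, UNQUOTES)


if __name__ == "__main__":
    samples = (
        "Dunn & Bradstreet",
        "Isn't this silly?",
        'Quoth the raven, "Nevermore!"',
    )
    for raw in samples:
        print(escape_xml_text(raw))
```

The ampersand is replaced first when escaping and restored last when unescaping, so strings that already contain an entity such as &lt; survive a round trip unchanged.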

Characters that do not conform to the above W3C specifications are referred to as non-escaped. To import an XML document that contains non-escaped characters such as the apostrophe, the double quotation mark, and the ampersand, specify the LIBNAME statement option XMLPROCESS=RELAX.

Use XMLPROCESS=RELAX cautiously. If an XML document contains non-escaped characters, its content is not standard XML. The option is intended as a convenience, not as encouragement to produce invalid XML.
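
To find out whether a document would need XMLPROCESS=RELAX at all, it can help to run it through a conforming parser first. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree as a stand-in for a strict parser. It is not the SAS XML engine, and its error messages differ from those in the SAS log. The function name is_well_formed is hypothetical:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a conforming parser accepts xml_text, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# An ampersand written with its predefined entity.
escaped = "<ampersand>Dunn &amp; Bradstreet</ampersand>"
# A raw ampersand protected by a CDATA section.
in_cdata = "<ampersand><![CDATA[Abbott & Costello]]></ampersand>"
# A raw ampersand in ordinary character data (non-escaped).
bare = "<ampersand>Dunn & Bradstreet</ampersand>"

if __name__ == "__main__":
    for label, doc in (("escaped", escaped), ("CDATA", in_cdata), ("bare", bare)):
        print(label, is_well_formed(doc))
```

A document that fails such a check is usually better repaired at its source, by escaping the character or wrapping the text in a CDATA section, than imported with XMLPROCESS=RELAX.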

This example imports the following XML document named Relax.XML, which contains non-escaped character data:

<?xml version="1.0" ?>
<RELAX>
   <CHARS>
      <accept>OK</accept>
      <status>proper escape sequence</status>
      <ampersand>&amp;</ampersand>
      <squote>&apos;</squote>
      <dquote>&quot;</dquote>
      <less>&lt;</less>
      <greater>&gt;</greater>
   </CHARS>
   <CHARS>
      <accept>OK</accept>
      <status>unescaped character in CDATA</status>
      <ampersand><![CDATA[Abbott & Costello]]></ampersand>
      <squote><![CDATA[Logan's Run]]></squote>
      <dquote><![CDATA[This is "realworld" stuff]]></dquote>
      <less><![CDATA[e < pi]]></less>
      <greater><![CDATA[pen > sword]]></greater>
   </CHARS>
   <CHARS>
      <accept>NO</accept>
      <status>single unescaped character</status>
      <ampersand>&</ampersand>
      <squote>'</squote>
      <dquote>"</dquote>
      <!-- purposely left out the less tag here -->
      <greater/>
   </CHARS>
   <CHARS>
      <accept>NO</accept>
      <status>unescaped character in string</status>
      <ampersand>Dunn & Bradstreet</ampersand>
      <squote>Isn't this silly?</squote>
      <dquote>Quoth the raven, "Nevermore!"</dquote>
      <less></less>
      <!-- purposely left out the greater tag here -->
   </CHARS>
</RELAX>

Default Usage

With the default XML engine behavior, which expects XML markup to conform to W3C specifications, the following SAS program imports only the first two observations, which contain valid XML markup. Processing stops with an error at the third record, so the last two records, which contain non-escaped characters, are not imported:

   libname relax xml 'c:\My Documents\XML\relax.xml';

   proc print data=relax.chars;
   run;

SAS Log Output
ERROR: There is an illegal character in the entity name.
       encountered during XMLInput parsing
       occurred at or near line 24, column 22
NOTE: There were 2 observations read from the data set RELAX.CHARS.

Specifying XMLPROCESS=RELAX

Specifying the LIBNAME statement option XMLPROCESS=RELAX enables the XML engine to import the entire XML document, including the records that contain non-escaped characters:

   libname relax xml 'c:\My Documents\XML\relax.xml' xmlprocess=relax;

   proc print data=relax.chars;
   run;

PROC PRINT Output
                                The SAS System                                     1

Obs    GREATER         LESS       DQUOTE                           SQUOTE

  1    >               <          "                                '
  2    pen > sword     e < pi     This is "realworld" stuff        Logan's Run
  3                               "                                '
  4                               Quoth the raven, "Nevermore!"    Isn't this silly?

Obs    AMPERSAND            STATUS                       ACCEPT

  1    &                    proper escape sequence         OK
  2    Abbott & Costello    unescaped character in CDATA   OK
  3    &                    single unescaped character     NO
  4    Dunn & Bradstreet    unescaped character in string  NO