Importing XML Documents

Importing an XML Document with Non-Escaped Character Data

The W3C XML specification (section 4.6, Predefined Entities) states that in character data, certain characters, such as the left angle bracket (<), the ampersand (&), and the apostrophe ('), must be escaped by using character references or predefined entity references such as &lt;, &amp;, and &apos;. For example, so that attribute values can contain both single and double quotation marks, the apostrophe or single quotation mark (') can be represented as &apos; and the double quotation mark (") as &quot;.

To import an XML document that contains non-escaped characters, specify the LIBNAME statement option XMLPROCESS=PERMIT so that the XML engine accepts character data that does not conform to W3C specifications. With this option, non-escaped characters such as the apostrophe, the double quotation mark, and the ampersand are accepted in character data.

Note:   Use XMLPROCESS=PERMIT cautiously. If an XML document contains non-escaped characters, its content is not standard XML. The option is provided for convenience, not to encourage invalid XML markup.

This example imports the following XML document named Permit.XML, which contains non-escaped character data:

<?xml version="1.0" ?>
      <status>proper escape sequence</status>
       <status>unescaped character in CDATA</status>
       <ampersand><![CDATA[Abbott & Costello]]></ampersand>
       <squote><![CDATA[Logan's Run]]></squote>
       <dquote><![CDATA[This is "realworld" stuff]]></dquote>
       <less><![CDATA[ e < pi ]]></less>
       <greater><![CDATA[ pen > sword ]]></greater>
      <status>single unescaped character</status>
      <status>unescaped character in string</status>
      <ampersand>Dunn & Bradstreet</ampersand>
      <squote>Isn't this silly?</squote>
      <dquote>Quoth the raven, "Nevermore!"</dquote>

First, with the default XML engine behavior, which expects XML markup to conform to W3C specifications, the following SAS program imports only the first two observations, which contain valid XML markup. It reports an error for the last two records, which contain non-escaped characters:

libname permit xml 'c:\My Documents\XML\permit.xml';

proc print data=permit.chars;
run;

SAS Log Output

ERROR: There is an illegal character in the entity name.
       encountered during XMLInput parsing
       occurred at or near line 24, column 22
NOTE: There were 2 observations read from the data set PERMIT.CHARS.

Specifying the LIBNAME statement option XMLPROCESS=PERMIT enables the XML engine to import the XML document:

libname permit xml 'c:\My Documents\XML\permit.xml' xmlprocess=permit;

proc print data=permit.chars;
run;


                                The SAS System                                     1

Obs    GREATER         LESS       DQUOTE                           SQUOTE

  1    >               <          "                                '
  2    pen > sword     e < pi     This is "realworld" stuff        Logan's Run
  3                               "                                '
  4                               Quoth the raven, "Nevermore!"    Isn't this silly?

Obs    AMPERSAND            STATUS                       ACCEPT

  1    &                    proper escape sequence         OK
  2    Abbott & Costello    unescaped character in CDATA   OK
  3    &                    single unescaped character     NO
  4    Dunn & Bradstreet    unescaped character in string  NO
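
Because XMLPROCESS=PERMIT is a convenience rather than a fix for the source document, one possible follow-up is to write the imported data back out through the XML engine, so that later imports might not need the option. The following is a minimal sketch, not a documented procedure. It assumes that the XML engine escapes special characters when it exports character data, and the libref FIXED and the file name permit_fixed.xml are hypothetical names chosen for illustration:

```sas
/* Import with PERMIT so that non-escaped character data is accepted. */
libname permit xml 'c:\My Documents\XML\permit.xml' xmlprocess=permit;

/* Hypothetical output location for the re-exported copy (assumed name). */
libname fixed xml 'c:\My Documents\XML\permit_fixed.xml';

/* Copy the observations; the XML engine writes the new document. */
data fixed.chars;
   set permit.chars;
run;

/* Release both librefs when the copy is complete. */
libname permit clear;
libname fixed clear;
```

If the assumption holds, the new file can be imported with the default engine behavior. Inspect the exported markup before relying on it.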
