Importing XML Documents
W3C specifications (section 4.6 Predefined Entities) state that for character data, certain characters such as the left angle bracket (<), the ampersand (&), and the apostrophe (') must be escaped using character references or strings like &lt;, &amp;, and &apos;. For example, to allow attribute values to contain both single and double quotation marks, the apostrophe or single-quotation character (') can be represented as &apos; and the double-quotation character (") as &quot;.
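The escaping described above can be sketched in a short DATA step. This is an illustration only, not part of the XML engine. It assumes that the general-purpose HTMLENCODE function is available in your SAS release and accepts the option keywords shown. The variable names raw and escaped are hypothetical.

```sas
/* Sketch: escape the five predefined XML characters in a string. */
/* Assumes HTMLENCODE supports the amp, lt, gt, apos, and quot    */
/* option keywords in your SAS release.                           */
data _null_;
   length raw escaped $ 200;

   /* A value containing an ampersand, an apostrophe, and double  */
   /* quotation marks. The doubled '' is one literal apostrophe.  */
   raw = 'Dunn & Bradstreet said "Isn''t this silly?"';

   /* Replace each special character with its entity reference.   */
   escaped = htmlencode(raw, 'amp lt gt apos quot');

   /* Write both values to the SAS log for comparison.            */
   put raw=;
   put escaped=;
run;
```

Comparing the two values in the log shows which characters were replaced by entity references.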
To import an XML document that contains non-escaped characters, you can specify the LIBNAME statement option XMLPROCESS=PERMIT in order for the XML engine to accept character data that does not conform to W3C specifications. That is, non-escaped characters like the apostrophe, double quotation marks, and the ampersand are accepted in character data.
Note: Use XMLPROCESS=PERMIT cautiously. If an XML document contains non-escaped characters, the content is not standard XML construction. The option is provided for convenience, not to encourage invalid XML markup.
This example imports the following XML document named Permit.XML, which contains non-escaped character data:
   <?xml version="1.0" ?>
   <PERMIT>
      <CHARS>
         <accept>OK</accept>
         <status>proper escape sequence</status>
         <ampersand>&amp;</ampersand>
         <squote>&apos;</squote>
         <dquote>&quot;</dquote>
         <less>&lt;</less>
         <greater>&gt;</greater>
      </CHARS>
      <CHARS>
         <accept>OK</accept>
         <status>unescaped character in CDATA</status>
         <ampersand><![CDATA[Abbott & Costello]]></ampersand>
         <squote><![CDATA[Logan's Run]]></squote>
         <dquote><![CDATA[This is "realworld" stuff]]></dquote>
         <less><![CDATA[ e < pi ]]></less>
         <greater><![CDATA[ pen > sword ]]></greater>
      </CHARS>
      <CHARS>
         <accept>NO</accept>
         <status>single unescaped character</status>
         <ampersand>&</ampersand>
         <squote>'</squote>
         <dquote>"</dquote>
         <less></less>
         <greater></greater>
      </CHARS>
      <CHARS>
         <accept>NO</accept>
         <status>unescaped character in string</status>
         <ampersand>Dunn & Bradstreet</ampersand>
         <squote>Isn't this silly?</squote>
         <dquote>Quoth the raven, "Nevermore!"</dquote>
         <less></less>
         <greater></greater>
      </CHARS>
   </PERMIT>
First, using the default XML engine behavior, which expects XML markup to conform to W3C specifications, the following SAS program imports only the first two observations, which contain valid XML markup, and produces errors for the last two records, which contain non-escaped characters:
   libname permit xml 'c:\My Documents\XML\permit.xml';

   proc print data=permit.chars;
   run;

   ERROR: There is an illegal character in the entity name.
   encountered during XMLInput parsing
   occurred at or near line 24, column 22
   NOTE: There were 2 observations read from the data set PERMIT.CHARS.
Specifying the LIBNAME statement option XMLPROCESS=PERMIT enables the XML engine to import the XML document:
   libname permit xml 'c:\My Documents\XML\permit.xml' xmlprocess=permit;

   proc print data=permit.chars;
   run;

                                The SAS System                                1

   Obs    GREATER        LESS      DQUOTE                           SQUOTE

    1     >              <         "                                '
    2     pen > sword    e < pi    This is "realworld" stuff        Logan's Run
    3                              "                                '
    4                              Quoth the raven, "Nevermore!"    Isn't this silly?

   Obs    AMPERSAND            STATUS                           ACCEPT

    1     &                    proper escape sequence           OK
    2     Abbott & Costello    unescaped character in CDATA     OK
    3     &                    single unescaped character       NO
    4     Dunn & Bradstreet    unescaped character in string    NO
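Because XMLPROCESS=PERMIT is a convenience for reading nonconforming markup, one possible follow-up is to write the imported data back out through the XML engine, so that later consumers receive a document produced by the engine itself. The following is a minimal sketch, not a documented procedure. It assumes that the engine escapes special characters when it exports. The libref fixed and the output file permit_fixed.xml are hypothetical names chosen for illustration.

```sas
/* Read the nonconforming document, accepting non-escaped characters. */
libname permit xml 'c:\My Documents\XML\permit.xml' xmlprocess=permit;

/* Hypothetical output location for the regenerated document. */
libname fixed xml 'c:\My Documents\XML\permit_fixed.xml';

/* Copy the imported observations to the output library. The XML  */
/* engine generates the markup for the new document.              */
data fixed.chars;
   set permit.chars;
run;

/* Release both librefs so that the output file is closed. */
libname permit clear;
libname fixed clear;
```

Before relying on the regenerated file, it is worth importing it with the default engine behavior (without XMLPROCESS=PERMIT) to confirm that all four observations are now read without errors.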
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.