| SAS 8.2 XML engine
|
Using the XML LIBNAME engine to export an XML document is fairly straightforward. Using the engine to import an XML document is also fairly straightforward as long as the XML adheres to the required physical structure. If not, the XML document will need additional processing.
If you are running Release 8.2, you can download a hotfix that enables the XML LIBNAME engine to successfully import a wider variety of XML documents than with Release 8.2 as shipped.
NOTE: Complete documentation for the XML LIBNAME engine is available from the SAS System Help in Release 8.1 and Release 8.2 of SAS software. Or see XML Engine.
libname rss xml 'C:\My Documents\xml\rss.xml'; proc contents data=rss.channel; run;While the import process executes with no errors, the result is not desirable. Looking at the CONTENTS procedure output, the translation of the XML resulted in many more columns (SAS variables) than you would expect...or want.
* The XML document RSS.XML uses the XML format RSS (Rich Site Summary), which was designed by Netscape originally for exchange of content within the My Netscape Network (MNN) community. The RSS format has been widely adopted for sharing headlines and other Web content and is a good example of XML as a transmission format. For information on RSS, see Introduction to RSS.
Because RSS.XML is well-formed XML, it may be expected that SAS will successfully import the file. However, SAS requires a specific XML physical structure in order to identify columns of data from collections of rows. The XML in RSS.XML is incorrectly structured for SAS to interpret successfully.
<?xml version="1.0" ?>
<LIBRARY> -- root enclosing element
<STUDENTS> -- repeating element instance...one row
<ID> 0755 </ID>
<NAME> Brad Martin </NAME>
<ADDRESS> 1611 Glengreen </ADDRESS>
<CITY> Huntsville </CITY>
<STATE> Texas </STATE>
</STUDENTS>
<STUDENTS> -- repeating element instance...second row
<ID> 1522 </ID>
<NAME> Zac Harvell </NAME>
<ADDRESS> 11900 Glenda </ADDRESS>
<CITY> Houston </CITY>
<STATE> Texas </STATE>
</STUDENTS>
.
. more instances of <STUDENTS>
.
</LIBRARY>
This is what happens when SAS imports the above XML document:
libname test xml 'C:\My Documents\test\students.xml'; proc print data=test.students; run;
ID NAME ADDRESS CITY STATE 0755 Brad Martin 1611 Glengreen Huntsville Texas 1522 Zac Harvell 11900 Glenda Houston Texas . . . |
| This XML... | Results in... |
|---|---|
<rss version="0.91"> |
SAS recognizes the first instance tag as the root enclosing element, which is the document container. |
<channel> |
Starting with the second-level instance tag, SAS begins to scan for columns. |
<title>WriteTheWeb</title> <link>http://writetheweb.com/</link> <description>News for web users... </description> <language>en-us</language> <copyright>Copyright 2000...</copyright> <managingEditor>editor@...</managingEditor> <webMaster>webmaster@...</webMaster> |
SAS encounters the children nodes <TITLE>, <LINK>, <DESCRIPTION>, <LANGUAGE>, <COPYRIGHT>, <MANAGINGEDITOR>, and <WEBMASTER>, which become columns in the data set CHANNEL. |
<image> <title>WriteTheWeb</title> <url>http://writetheweb.com/...</url> <link>http://writetheweb.com</link> <width>88</width> <height>31</height> <description>News for ...</description> </image> |
Because the physical structure of the XML
is still within the second level instance,
SAS continues to scan for columns and
encounters the tag <IMAGE>. To SAS, it's a column.
Then, SAS encounters another tag with the name <TITLE>. Already the physical structure of the XML is causing SAS to translate an increasing number of columns, but now there is further complication due to the recurring pattern of the tag names like <TITLE>, <LINK>, and <DESCRIPTION>. In order to create unique SAS names for the columns, SAS adds incrementing numbers to the end of the tag names, resulting in TITLE0, TITLE1, and so on. The result, as shown in CONTENTS procedure output, is a lot of columns named TITLE0 through TITLE7, DESCRIPTION0 through DESCRIPTION7, and so on. (Note that for this PROC CONTENTS output, the VARNUM option is used to list the column names in the order in which they were defined.) |
To successfully import the XML document, the hierarchical structure must be flattened to a rectangular, two-level organization.
For example, when SAS imports the following correctly-structured XML document, it is recognized as containing two SAS data sets:
<?xml version="1.0" ?>
<CLIMATE>
<HIGHTEMP>
<PLACE> Libya </PLACE>
<DATE> 1922-09-13 </DATE>
<DEGREE_F> 136 </DEGREE_F>
<DEGREE_C> 58 </DEGREE_C>
</HIGHTEMP>
.
. more instances of <HIGHTEMP>
.
<LOWTEMP>
<PLACE> Antarctica </PLACE>
<DATE> 1983-07-21 </DATE>
<DEGREE_F> -129 </DEGREE_F>
<DEGREE_C> -89 </DEGREE_C>
</LOWTEMP>
.
. more instances of <LOWTEMP>
.
</CLIMATE>
This is what happens when SAS imports the above XML document:
TIP: Use the DATASETS procedure to ensure that the import result is what you expect. For example, these SAS statements result in the output that follows:
libname climate xml 'C:\My Documents\xml\climate.xml'; proc datasets library=climate;
-----Directory-----
Libref: CLIMATE
Engine: XML
Physical Name: C:\My Documents\xml\climate.xml
# Name Memtype
-------------------
1 HIGHTEMP DATA
2 LOWTEMP DATA
|
XSL (Extensible Stylesheet Language) is a markup language designed for creating stylesheets. XSL specifies the rendering of an XML document using the expression language it defines for selecting and accessing elements and data (Xpath) in an XML document. The stylesheet contains a set of rules (XSLT) for transforming individual XML entities. Transformation is achieved by associating patterns (located within nodes) with templates (containing actions).
An XSL stylesheet transforms the original XML document into the correctly-structured XML document by transforming the hierarchical structure into a rectangular, two-level organization. Here's an example to show the steps in this transformation process.
NOTE: This document does not attempt to teach XSL. To learn more about XSL and XSLT, information is available, for example, on the Internet. XSL is developed, defined, and maintained by the World Wide Web Consortium (W3C), which provides complete explanation and syntax for XSL and XSLT at http://www.w3.org/Style/XSL/. You will also need an XSL processor. A couple of choices include
| Stage | Example |
| 1. You create an XSL stylesheet. | RSS2SAS.XSL |
| 2. You use an XSL processor to apply the stylesheet to the XML document to be transformed. For example, you could use the XT processor by James Clark mentioned above. |
xt rss.xml rss2sas.xsl rsssas.xml
TIP: You can also invoke an XSL processor from SAS using the 'X' command. However, before you process an XSL stylesheet from SAS, be sure that there are no errors.
|
| 3. The XSL processor transforms the original XML document to a new XML document using the rules in the stylesheet. | RSSSAS.XML |
| 4. You import the transformed XML document. |
libname rss xml 'C:\My Documents\xml\rsssas.xml'; proc contents data=rss.item; run; proc print data=rss.item; run; |
| 5. SAS successfully imports the transformed XML document. | CONTENTS procedure output and PRINT procedure output. |