SAS Institute. The Power to Know

FOCUS AREAS

SAS 8.2 XML engine

Base SAS

Why Can't SAS Import My XML Document?

Contents


Introduction

Starting with Release 8.1, the SAS XML LIBNAME engine enables you to

Using the XML LIBNAME engine to export an XML document is fairly straightforward. Using the engine to import an XML document is also fairly straightforward as long as the XML adheres to the required physical structure. If not, the XML document will need additional processing.

If you are running Release 8.2, you can download a hotfix that enables the XML LIBNAME engine to successfully import a wider variety of XML documents than with Release 8.2 as shipped.

NOTE: Complete documentation for the XML LIBNAME engine is available from the SAS System Help in Release 8.1 and Release 8.2 of SAS software. Or see XML Engine.


Abstract

This document explains the XML physical structure that SAS requires in order to successfully import an XML document, and shows you one way to transform an XML document into the correct physical structure.


Importing an XML Document

The following SAS statements attempt to import the XML document RSS.XML *:

   libname rss xml 'C:\My Documents\xml\rss.xml';

   proc contents data=rss.channel;
   run;
While the import process executes with no errors, the result is not desirable. Looking at the CONTENTS procedure output, the translation of the XML resulted in many more columns (SAS variables) than you would expect...or want.

* The XML document RSS.XML uses the XML format RSS (Rich Site Summary), which was designed by Netscape originally for exchange of content within the My Netscape Network (MNN) community. The RSS format has been widely adopted for sharing headlines and other Web content and is a good example of XML as a transmission format. For information on RSS, see Introduction to RSS.


What's Wrong with the XML Document?

If you’ve experienced unexpected results from importing an XML document, your question was more likely, "What’s wrong with SAS?"

Because RSS.XML is well-formed XML, it may be expected that SAS will successfully import the file. However, SAS requires a specific XML physical structure in order to identify columns of data from collections of rows. The XML in RSS.XML is incorrectly structured for SAS to interpret successfully.

What is the Required Physical Structure?

For an XML document to be successfully imported, the requirements for well-formed XML must translate to SAS as follows: Here's an example of an XML document that illustrates the physical structure that SAS requires:

<?xml version="1.0" ?>
<LIBRARY>  -- root enclosing element 
   <STUDENTS>  -- repeating element instance...one row 
      <ID> 0755 </ID>
      <NAME> Brad Martin </NAME>
      <ADDRESS> 1611 Glengreen </ADDRESS>
      <CITY> Huntsville </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>

   <STUDENTS>  -- repeating element instance...second row 
      <ID> 1522 </ID>
      <NAME> Zac Harvell </NAME>
      <ADDRESS> 11900 Glenda </ADDRESS>
      <CITY> Houston </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>
.
.  more instances of <STUDENTS> 
.
</LIBRARY>

This is what happens when SAS imports the above XML document:

  1. SAS goes to the second-level instance tag, which is <STUDENTS>, translates it as the data set name, and begins scanning the elements that are nested (contained) between the beginning <STUDENTS> and ending </STUDENTS> instance tags, looking for columns.

  2. Because the instance tags <ID>, <NAME>, <ADDRESS>, <CITY>, and <STATE> are all contained within the <STUDENTS> beginning and ending instance tags, SAS interprets them as columns. The individual instance tag names become the data set column names. The repeating element instances are translated into a collection of rows with that constant set of columns.
These SAS statements result in the following output:

   libname test xml 'C:\My Documents\test\students.xml';

   proc print data=test.students;
   run;

ID       NAME            ADDRESS            CITY           STATE

0755     Brad Martin     1611 Glengreen     Huntsville     Texas
1522     Zac Harvell     11900 Glenda       Houston        Texas
.
.
.

What's Wrong with RSS.XML's Physical Structure?

Looking again at the XML document RSS.XML, you'll see that the physical structure of the XML does not represent a solely rectangular organization...it is hierarchical...which results in the following translation process by SAS. (In the following table, some of the XML markup is not shown.)

This XML... Results in...
<rss version="0.91">
SAS recognizes the first instance tag as the root enclosing element, which is the document container.
<channel>
Starting with the second-level instance tag, SAS begins to scan for columns.
<title>WriteTheWeb</title>
<link>http://writetheweb.com/</link>
<description>News for web users...
</description>
<language>en-us</language>
<copyright>Copyright 2000...</copyright>
<managingEditor>editor@...</managingEditor>
<webMaster>webmaster@...</webMaster>
SAS encounters the children nodes <TITLE>, <LINK>, <DESCRIPTION>, <LANGUAGE>, <COPYRIGHT>, <MANAGINGEDITOR>, and <WEBMASTER>, which become columns in the data set CHANNEL.
<image>
<title>WriteTheWeb</title>
<url>http://writetheweb.com/...</url>
<link>http://writetheweb.com</link>
<width>88</width>
<height>31</height>
<description>News for ...</description>
</image>
Because the physical structure of the XML is still within the second level instance, SAS continues to scan for columns and encounters the tag <IMAGE>. To SAS, it's a column.

Then, SAS encounters another tag with the name <TITLE>. Already the physical structure of the XML is causing SAS to translate an increasing number of columns, but now there is further complication due to the recurring pattern of the tag names like <TITLE>, <LINK>, and <DESCRIPTION>.

In order to create unique SAS names for the columns, SAS adds incrementing numbers to the end of the tag names, resulting in TITLE0, TITLE1, and so on.

The result, as shown in CONTENTS procedure output, is a lot of columns named TITLE0 through TITLE7, DESCRIPTION0 through DESCRIPTION7, and so on. (Note that for this PROC CONTENTS output, the VARNUM option is used to list the column names in the order in which they were defined.)

To successfully import the XML document, the hierarchical structure must be flattened to a rectangular, two-level organization.

Why is the Physical Structure Required?

Well-formed XML is determined by structure…not content. Therefore, while SAS can assume that the XML document is well-formed XML, SAS cannot assume that the root element encloses only instances of a single node element, that is, only a single data set. Therefore, SAS has to account for the possibility of multiple nodes, that is, multiple SAS data sets.

For example, when SAS imports the following correctly-structured XML document, it is recognized as containing two SAS data sets:

<?xml version="1.0" ?>
<CLIMATE>
   <HIGHTEMP>
      <PLACE> Libya </PLACE>
      <DATE> 1922-09-13 </DATE>
      <DEGREE_F> 136 </DEGREE_F>
      <DEGREE_C> 58 </DEGREE_C>
   </HIGHTEMP>
.
.  more instances of <HIGHTEMP> 
.
   <LOWTEMP>
      <PLACE> Antarctica </PLACE>
      <DATE> 1983-07-21 </DATE>
      <DEGREE_F> -129 </DEGREE_F>
      <DEGREE_C> -89 </DEGREE_C>
   </LOWTEMP>
.
.  more instances of <LOWTEMP> 
.
</CLIMATE>

This is what happens when SAS imports the above XML document:

  1. SAS recognizes the first instance tag <CLIMATE> as the root enclosing element, which is the container for the document…the SAS library.

  2. Starting with the second-level instance tag, which is <HIGHTEMP>, SAS uses the repeating element instances as a collection of rows with a constant set of columns.

  3. When the second-level instance tag changes, SAS interprets that change as a different SAS data set.
The result is two SAS data sets: HIGHTEMP and LOWTEMP. Both happen to have the same columns, but of course, different data.

TIP: Use the DATASETS procedure to ensure that the import result is what you expect. For example, these SAS statements result in the output that follows:

   libname climate xml 'C:\My Documents\xml\climate.xml';

   proc datasets library=climate;

                            -----Directory-----

                Libref:        CLIMATE
                Engine:        XML
                Physical Name: C:\My Documents\xml\climate.xml

                             #  Name     Memtype
                             -------------------
                             1 HIGHTEMP  DATA
                             2 LOWTEMP   DATA


What Can You Do?

If your XML document does not import successfully, you must transform the XML into the physical structure that is required by SAS. One method to achieve the XML transformation is to use XSL.

XSL (Extensible Stylesheet Language) is a markup language designed for creating stylesheets. XSL specifies the rendering of an XML document using the expression language it defines for selecting and accessing elements and data (Xpath) in an XML document. The stylesheet contains a set of rules (XSLT) for transforming individual XML entities. Transformation is achieved by associating patterns (located within nodes) with templates (containing actions).

An XSL stylesheet transforms the original XML document into the correctly-structured XML document by transforming the hierarchical structure into a rectangular, two-level organization. Here's an example to show the steps in this transformation process.

NOTE: This document does not attempt to teach XSL. To learn more about XSL and XSLT, information is available, for example, on the Internet. XSL is developed, defined, and maintained by the World Wide Web Consortium (W3C), which provides complete explanation and syntax for XSL and XSLT at http://www.w3.org/Style/XSL/. You will also need an XSL processor. A couple of choices include

Stage Example
1. You create an XSL stylesheet. RSS2SAS.XSL
2. You use an XSL processor to apply the stylesheet to the XML document to be transformed. For example, you could use the XT processor by James Clark mentioned above. xt rss.xml rss2sas.xsl rsssas.xml

TIP: You can also invoke an XSL processor from SAS using the 'X' command. However, before you process an XSL stylesheet from SAS, be sure that there are no errors.

x 'xt rss.xml rss2sas.xsl rsssas.xml';

3. The XSL processor transforms the original XML document to a new XML document using the rules in the stylesheet. RSSSAS.XML
4. You import the transformed XML document.
libname rss xml 'C:\My Documents\xml\rsssas.xml';

proc contents data=rss.item;
run;

proc print data=rss.item;
run;
5. SAS successfully imports the transformed XML document. CONTENTS procedure output and PRINT procedure output.


Your Turn

Questions or comments? Send electronic mail to XMLEngine@sas.com with your comments.