SAS Institute. The Power to Know

SAS(R) 9.2 XML LIBNAME Engine: User's Guide

Importing XML Documents Using an XMLMap

Understanding the Required Physical Structure for an XML Document to Be Imported Using the GENERIC Markup Type


What Is the Required Physical Structure?

For an XML document to be imported successfully, it must be well-formed XML, and its structure must translate to SAS as follows:

  • The root-enclosing element (top-level node) of an XML document is the document container. For SAS, it is like the SAS library.

  • The nested elements (repeating element instances) that occur within the container begin with the second-level instance tag.

  • The repeating element instances must represent a rectangular organization. For a SAS data set, they determine the observation boundary that becomes a collection of rows with a constant set of columns.

Here is an example of an XML document that illustrates the physical structure that is required:

<?xml version="1.0" encoding="windows-1252" ?> 
<LIBRARY> <!-- 1 -->
   <STUDENTS> <!-- 2 -->
      <ID> 0755 </ID>
      <NAME> Brad Martin </NAME>
      <ADDRESS> 1611 Glengreen </ADDRESS>
      <CITY> Huntsville </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>

   <STUDENTS> <!-- 3 -->
      <ID> 1522 </ID>
      <NAME> Zac Harvell </NAME>
      <ADDRESS> 11900 Glenda </ADDRESS>
      <CITY> Houston </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>
.
.  more instances of <STUDENTS> 
.
</LIBRARY>

When the previous XML document is imported, the following happens:

  1. The XML engine recognizes <LIBRARY> as the root-enclosing element.

  2. The engine goes to the second-level instance tag, which is <STUDENTS>, translates it as the data set name, and begins scanning the elements that are nested (contained) between the <STUDENTS> start tag and the </STUDENTS> end tag, looking for variables.

  3. Because the instance tags <ID>, <NAME>, <ADDRESS>, <CITY>, and <STATE> are contained within the <STUDENTS> start tag and </STUDENTS> end tag, the XML engine interprets them as variables. The individual instance tag names become the data set variable names. The repeating element instances are translated into a collection of rows with a constant set of columns.

These statements result in the following SAS output:

libname test xml 'C:\My Documents\test\students.xml';  

proc print data=test.students;
run;

PROC PRINT of TEST.STUDENTS

 
ID       NAME            ADDRESS            CITY           STATE  

0755     Brad Martin     1611 Glengreen     Huntsville     Texas 
1522     Zac Harvell     11900 Glenda       Houston        Texas 
. 
. 
.
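
To check how the XML engine interpreted the nested elements, one option is to list the variables with the CONTENTS procedure. The following is a minimal sketch that reuses the TEST libref and the file path from the statements above. Its output is not shown here because it depends on your SAS session.

```sas
/* Assign the libref with the XML engine, as in the statements above. */
libname test xml 'C:\My Documents\test\students.xml';

/* List the variables that the engine derived from the nested elements. */
proc contents data=test.students;
run;
```

If the document has the required physical structure, the listing should show one variable for each of ID, NAME, ADDRESS, CITY, and STATE.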

Why Is a Specific Physical Structure Required?

Well-formed XML is determined by structure, not content. Although the XML engine can assume that the XML document is valid, well-formed XML, it cannot assume that the root element encloses only instances of a single node element (that is, only a single data set). The engine therefore has to account for the possibility of multiple nodes, that is, multiple SAS data sets.

For example, when the following correctly structured XML document is imported, it is recognized as containing two SAS data sets: HIGHTEMP and LOWTEMP.

<?xml version="1.0" encoding="windows-1252" ?> 
<CLIMATE> <!-- 1 -->
   <HIGHTEMP> <!-- 2 -->
      <PLACE> Libya </PLACE>
      <DATE> 1922-09-13 </DATE>
      <DEGREE-F> 136 </DEGREE-F>
      <DEGREE-C> 58 </DEGREE-C>
   </HIGHTEMP>
.
.  more instances of <HIGHTEMP> 
.
   <LOWTEMP> <!-- 3 -->
      <PLACE> Antarctica </PLACE>
      <DATE> 1983-07-21 </DATE>
      <DEGREE-F> -129 </DEGREE-F>
      <DEGREE-C> -89 </DEGREE-C>
   </LOWTEMP>
.
.  more instances of <LOWTEMP> 
.
</CLIMATE>

When the previous XML document is imported, the following happens:

  1. The XML engine recognizes the first instance tag <CLIMATE> as the root-enclosing element, which is the container for the document.

  2. Starting with the second-level instance tag, which is <HIGHTEMP>, the XML engine uses the repeating element instances as a collection of rows with a constant set of columns.

  3. When the second-level instance tag changes, the XML engine interprets that change as a different SAS data set.

The result is two SAS data sets: HIGHTEMP and LOWTEMP. Both happen to have the same variables, but of course, different data.

To verify that an import result is what you expect, use the DATASETS procedure. For example, these SAS statements result in the following:

libname climate xml 'C:\My Documents\xml\climate.xml';  

proc datasets library=climate;
quit;

PROC DATASETS Output for CLIMATE Library

                            -----Directory-----

                Libref:        CLIMATE
                Engine:        XML
                Physical Name: C:\My Documents\xml\climate.xml

                             #  Name     Memtype
                             -------------------
                             1 HIGHTEMP  DATA
                             2 LOWTEMP   DATA
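
Because each distinct second-level tag becomes its own data set, each one is referenced by its own two-level name. As a sketch that assumes the CLIMATE libref above is still assigned, the following statements print each data set and copy one of them to the WORK library. WORK.HIGHTEMP is an illustrative target name and is not part of the original example.

```sas
/* Assumes the CLIMATE libref from the LIBNAME statement above. */
proc print data=climate.hightemp;
run;

proc print data=climate.lowtemp;
run;

/* Copy one imported data set to WORK (illustrative target name). */
data work.hightemp;
   set climate.hightemp;
run;
```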

Handling XML Documents That Are Not in the Required Physical Structure

If your XML document is not in the required physical structure, you can tell the XML engine how to interpret the XML markup in order to successfully import the document. See Importing XML Documents Using an XMLMap.
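
As a rough sketch of what that looks like, an XMLMap is supplied to the engine through the XMLMAP= option in the LIBNAME statement. The libref MYDOC, the fileref MAP, and both file paths below are hypothetical. The XMLMap file itself has to be written separately to describe your document.

```sas
/* Hypothetical paths: replace with your own XML document and XMLMap file. */
filename map 'C:\My Documents\xml\mydoc.map';

/* XMLMAP= points the XML engine at the map.                  */
/* ACCESS=READONLY is used because the document is only read. */
libname mydoc xml 'C:\My Documents\xml\mydoc.xml' xmlmap=map access=readonly;

/* Check which data sets the map produced. */
proc datasets library=mydoc;
quit;
```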
