Understanding the Required Physical Structure for an XML Document to Be Imported Using the GENERIC Markup Type

What Is the Required Physical Structure?

For an XML document to be imported successfully, its well-formed XML must map to SAS structures as follows:
  • The root-enclosing element (top-level node) of an XML document is the document container. For SAS, it is like the SAS library.
  • The nested elements (repeating element instances) that occur within the container begin with the second-level instance tag.
  • The repeating element instances must represent a rectangular organization. For a SAS data set, they determine the observation boundary that becomes a collection of rows with a constant set of columns.
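The rectangular requirement in the last bullet can be checked before an import. The following Python sketch is an illustration only and is not part of SAS. It assumes that "rectangular" means every instance of a second-level tag carries the same set of child element names. The function name `non_rectangular_rows` and both sample strings are made up for this example, and the sketch says nothing about how the XML engine itself treats a row that fails the check.

```python
import xml.etree.ElementTree as ET


def non_rectangular_rows(xml_text):
    """Return (tag, position) pairs for second-level instances whose child
    element names differ from the first instance that has the same tag."""
    root = ET.fromstring(xml_text)      # root-enclosing element (container)
    expected = {}                       # second-level tag -> columns of its first instance
    problems = []
    for position, instance in enumerate(root):
        columns = [child.tag for child in instance]
        first = expected.setdefault(instance.tag, columns)
        if sorted(columns) != sorted(first):
            problems.append((instance.tag, position))
    return problems


# Hypothetical sample documents: the second one drops <NAME> from one row.
GOOD = (
    "<LIBRARY>"
    "<STUDENTS><ID>0755</ID><NAME>Brad Martin</NAME></STUDENTS>"
    "<STUDENTS><ID>1522</ID><NAME>Zac Harvell</NAME></STUDENTS>"
    "</LIBRARY>"
)
RAGGED = (
    "<LIBRARY>"
    "<STUDENTS><ID>0755</ID><NAME>Brad Martin</NAME></STUDENTS>"
    "<STUDENTS><ID>1522</ID></STUDENTS>"
    "</LIBRARY>"
)

print(non_rectangular_rows(GOOD))    # []
print(non_rectangular_rows(RAGGED))  # [('STUDENTS', 1)]
```

An empty result suggests that each group of repeating instances has a constant set of columns.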
Here is an example of an XML document that illustrates the physical structure that is required:
<?xml version="1.0" encoding="windows-1252" ?>
<LIBRARY> <!-- 1 -->
   <STUDENTS> <!-- 2 -->
      <ID> 0755 </ID>
      <NAME> Brad Martin </NAME>
      <ADDRESS> 1611 Glengreen </ADDRESS>
      <CITY> Huntsville </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>

   <STUDENTS> <!-- 3 -->
      <ID> 1522 </ID>
      <NAME> Zac Harvell </NAME>
      <ADDRESS> 11900 Glenda </ADDRESS>
      <CITY> Houston </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>
.
.  more instances of <STUDENTS>
.
</LIBRARY>
When the previous XML document is imported, the following happens:
1 The XML engine recognizes <LIBRARY> as the root-enclosing element.
2 The engine goes to the second-level instance tag, which is <STUDENTS>, translates it as the data set name, and begins scanning the elements that are nested (contained) between the <STUDENTS> start tag and the </STUDENTS> end tag, looking for variables.
3 Because the instance tags <ID>, <NAME>, <ADDRESS>, <CITY>, and <STATE> are contained within the <STUDENTS> start tag and </STUDENTS> end tag, the XML engine interprets them as variables. The individual instance tag names become the data set variable names. The repeating element instances are translated into a collection of rows with a constant set of columns.
These statements result in the following SAS output:
libname test xml 'C:\My Documents\students.xml';

proc print data=test.students;
run;
PRINT Procedure Output for TEST.STUDENTS
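The three import steps can also be modelled outside SAS. This Python sketch only mimics the mapping described above (root element as container, second-level tag as data set name, nested tags as variables). It is not how the XML engine is implemented, and `import_generic` is a hypothetical name. The XML declaration is left out of the sample string to keep the parse independent of encoding.

```python
import xml.etree.ElementTree as ET

# Sample document from this section, without the XML declaration.
STUDENTS_XML = """<LIBRARY>
   <STUDENTS>
      <ID> 0755 </ID>
      <NAME> Brad Martin </NAME>
      <ADDRESS> 1611 Glengreen </ADDRESS>
      <CITY> Huntsville </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>
   <STUDENTS>
      <ID> 1522 </ID>
      <NAME> Zac Harvell </NAME>
      <ADDRESS> 11900 Glenda </ADDRESS>
      <CITY> Houston </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>
</LIBRARY>"""


def import_generic(xml_text):
    """Map a GENERIC-style document to {data set name: [row dict, ...]}."""
    root = ET.fromstring(xml_text)           # step 1: root-enclosing element
    tables = {}
    for instance in root:                    # step 2: second-level instance tags
        # step 3: nested element names become the variable (column) names
        row = {child.tag: (child.text or "").strip() for child in instance}
        tables.setdefault(instance.tag, []).append(row)
    return tables


tables = import_generic(STUDENTS_XML)
print(list(tables))                      # ['STUDENTS']
print(tables["STUDENTS"][0]["NAME"])     # Brad Martin
```

Each dictionary in `tables["STUDENTS"]` corresponds to one observation, and its keys correspond to the variable names ID, NAME, ADDRESS, CITY, and STATE.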

Why Is a Specific Physical Structure Required?

Well-formed XML is determined by structure, not content. Consequently, although the XML engine can assume that an XML document is valid, well-formed XML, it cannot assume that the root element encloses only instances of a single node element (that is, only a single data set). The engine therefore has to account for the possibility of multiple nodes (that is, multiple SAS data sets).
For example, when the following correctly structured XML document is imported, it is recognized as containing two SAS data sets: HIGHTEMP and LOWTEMP.
<?xml version="1.0" encoding="windows-1252" ?>
<CLIMATE> <!-- 1 -->
   <HIGHTEMP> <!-- 2 -->
      <PLACE> Libya </PLACE>
      <DATE> 1922-09-13 </DATE>
      <DEGREE-F> 136 </DEGREE-F>
      <DEGREE-C> 58 </DEGREE-C>
   </HIGHTEMP>
.
.  more instances of <HIGHTEMP>
.
   <LOWTEMP> <!-- 3 -->
      <PLACE> Antarctica </PLACE>
      <DATE> 1983-07-21 </DATE>
      <DEGREE-F> -129 </DEGREE-F>
      <DEGREE-C> -89 </DEGREE-C>
   </LOWTEMP>
.
.  more instances of <LOWTEMP>
.
</CLIMATE>
When the previous XML document is imported, the following happens:
1 The XML engine recognizes the first instance tag <CLIMATE> as the root-enclosing element, which is the container for the document.
2 Starting with the second-level instance tag, which is <HIGHTEMP>, the XML engine uses the repeating element instances as a collection of rows with a constant set of columns.
3 When the second-level instance tag changes, the XML engine interprets that change as a different SAS data set.
The result is two SAS data sets: HIGHTEMP and LOWTEMP. Both happen to have the same variables but different data.
To ensure that an import result is what you expect, use the DATASETS procedure. For example, these SAS statements result in the following:
libname climate xml 'C:\My Documents\climate.xml';

proc datasets library=climate;
quit;
DATASETS Procedure Output for CLIMATE Library
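A rough equivalent of this check can be run on the XML file itself before the import. The Python sketch below is an assumption-laden illustration, not a reproduction of PROC DATASETS output: it counts the instances of each second-level tag, which, under the rules described above, correspond to the data sets and their rows. The name `list_members` is hypothetical, and the sample string holds only one instance of each tag.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened version of the climate document, without the XML declaration.
CLIMATE_XML = """<CLIMATE>
   <HIGHTEMP>
      <PLACE> Libya </PLACE>
      <DATE> 1922-09-13 </DATE>
      <DEGREE-F> 136 </DEGREE-F>
      <DEGREE-C> 58 </DEGREE-C>
   </HIGHTEMP>
   <LOWTEMP>
      <PLACE> Antarctica </PLACE>
      <DATE> 1983-07-21 </DATE>
      <DEGREE-F> -129 </DEGREE-F>
      <DEGREE-C> -89 </DEGREE-C>
   </LOWTEMP>
</CLIMATE>"""


def list_members(xml_text):
    """Count instances per second-level tag (one tag per expected data set)."""
    root = ET.fromstring(xml_text)
    return Counter(instance.tag for instance in root)


members = list_members(CLIMATE_XML)
print(dict(members))   # {'HIGHTEMP': 1, 'LOWTEMP': 1}
```

Two keys in the result suggest that the import would produce two data sets, matching the HIGHTEMP and LOWTEMP members described above.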

Handling XML Documents That Are Not in the Required Physical Structure

If your XML document is not in the required physical structure, you can use an XMLMap to tell the XML engine how to interpret the XML markup so that the document imports successfully. See "Why Use an XMLMap When Importing?"