Importing XML Documents Using an XMLMap
What Is the Required Physical Structure?
For the XML engine to import an XML document successfully, the document must be well-formed XML whose structure translates to SAS as follows:
The root-enclosing element (top-level node) of an XML document is the document container. For SAS, it is like the SAS library.
The nested elements (repeating element instances) that occur within the container begin with the second-level instance tag.
The repeating element instances must represent a rectangular organization. For a SAS data set, they determine the observation boundary that becomes a collection of rows with a constant set of columns.
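One way to make the rectangular requirement concrete is a small check written outside SAS. The following Python sketch is not how the XML engine works internally. It assumes a strict reading of the rule: every instance of a given second-level element carries the same child elements in the same order, and none of those children contains further nested elements. The function name is_rectangular and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_rectangular(xml_text):
    """Strict check (an assumption, not the SAS engine's rule): within each
    second-level element name, every instance has the same child tags in the
    same order, and no child contains nested elements."""
    root = ET.fromstring(xml_text)          # root-enclosing element (container)
    columns_by_table = {}
    for instance in root:                   # second-level repeating instances
        if any(len(child) > 0 for child in instance):
            return False                    # a would-be column has children
        columns = [child.tag for child in instance]
        expected = columns_by_table.setdefault(instance.tag, columns)
        if columns != expected:
            return False                    # column set changed between rows
    return bool(columns_by_table)           # an empty container has no rows


# Hypothetical sample documents for illustration only.
good = """<LIBRARY>
  <STUDENTS><ID>0755</ID><NAME>Brad Martin</NAME></STUDENTS>
  <STUDENTS><ID>1522</ID><NAME>Zac Harvell</NAME></STUDENTS>
</LIBRARY>"""

nested = """<LIBRARY>
  <STUDENTS>
    <ID>0755</ID>
    <NAME><FIRST>Brad</FIRST><LAST>Martin</LAST></NAME>
  </STUDENTS>
</LIBRARY>"""

ragged = """<LIBRARY>
  <STUDENTS><ID>0755</ID><NAME>Brad Martin</NAME></STUDENTS>
  <STUDENTS><ID>1522</ID></STUDENTS>
</LIBRARY>"""

print(is_rectangular(good))    # True
print(is_rectangular(nested))  # False
print(is_rectangular(ragged))  # False
```

The check is deliberately stricter than an import may require. Its purpose is only to show what "a constant set of columns" means for the repeating instances.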
Here is an example of an XML document that illustrates the physical structure that is required:
<?xml version="1.0" encoding="windows-1252" ?>
<LIBRARY>                                   <!-- 1 -->
   <STUDENTS>                               <!-- 2 -->
      <ID> 0755 </ID>
      <NAME> Brad Martin </NAME>
      <ADDRESS> 1611 Glengreen </ADDRESS>
      <CITY> Huntsville </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>
   <STUDENTS>                               <!-- 3 -->
      <ID> 1522 </ID>
      <NAME> Zac Harvell </NAME>
      <ADDRESS> 11900 Glenda </ADDRESS>
      <CITY> Houston </CITY>
      <STATE> Texas </STATE>
   </STUDENTS>
   .
   .   more instances of <STUDENTS>
   .
</LIBRARY>
When the previous XML document is imported, the following happens:
1. The XML engine recognizes <LIBRARY> as the root-enclosing element.
2. The engine goes to the second-level instance tag, which is <STUDENTS>, translates it as the data set name, and begins scanning the elements that are nested (contained) between the <STUDENTS> start tag and the </STUDENTS> end tag, looking for variables.
3. Because the instance tags <ID>, <NAME>, <ADDRESS>, <CITY>, and <STATE> are contained within the <STUDENTS> start tag and </STUDENTS> end tag, the XML engine interprets them as variables. The individual instance tag names become the data set variable names. The repeating element instances are translated into a collection of rows with a constant set of columns.
These statements result in the following SAS output:
libname test xml 'C:\My Documents\test\students.xml';

proc print data=test.students;
run;
                          The SAS System                          1

Obs    STATE    CITY          ADDRESS           NAME             ID
  1    Texas    Huntsville    1611 Glengreen    Brad Martin     755
  2    Texas    Houston       11900 Glenda      Zac Harvell    1522
  .
  .
  .
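The import steps described above can be modeled outside SAS to see how element names turn into a data set name, variable names, and rows. The following Python sketch is an illustration under stated assumptions, not the XML engine's implementation: the function name import_rows is hypothetical, the document is shortened to the two instances shown earlier, and only the first second-level tag is treated as the data set.

```python
import xml.etree.ElementTree as ET

# The two <STUDENTS> instances from the example document.
doc = """<LIBRARY>
  <STUDENTS>
    <ID> 0755 </ID>
    <NAME> Brad Martin </NAME>
    <ADDRESS> 1611 Glengreen </ADDRESS>
    <CITY> Huntsville </CITY>
    <STATE> Texas </STATE>
  </STUDENTS>
  <STUDENTS>
    <ID> 1522 </ID>
    <NAME> Zac Harvell </NAME>
    <ADDRESS> 11900 Glenda </ADDRESS>
    <CITY> Houston </CITY>
    <STATE> Texas </STATE>
  </STUDENTS>
</LIBRARY>"""


def import_rows(xml_text):
    """Model of the steps: root is the container, the second-level tag names
    the data set, and its child tags name the variables."""
    root = ET.fromstring(xml_text)            # step 1: root-enclosing element
    first = root[0]                           # step 2: second-level instance
    dataset_name = first.tag
    variables = [child.tag for child in first]  # step 3: nested tags
    rows = [
        {child.tag: (child.text or "").strip() for child in instance}
        for instance in root
        if instance.tag == dataset_name
    ]
    return dataset_name, variables, rows


name, variables, rows = import_rows(doc)
print(name)        # STUDENTS
print(variables)   # ['ID', 'NAME', 'ADDRESS', 'CITY', 'STATE']
for row in rows:
    print(row)
```

The sketch keeps every value as text, so the first ID stays 0755. The PROC PRINT output above shows 755, which suggests that the engine read ID as a numeric variable. The sketch also lists the variables in document order, whereas the PROC PRINT output shows them in a different order.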
Why Is a Specific Physical Structure Required?
Well-formed XML is determined by structure, not content. Although the XML engine can assume that the XML document is valid, well-formed XML, it cannot assume that the root element encloses only instances of a single node element, that is, only a single data set. The engine therefore has to account for the possibility of multiple nodes, that is, multiple SAS data sets.
For example, when the following correctly structured XML document is imported, it is recognized as containing two SAS data sets: HIGHTEMP and LOWTEMP.
<?xml version="1.0" encoding="windows-1252" ?>
<CLIMATE>                                   <!-- 1 -->
   <HIGHTEMP>                               <!-- 2 -->
      <PLACE> Libya </PLACE>
      <DATE> 1922-09-13 </DATE>
      <DEGREE-F> 136 </DEGREE-F>
      <DEGREE-C> 58 </DEGREE-C>
   </HIGHTEMP>
   .
   .   more instances of <HIGHTEMP>
   .
   <LOWTEMP>                                <!-- 3 -->
      <PLACE> Antarctica </PLACE>
      <DATE> 1983-07-21 </DATE>
      <DEGREE-F> -129 </DEGREE-F>
      <DEGREE-C> -89 </DEGREE-C>
   </LOWTEMP>
   .
   .   more instances of <LOWTEMP>
   .
</CLIMATE>
When the previous XML document is imported, the following happens:
1. The XML engine recognizes the first instance tag <CLIMATE> as the root-enclosing element, which is the container for the document.
2. Starting with the second-level instance tag, which is <HIGHTEMP>, the XML engine uses the repeating element instances as a collection of rows with a constant set of columns.
3. When the second-level instance tag changes, the XML engine interprets that change as a different SAS data set.
The result is two SAS data sets: HIGHTEMP and LOWTEMP. Both have the same variables but contain different data.
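The rule that a change in the second-level tag starts a different data set can be sketched in a few lines. The following Python example is a model of that rule only, not the engine's actual logic: the function name split_into_tables is hypothetical, the document is shortened to one instance per node, and rows are grouped by tag name.

```python
import xml.etree.ElementTree as ET

# One instance of each second-level element from the CLIMATE example.
doc = """<CLIMATE>
  <HIGHTEMP>
    <PLACE> Libya </PLACE>
    <DATE> 1922-09-13 </DATE>
    <DEGREE-F> 136 </DEGREE-F>
    <DEGREE-C> 58 </DEGREE-C>
  </HIGHTEMP>
  <LOWTEMP>
    <PLACE> Antarctica </PLACE>
    <DATE> 1983-07-21 </DATE>
    <DEGREE-F> -129 </DEGREE-F>
    <DEGREE-C> -89 </DEGREE-C>
  </LOWTEMP>
</CLIMATE>"""


def split_into_tables(xml_text):
    """Group second-level instances by tag name; each distinct tag name
    stands for one data set in this model."""
    root = ET.fromstring(xml_text)      # <CLIMATE> is the container
    tables = {}
    for instance in root:
        row = {child.tag: (child.text or "").strip() for child in instance}
        tables.setdefault(instance.tag, []).append(row)
    return tables


tables = split_into_tables(doc)
print(list(tables))                      # ['HIGHTEMP', 'LOWTEMP']
print(tables["LOWTEMP"][0]["PLACE"])     # Antarctica
```

Grouping by tag name is a simplification chosen for brevity. It yields the same two members, HIGHTEMP and LOWTEMP, that the text describes for this document.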
To ensure that an import result is what you expect, use the DATASETS procedure. For example, these SAS statements result in the following:
libname climate xml 'C:\My Documents\xml\climate.xml';

proc datasets library=climate;
quit;
PROC DATASETS Output for CLIMATE Library
Directory

Libref           CLIMATE
Engine           XML
Physical Name    C:\My Documents\xml\climate.xml
XMLType          GENERIC
XMLMap           NO XMLMAP IN EFFECT

                 Member
#    Name        Type
1    HIGHTEMP    DATA
2    LOWTEMP     DATA
Handling XML Documents That Are Not in the Required Physical Structure
If your XML document is not in the required physical structure, you can tell the XML engine how to interpret the XML markup in order to successfully import the document. See Importing XML Documents Using an XMLMap.
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.