
Importing XML Documents Using an XMLMap

Using an XMLMap to Import an XML Document as Multiple SAS Data Sets

This example explains how to create and use an XMLMap that maps XML markup into two SAS data sets. The example uses the XML document RSS.XML, which the XML engine cannot import as is because its markup is not structured in a way that the engine can translate directly.

Note: The XML document RSS.XML uses the XML format RSS (Rich Site Summary), which was designed by Netscape originally for exchange of content within the My Netscape Network (MNN) community. The RSS format has been widely adopted for sharing headlines and other Web content and is a good example of XML as a transmission format.

First, here is the XML document RSS.XML to be imported:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="0.91">
   <channel>
      <title>WriteTheWeb</title>
      <link>http://writetheweb.com</link>
      <description>News for web users that write back</description>
      <language>en-us</language>
      <copyright>Copyright 2000, WriteTheWeb team.</copyright>
      <managingEditor>editor@writetheweb.com</managingEditor>
      <webMaster>webmaster@writetheweb.com</webMaster>
      <image>
         <title>WriteTheWeb</title>
         <url>http://writetheweb.com/images/mynetscape88.gif</url>
         <link>http://writetheweb.com</link>
         <width>88</width>
         <height>31</height>
         <description>News for web users that write back</description>
      </image>
      <item>
         <title>Giving the world a pluggable Gnutella</title>
         <link>http://writetheweb.com/read.php?item=24</link>
         <description>WorldOS is a framework on which to build programs that work
like Freenet or Gnutella -allowing distributed applications using
peer-to-peer routing.</description>
      </item>
      <item>
         <title>Syndication discussions hot up</title>
         <link>http://writetheweb.com/read.php?item=23</link>
         <description>After a period of dormancy, the Syndication mailing list
has become active again, with contributions from leaders in traditional media
and Web syndication.</description>
      </item>
      <item>
         <title>Personal web server integrates file sharing and messaging
   </title>
         <link>http://writetheweb.com/read.php?item=22</link>
         <description>The Magi Project is an innovative project to create a
combined personal web server and messaging system that enables the sharing
and synchronization of information across desktop, laptop and palmtop devices.
    </description>
      </item>
      <item>
         <title>Syndication and Metadata</title>
         <link>http://writetheweb.com/read.php?item=21</link>
         <description>RSS is probably the best known metadata format around.
RDF is probably one of the least understood. In this essay, published on my
O&apos;Reilly Network weblog, I argue that the next generation of RSS
should be based on RDF.</description>
      </item>
      <item>
         <title>UK bloggers get organized</title>
         <link>http://writetheweb.com/read.php?item=20</link>
         <description>Looks like the weblogs scene is gathering pace beyond
the shores of the US. There&apos;s now a UK-specific page on weblogs.com,
and a mailing list at egroups.</description>
      </item>
      <item>
         <title>Yournamehere.com more important than anything</title>
         <link>http://writetheweb.com/read.php?item=19</link>
         <description>Whatever you&apos;re publishing on the web, your site
name is the most valuable asset you have, according to Carl Steadman.
     </description>
      </item>
   </channel>
</rss>

The XML document can be imported by creating an XMLMap that defines how to map the XML markup. The following XMLMap, named RSS.MAP, contains the syntax that is needed to import RSS.XML. The syntax tells the XML engine how to interpret the XML markup, as explained in the descriptions that follow the XMLMap. The contents of RSS.XML are imported as two SAS data sets: CHANNEL, which contains the content information, and ITEMS, which contains the individual news stories.

<?xml version="1.0" ?>

<SXLEMAP version="1.2"> <!-- 1 -->
   
   <!-- TABLE (CHANNEL) -->
   <!-- top level channel content description (TOC) -->
   <TABLE name="CHANNEL"> <!-- 2 -->
      <TABLE-PATH syntax="XPATH"> /rss/channel </TABLE-PATH> <!-- 3 -->
      <TABLE-END-PATH syntax="XPATH" beginend="BEGIN"> 
         /rss/channel/item </TABLE-END-PATH> <!-- 4 -->

      <!-- title -->
      <COLUMN name="title"> <!-- 5 -->
         <PATH> /rss/channel/title </PATH>
         <TYPE> character </TYPE>
         <DATATYPE> string </DATATYPE>
         <LENGTH> 200 </LENGTH>
      </COLUMN>

      <!-- link -->
      <COLUMN name="link"> <!-- 6 -->
         <PATH> /rss/channel/link </PATH>
         <TYPE> character </TYPE>
         <DATATYPE> string </DATATYPE>
         <LENGTH> 200 </LENGTH>
         <DESCRIPTION> Story link </DESCRIPTION>
      </COLUMN>

      <!-- description -->
      <COLUMN name="description">
         <PATH> /rss/channel/description </PATH>
         <TYPE> character </TYPE>
         <DATATYPE> string </DATATYPE>
         <LENGTH> 1024 </LENGTH>
      </COLUMN>

      <!-- language -->
      <COLUMN name="language">
         <PATH> /rss/channel/language </PATH>
         <TYPE> character </TYPE>
         <DATATYPE> string </DATATYPE>
         <LENGTH> 8 </LENGTH>
      </COLUMN>

      <!-- version -->
      <COLUMN name="version"> <!-- 7 -->
         <PATH> /rss@version </PATH>
         <TYPE> character </TYPE>
         <DATATYPE> string </DATATYPE>
         <LENGTH> 8 </LENGTH>
      </COLUMN>
   </TABLE>

   
   <!-- TABLE (ITEMS) -->
   <!-- individual news stories -->
   <TABLE name="ITEMS"> <!-- 8 -->
      <TABLE-PATH syntax="XPATH"> /rss/channel/item </TABLE-PATH>
      <TABLE-DESCRIPTION> Individual news stories </TABLE-DESCRIPTION>

      <!-- title -->
      <COLUMN name="title"> <!-- 9 -->
         <PATH> /rss/channel/item/title </PATH>
         <TYPE> character </TYPE>
         <DATATYPE> string </DATATYPE>
         <LENGTH> 200 </LENGTH>
      </COLUMN>

      <!-- link -->
      <!-- link is renamed to url, assigned a label and max length -->
      <COLUMN name="URL"> <!-- 10 -->
         <PATH> /rss/channel/item/link </PATH>
         <TYPE> character </TYPE>
         <DATATYPE> string </DATATYPE>
         <LENGTH> 200 </LENGTH>
         <DESCRIPTION> Story link </DESCRIPTION>
      </COLUMN>

      <!-- description -->
      <COLUMN name="description">
         <PATH> /rss/channel/item/description </PATH>
         <TYPE> character </TYPE>
         <DATATYPE> string </DATATYPE>
         <LENGTH> 1024 </LENGTH>
      </COLUMN>
   </TABLE>

</SXLEMAP>

The previous XMLMap defines how to translate the XML markup as explained below:

  1. Root-enclosing element for SAS data set definitions.

  2. Element for the CHANNEL data set definition.

  3. Element specifying the location path that defines where in the XML document to collect variables for the CHANNEL data set.

  4. Element specifying the location path that specifies when to stop processing data for the CHANNEL data set.

  5. Element containing the attributes for the TITLE variable in the CHANNEL data set. The XPath construction specifies where to find the current tag and accesses the data in the named element.

  6. Subsequent COLUMN elements define the variables LINK, DESCRIPTION, and LANGUAGE for the CHANNEL data set.

  7. Element containing the attributes for the last variable in the CHANNEL data set, which is VERSION. This XPath construction specifies where to find the current tag and uses the attribute form (@version) to access the data in the named attribute.

  8. Element for the ITEMS data set definition.

  9. Element containing the attributes for the TITLE variable in the ITEMS data set.

  10. Subsequent COLUMN elements define other variables for the ITEMS data set, which are URL and DESCRIPTION.
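
To make the mapping concrete, the following Python sketch reproduces the same table and column paths outside SAS. It is only an illustration of what the XMLMap selects, not a description of how the XML engine is implemented. The names RSS_SAMPLE, text_of, and import_rss are hypothetical, the sample is a shortened copy of RSS.XML with two items, and trimming leading and trailing white space from values is an assumption made for readability.

```python
import xml.etree.ElementTree as ET

# Shortened copy of RSS.XML (two of the six items), used only for illustration.
RSS_SAMPLE = """<rss version="0.91">
   <channel>
      <title>WriteTheWeb</title>
      <link>http://writetheweb.com</link>
      <description>News for web users that write back</description>
      <language>en-us</language>
      <item>
         <title>Giving the world a pluggable Gnutella</title>
         <link>http://writetheweb.com/read.php?item=24</link>
         <description>WorldOS is a framework on which to build programs that work
like Freenet or Gnutella -allowing distributed applications using
peer-to-peer routing.</description>
      </item>
      <item>
         <title>Syndication discussions hot up</title>
         <link>http://writetheweb.com/read.php?item=23</link>
         <description>After a period of dormancy, the Syndication mailing list
has become active again, with contributions from leaders in traditional media
and Web syndication.</description>
      </item>
   </channel>
</rss>"""


def text_of(parent, tag):
    """Return the trimmed text of a direct child element, or None if absent."""
    child = parent.find(tag)  # find() with a bare tag looks at direct children only
    if child is None or child.text is None:
        return None
    return child.text.strip()  # assumption: trim surrounding white space


def import_rss(xml_text):
    """Build CHANNEL and ITEMS row lists using the same paths as RSS.MAP."""
    root = ET.fromstring(xml_text)    # the <rss> element
    channel = root.find("channel")    # table path /rss/channel

    # CHANNEL: one row per /rss/channel. Only direct children are read, so the
    # <title>, <link>, and <description> inside each <item> are not collected.
    channel_rows = [{
        "title": text_of(channel, "title"),
        "link": text_of(channel, "link"),
        "description": text_of(channel, "description"),
        "language": text_of(channel, "language"),
        "version": root.get("version"),   # attribute form: /rss@version
    }]

    # ITEMS: one row per /rss/channel/item. <link> is stored under the name URL.
    item_rows = [{
        "title": text_of(item, "title"),
        "URL": text_of(item, "link"),
        "description": text_of(item, "description"),
    } for item in channel.findall("item")]

    return channel_rows, item_rows
```

Calling import_rss(RSS_SAMPLE) returns one CHANNEL row and one ITEMS row for each item element in the sample. Reading only the direct children of the channel element plays a role comparable to TABLE-END-PATH here: values that belong to the item elements do not end up in the CHANNEL row.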

The following SAS statements import the XML document RSS.XML and specify the XMLMap named RSS.MAP. The DATASETS procedure then verifies the import results:

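/* Assign filerefs to the XML document and to the XMLMap. */
filename rss 'C:\My Documents\xml\rss.xml';
filename map 'C:\My Documents\xml\rss.map';

/* The libref RSS matches the fileref RSS, so the XML engine reads that file. */
/* XMLMAP= names the fileref for the XMLMap; the library is read-only.        */
libname rss xml xmlmap=map access=readonly;

/* List the data sets that the XMLMap defines. */
proc datasets library=rss;
run;
quit;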

PROC DATASETS Output for RSS Library Showing Two Data Sets

                                            Directory

                                     Libref         RSS
                                     Engine         XML
                                     Access         READONLY
                                     Physical Name  RSS
                                     XMLType        GENERIC
                                     XMLMap         MAP


                                                   Member
                                       #  Name     Type

                                       1  CHANNEL  DATA
                                       2  ITEMS    DATA
