This example explains how to create and use an XMLMap that maps XML
markup into two SAS data sets. The example uses the XML document RSS.XML,
which cannot be imported without an XMLMap because its markup is not
structured in a form that the XML engine can translate directly.
Note: The XML document RSS.XML
uses the XML format RSS (Rich Site Summary), which was designed by
Netscape originally for exchange of content within the My Netscape
Network (MNN) community. The RSS format has been widely adopted for
sharing headlines and other Web content and is a good example of XML
as a transmission format.
Here is the XML document
RSS.XML to be imported:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="0.91">
<channel>
<title>WriteTheWeb</title>
<link>http://writetheweb.com</link>
<description>News for web users that write back</description>
<language>en-us</language>
<copyright>Copyright 2000, WriteTheWeb team.</copyright>
<managingEditor>editor@writetheweb.com</managingEditor>
<webMaster>webmaster@writetheweb.com</webMaster>
<image>
<title>WriteTheWeb</title>
<url>http://writetheweb.com/images/mynetscape88.gif</url>
<link>http://writetheweb.com</link>
<width>88</width>
<height>31</height>
<description>News for web users that write back</description>
</image>
<item>
<title>Giving the world a pluggable Gnutella</title>
<link>http://writetheweb.com/read.php?item=24</link>
<description>WorldOS is a framework on which to build programs that work like
Freenet or Gnutella -allowing distributed applications using peer-to-peer
routing.
</description>
</item>
<item>
<title>Syndication discussions hot up</title>
<link>http://writetheweb.com/read.php?item=23</link>
<description>After a period of dormancy, the Syndication mailing list has become
active again, with contributions from leaders in traditional media and Web
syndication.
</description>
</item>
<item>
<title>Personal web server integrates file sharing and messaging</title>
<link>http://writetheweb.com/read.php?item=22</link>
<description>The Magi Project is an innovative project to create a combined personal
web server and messaging system that enables the sharing and synchronization of
information across desktop, laptop and palmtop devices.</description>
</item>
<item>
<title>Syndication and Metadata</title>
<link>http://writetheweb.com/read.php?item=21</link>
<description>RSS is probably the best known metadata format around. RDF is probably
one of the least understood. In this essay, published on my O'Reilly Network
weblog, I argue that the next generation of RSS should be based on RDF.
</description>
</item>
<item>
<title>UK bloggers get organised</title>
<link>http://writetheweb.com/read.php?item=20</link>
<description>Looks like the weblogs scene is gathering pace beyond the shores of the
US. There's now a UK-specific page on weblogs.com, and a mailing list at egroups.
</description>
</item>
<item>
<title>Yournamehere.com more important than anything</title>
<link>http://writetheweb.com/read.php?item=19</link>
<description>Whatever you're publishing on the web, your site name is the most
valuable asset you have, according to Carl Steadman.</description>
</item>
</channel>
</rss>
The XML document can be imported by creating an XMLMap that defines how to
map the XML markup. The following XMLMap, named RSS.MAP, contains the
syntax that is needed to import RSS.XML. The syntax tells the XML engine
how to interpret the XML markup, as explained in the descriptions that
follow the XMLMap. The contents of RSS.XML result in two SAS data sets:
CHANNEL, which contains content information, and ITEMS, which contains
the individual news stories.
<?xml version="1.0" encoding="UTF-8"?>
<SXLEMAP name="SXLEMap" version="2.1"> <!-- 1 -->
<TABLE name="CHANNEL"> <!-- 2 -->
<TABLE-PATH syntax="XPath">/rss/channel</TABLE-PATH> <!-- 3 -->
<TABLE-END-PATH beginend="BEGIN" syntax="XPath">/rss/channel/item</TABLE-END-PATH> <!-- 4 -->
<COLUMN name="title"> <!-- 5 -->
<PATH syntax="XPath">/rss/channel/title</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>200</LENGTH>
</COLUMN>
<COLUMN name="link"> <!-- 6 -->
<PATH syntax="XPath">/rss/channel/link</PATH>
<DESCRIPTION>Story link</DESCRIPTION>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>200</LENGTH>
</COLUMN>
<COLUMN name="description">
<PATH syntax="XPath">/rss/channel/description</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>1024</LENGTH>
</COLUMN>
<COLUMN name="language">
<PATH syntax="XPath">/rss/channel/language</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>8</LENGTH>
</COLUMN>
<COLUMN name="version"> <!-- 7 -->
<PATH syntax="XPath">/rss@version</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>8</LENGTH>
</COLUMN>
</TABLE>
<TABLE description="Individual news stories" name="ITEMS"> <!-- 8 -->
<TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
<COLUMN name="title"> <!-- 9 -->
<PATH syntax="XPath">/rss/channel/item/title</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>200</LENGTH>
</COLUMN>
<COLUMN name="URL"> <!-- 10 -->
<PATH syntax="XPath">/rss/channel/item/link</PATH>
<DESCRIPTION>Story link</DESCRIPTION>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>200</LENGTH>
</COLUMN>
<COLUMN name="description"> <!-- 10 -->
<PATH syntax="XPath">/rss/channel/item/description</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>1024</LENGTH>
</COLUMN>
</TABLE>
</SXLEMAP>
The previous XMLMap defines how to translate the XML markup
as explained below:
1. Root-enclosing element for SAS data set definitions.
2. Element for the CHANNEL data set definition.
3. Element specifying the location path that defines where in the XML
   document to collect variables for the CHANNEL data set.
4. Element specifying the location path that specifies when to stop
   processing data for the CHANNEL data set.
5. Element containing the attributes for the TITLE variable in the
   CHANNEL data set. The XPath construction specifies where to find the
   current tag and to access data from the named element.
6. Subsequent COLUMN elements define the variables LINK, DESCRIPTION, and
   LANGUAGE for the CHANNEL data set.
7. Element containing the attributes for the last variable in the CHANNEL
   data set, which is VERSION. This XPath construction specifies where to
   find the current tag and uses the attribute form to access data from
   the named attribute.
8. Element for the ITEMS data set definition.
9. Element containing the attributes for the TITLE variable in the ITEMS
   data set.
10. Subsequent COLUMN elements define other variables for the ITEMS data
    set, which are URL and DESCRIPTION.
The following SAS statements
import the XML document RSS.XML and specify the XMLMap named RSS.MAP.
The DATASETS procedure then verifies the import results.
filename rss 'C:\My Documents\rss.xml';
filename map 'C:\My Documents\rss.map';
libname rss xmlv2 xmlmap=map access=readonly;
proc datasets library=rss;
run;
quit;
DATASETS Procedure Output for RSS Library Showing Two Data
Sets
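PROC DATASETS lists only the members of the library. To look at the imported variables and observations themselves, statements along the following lines could be run. This is a minimal sketch, not part of the original example: it assumes that the libref RSS assigned above is still active, and the data set name WORK.ITEMS is chosen here only for illustration.

```sas
/* Assumes the libref RSS from the preceding LIBNAME statement is still assigned. */

/* Show the variable names, types, and lengths that the XMLMap defines for CHANNEL. */
proc contents data=rss.channel;
run;

/* Print the ITEMS data set; one observation is expected for each <item> element. */
proc print data=rss.items;
run;

/* Illustrative: copy ITEMS to the WORK library, because the RSS library */
/* was assigned with ACCESS=READONLY and cannot be modified in place.    */
data work.items;
   set rss.items;
run;
```

Copying to WORK also means that later steps read a native SAS data set and do not need to read the XML document again.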