SAS Institute. The Power to Know

FOCUS AREAS

Return to previous page

Base SAS

Using the XMLMap Option for XML LIBNAME Engine in SAS 8.2

Contents


Introduction

To successfully import an XML document, SAS requires a specific XML physical structure in order to identify columns of data from collections of rows. If an XML document does not represent the required physical structure, the results can be unexpected and unwanted. Why Can't SAS Import My XML Document? explains the XML physical structure that SAS requires and also explains why the physical structure is required.

If your XML document does not import successfully, rather than transform the XML document, you can now use the enhanced SAS XML LIBNAME engine, which is available as an add on to Release 8.2. The enhanced SAS XML LIBNAME engine provides additional syntax in order to successfully import an XML document. With the new syntax, you tell SAS how to map XML markup into SAS format. This document describes the new XMLMap option and explains how to code the XMLMap file.


What is the XMLMap Option?

XMLMap is an option that you specify either on the LIBNAME XML statement or as a SAS data set option. XMLMap specifies a file that contains specific map syntax. The syntax tells SAS how to map the XML markup (contained in an XML document to be imported) into SAS data set(s), variables, and observations.

XMLMap Option Syntax

The syntax for the option is:

XMLMAP=fileref | 'external-file'

fileref
is the SAS name that is associated with the physical location of the XMLMap file. To assign a fileref, use the FILENAME statement.

'external-file'
is the physical location of the XMLMap file. Include the complete pathname and the file name. It is suggested that you use the file name extension .map.

Examples of Specifying XMLMap Option

The following examples import an XML document named MY.XML and specify the XMLMap file MY.MAP, which contains specific map syntax.

The first example specifies XMLMap as an option on the LIBNAME, XML statement :

   libname test xml 'C:\XMLdata\my.xml' xmlmap='C:\XMLdata\my.map';

   proc print data=test.my;
   run;

This example uses XMLMap as a data set option and also uses a fileref that is assigned to the XMLMap file:

   filename map 'C:\XMLdata\my.map';
   libname test xml 'C:\XMLdata\my.xml';

   proc print data=test.my (xmlmap=map);
   run;

TIP: For the LIBNAME, XML statement, you can also use a fileref that is associated with the physical location of the XML document. For example:

   filename myxml 'C:\XMLdata\my.xml';
   filename map 'C:\XMLdata\my.map';
   libname myxml xml xmlmap=map;

   proc print data=myxml.my;
   run;


What is an XMLMap File?

In order to tell the XML engine how to translate XML markup, you create a separate file that contains specific map syntax. The syntax defines the observation boundaries, variables (columns), and variable attributes to be imported from the XML document.

TIP: To display an XMLMap file (with the file name extension .map) with Microsoft Internet Explorer5, follow these steps so that the file can be viewed with Internet Explorer as XML, which provides a minimal validation for the syntax:

  1. In the Windows environment, left click an XMLMap file name to highlight the name.
  2. Press the Shift key and right click the highlighted name to display a list of choices.
  3. Select Open With...
  4. Select iexplore, and select Always use this program to open this type of file.
  5. Select OK.

Example of an XMLMap File

An example of an XMLMap file is RSS.MAP, which includes the syntax so that SAS can translate the XML document RSS.XML *.

* The XML document RSS.XML uses the XML format RSS (Rich Site Summary), which was designed by Netscape originally for exchange of content within the My Netscape Network (MNN) community. The RSS format has been widely adopted for sharing headlines and other Web content and is a good example of XML as a transmission format. For more on RSS, see Introduction to RSS.

Map Syntax Elements

The following are the XML tag names for the map syntax elements. The tag names are listed in the order in which you would typically code them in your XMLMap file:

SXLEMAP VERSION="1.0"
is the primary (root) enclosing element to contain the data set definitions. The element provides the XML well-formed constraint for the definitions. SXLEMAP has this attribute:

  • VERSION="1.0" specifies the version of the XMLMap syntax. In subsequent releases, the VERSION attribute will determine the supported syntax.

The SXLEMAP element can contain one or more TABLE elements. For example:

<SXLEMAP version="1.0">
   <TABLE name="test1">
      .
      .
      .
    </TABLE>
    <TABLE name="test2">
       .
       .
       .
     </TABLE>
</SXLEMAP>

TABLE NAME="name"
is an element to contain a data set definition. For example, <TABLE name="channel">. TABLE has this attribute:

  • NAME= specifies the name for the SAS data set. The name must be a valid SAS name and is required.

The TABLE element can contain one or more of the following elements that describe the data set attributes: TABLE_XPATH, TABLE_END_XPATH, TABLE_LABEL, and COLUMN.

TABLE_XPATH
specifies a location path* that defines where in the XML document to collect variables for the SAS data set. This is a required element. The location path determines the repeating element instances, which are translated into a collection of rows with a constant set of columns. This is the observation boundary for the data. For example:

<TABLE_XPATH> /rss/channel </TABLE_XPATH>

The above example causes SAS to do the following:

  1. SAS reads the XML markup until it encounters the begin tag <CHANNEL>.

  2. SAS clears the input buffer, sets the contents to MISSING (by default), and scans for variable names from elements. As variable values are encountered, they are read into the input buffer.

  3. When the end tag </CHANNEL> is encountered, SAS writes the completed input buffer to the SAS data set.

  4. The process would be repeated for each <CHANNEL> begin tag and </CHANNEL> end tag sequence.

NOTE: Whether SAS resets to MISSING is determined by the DEFAULT element as well as the RETAIN= attribute for the COLUMN element.

* A location path tells how to locate and access specific elements in the XML document. Specify a valid Xpath construction in conformance with the World Wide Web Consortium (W3C). Note that Xpath syntax is case sensitive. All paths must begin with the root enclosing element (denoted by a slash '/') or with the "any parent" variant (denoted by double slashes '//'). Other W3C documented forms are not currently supported.

TABLE_END_XPATH BeginEnd="Begin | End"
specifies a location path* that tells SAS when to stop processing data for the data set. This is an optional optimization feature to save processing resources. By default, processing continues until the physical end-of-file is encountered. TABLE_END_XPATH has this attribute:

  • BeginEnd="Begin | End" says to stop processing when either the begin element tag is encountered or the end element tag is encountered. This attribute is required.

For example, in the XML document RSS.XML, there is only one <CHANNEL> begin tag and one </CHANNEL> end tag. With the TABLE_XPATH location path

<TABLE_XPATH> /rss/channel </TABLE_XPATH>

SAS would process the entire XML document, even though it does not store new data into the input buffer after it encounters the first <ITEM> begin tag, because the remaining elements no longer qualify. The following tells SAS to stop processing when the <ITEM> begin tag is encountered:

<TABLE_END_XPATH BeginEnd="Begin"> /rss/channel/item </TABLE_END_XPATH>

Therefore, with the two location path specifications, SAS would process only the highlighted data in the RSS.XML document for the CHANNEL data set, rather than the entire XML document:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="0.91">
   <channel>
      <title>WriteTheWeb</title>
      <link>http://writetheweb.com</link>
      <description>News for web users that write back
         </description>
      <language>en-us</language>
      <copyright>Copyright 2000, WriteTheWeb team.
         </copyright>
      <managingEditor>editor@writetheweb.com
         </managingEditor>
      <webMaster>webmaster@writetheweb.com</webMaster>
      <image>
         <title>WriteTheWeb</title>
         <url>http://writetheweb.com/images/mynetscape88.gif
            </url>
         <link>http://writetheweb.com</link>
         <width>88</width>
         <height>31</height>
         <description>News for web users that write back
            </description>
         </image> 
      <item>
         .
         .
         .
   </channel>
</rss>

* A location path tells how to locate and access specific elements in the XML document. Specify a valid Xpath construction in conformance with the World Wide Web Consortium (W3C). Note that Xpath syntax is case sensitive. All paths must begin with the root enclosing element (denoted by a slash '/') or with the "any parent" variant (denoted by double slashes '//'). Other W3C documented forms are not currently supported.

TABLE_LABEL | TABLE_DESCRIPTION
specifies a description for the data set, which can be up to 256 characters. For example:

<TABLE_LABEL>Data Set contains TV channel information
</TABLE_LABEL>

COLUMN NAME="name" RETAIN=NO | YES
is an element to contain a variable definition. For example:

<COLUMN name="title">

COLUMN has these attributes:

  • NAME= specifies the name for the variable. The name must be a valid SAS name and is required.

  • RETAIN=NO, which is the default, sets the value to MISSING at the beginning of each observation. RETAIN=YES keeps the current value until it is replaced by a new, non-missing value.

COLUMN can contain one or more of the following elements that describe the variable attributes: DATATYPE, DEFAULT, ENUM, FORMAT, INFORMAT, LABEL, LENGTH, TYPE, and XPATH.

DATATYPE
specifies the type of data being read from the XML document for the variable. This is a required element. For example, the following specifies that the data contains alphanumeric characters:

<DATATYPE> string </DATATYPE>

The type of data specification can be:

STRING
specifies that the data contains alphanumeric characters that does not contain numbers used for calculations.

INT
specifies that the data contains whole numbers used for calculations.

FLOAT
specifies that the data contains floating point numbers.

DT-8601
specifies a date and time value that is in the ISO 8601 date and time format: yyy-mm-ddThh:mm:ss.nnnnnn.

DAT-8601
specifies a date value that is in the ISO 8601 date format: yyy-mm-dd.

TIM-8601
specifies a time value that is in the ISO 8601 time format: hh:mm:ss.nnnnnn.

DEFAULT
specifies a default value for a missing value for the variable. Use the DEFAULT element to assign a non-missing value to missing data. By default, SAS sets a missing value to MISSING. For example, by including the following, SAS will assign the value single when a missing value occurs:

<DEFAULT> single </DEFAULT>

ENUM
is an element to contain a list of valid values for the variable. ENUM can contain one or more VALUE elements to specify the values.

By using ENUM, values in the XML document are verified against the list of values. If a value is not valid, then it is either set to MISSING (by default) or set to the value specified by the DEFAULT element. Note that a value specified for DEFAULT must be one of the ENUM values in order to be valid. For example:

<COLUMN name="filing_status">
   .
   .
   .
   <DEFAULT> single </DEFAULT>
   .
   .
   .
   <ENUM>
      <VALUE> single </VALUE>
      <VALUE> married filing joint return </VALUE>
      <VALUE> married filing separate return </VALUE>
      <VALUE> head of household </VALUE>
      <VALUE> qualifying widow(er) </VALUE>
   </ENUM>
</COLUMN>

FORMAT WIDTH=w NDEC=d
specifies a SAS format for the variable. A SAS format is an instruction that SAS uses to write values. You use formats to control the written appearance of values. Do not include the period (.) as part of the format name. Specify a width and length as attributes, not as part of the format name. FORMAT has these attributes:

  • WIDTH= specifies an optional format width, which for most formats is the number of columns in the output data.

  • NDEC= specifies an optional decimal scaling factor for numeric formats.

For example,

<FORMAT> IS8601DA </FORMAT>
<FORMAT WIDTH="8"> best </FORMAT>
<FORMAT WIDTH="8" NDEC="2"> dollar </FORMAT>

INFORMAT WIDTH=w NDEC=d
specifies a SAS informat for the variable. A SAS informat is an instruction that SAS uses to read values into a variable, that is, to store the values. Do not include the period (.) as part of the informat name. Specify a width and length as attributes, not as part of the informat name. INFORMAT has these attributes:

  • WIDTH= specifies an optional informat width, which for most informats is the number of columns in the input data.

  • NDEC= specifies an optional decimal scaling factor for numeric informats. SAS divides the input data by 10 to the power of this value.

For example,

<INFORMAT> IS8601DA </INFORMAT>
<INFORMAT WIDTH="8"> best </INFORMAT>
<INFORMAT WIDTH="8" NDEC="2"> dollar </INFORMAT>

LABEL | DESCRIPTION
specifies a description for the variable, which can be up to 256 characters. For example:

<LABEL>Story link</LABEL>

LENGTH
is the incoming field length (number of bytes) from the XML data. For character data, the specified length is also the storage length for the variable. The value can be 1 to 32767. This is a required element. For example:

<LENGTH> 200 </LENGTH>

TIP: You can use LENGTH to truncate a long field.

TIP: To assign a floating point type, use <DATATYPE> float </DATATYPE>.

TIP: To apply formating, use the FORMAT element.

TYPE
specifies the SAS data type, which can be CHARACTER or NUMERIC. This is how SAS stores the data and is a required element. For example, the following specifies that the SAS data type for the variable is numeric:

<TYPE>numeric</TYPE>

XPATH
specifies a location path* that tells SAS where to locate the current tag, then performs a function as determined by the form of the location path. This is a required element. To specify the location path, use one of the following forms:

element form
accesses parsed character data from the named element. For example:

<XPATH> /rss/channel/title </XPATH>

The above example tells SAS to scan the XML markup until it finds the specific TITLE element. SAS retrieves the value between the <TITLE> begin tag and the </TITLE> end tag That is, for the TITLE variable in the CHANNEL data set, SAS would retrieve the highlighted value in the RSS.XML document:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="0.91">
   <channel>
      <title>WriteTheWeb</title>
      <link>http://writetheweb.com</link>
      <description>News for web users that write back
         </description>
      <language>en-us</language>
      <copyright>Copyright 2000, WriteTheWeb team.
         </copyright>
      <managingEditor>editor@writetheweb.com
         </managingEditor>
      <webMaster>webmaster@writetheweb.com</webMaster>
      <image>
         <title>WriteTheWeb</title>
         <url>http://writetheweb.com/images/mynetscape88.gif
            </url>
         <link>http://writetheweb.com</link>
         <width>88</width>
         <height>31</height>
         <description>News for web users that write back
            </description>
         </image>
      <item>
         .
         .
         .
   </channel>
</rss>

attribute form
accesses data from the named attribute (of the form name="value"). For example:

<XPATH> /rss@version </XPATH>

The above example tells SAS to scan the XML markup until it finds the specific RSS element. SAS retrieves the value from the VERSION attribute in the RSS element. That is, for the VERSION variable in the CHANNEL data set, SAS would retrieve the highlighted value in the RSS.XML document:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<rss version="0.91">
   <channel>
      <title>WriteTheWeb</title>
      <link>http://writetheweb.com</link>
      <description>News for web users that write back
         </description>
      <language>en-us</language>
      <copyright>Copyright 2000, WriteTheWeb team.
         </copyright>
      <managingEditor>editor@writetheweb.com
         </managingEditor>
      <webMaster>webmaster@writetheweb.com</webMaster>
      <image>
         <title>WriteTheWeb</title>
         <url>http://writetheweb.com/images/mynetscape88.gif
            </url>
         <link>http://writetheweb.com</link>
         <width>88</width>
         <height>31</height>
         <description>News for web users that write back
            </description>
         </image>
      <item>
         .
         .
         .
   </channel>
</rss>

value form
accesses parsed character data from the named element with a specific attribute value. For example:

<XPATH> /constant[@name="PI"] </XPATH>

If the XML contains the following, the above example tells SAS to scan the XML markup until it finds the specific CONSTANT element where the value of the NAME= attribute is PI. SAS would retrieve the value 3.14159.

<constant name="PI">3.14159</constant>

* A location path tells how to locate and access specific elements in the XML document. Specify a valid Xpath construction in conformance with the World Wide Web Consortium (W3C). Note that Xpath syntax is case sensitive. All paths must begin with the root enclosing element (denoted by a slash '/') or with the "any parent" variant (denoted by double slashes '//'). Other W3C documented forms are not currently supported.


Example: Importing an XML Document Using XMLMap

As explained in Why Can't SAS Import My XML Document?, the result of importing the XML document RSS.XML is not acceptable, because the XML in RSS.XML is incorrectly structured for SAS to translate successfully.

Rather than transform RSS.XML using XSL, the XML document can be successfully imported by creating an XMLMap file that tells SAS how to translate the XML markup.

Create the XMLMap File

The file RSS.MAP contains the map syntax in order to successfully import the XML document RSS.XML. The syntax tells SAS how to translate the XML markup as explained in the following table. The contents of RSS.XML will result in two SAS data sets:

Map Syntax Description
<SXLEMAP version="1.0">
Root enclosing element for SAS data set definitions.

<TABLE name="CHANNEL">
Element for the CHANNEL data set definition.

<TABLE_XPATH>/rss/channel</TABLE_XPATH>
Element specifying the location path that defines where in the XML document to collect variables for the CHANNEL data set.

<TABLE_END_XPATH
BeginEnd="Begin">/rss/channel/item
</TABLE_END_XPATH>
Element specifying the location path that tells SAS when to stop processing data for the CHANNEL data set.

<COLUMN name="title">
 <XPATH>/rss/channel/title</XPATH>
 <TYPE>character</TYPE>
 <DATATYPE>string</DATATYPE>
 <LENGTH>200</LENGTH>
</COLUMN>
Element containing the attributes for the TITLE variable in the CHANNEL data set. The Xpath construction tells SAS where to find the current tag and to access data from the named element.

<COLUMN>
   .
   .
   .
</COLUMN>
Subsequent COLUMN elements define the variables LINK, DESCRIPTION, and LANGUAGE for the CHANNEL data set.

<COLUMN name="version">
 <XPATH>/rss@version</XPATH>
 <TYPE>character</TYPE>
 <DATATYPE>string</DATATYPE>
 <LENGTH>8</LENGTH>
</COLUMN>
Element containing the attributes for the last variable in the CHANNEL data set, which is VERSION. This Xpath construction tells SAS where to find the current tag and uses the attribute form to access data from the named attribute.

<TABLE name="ITEMS">
<TABLE_XPATH>/rss/channel/item</TABLE_XPATH>
When the second-level element changes, SAS interprets a different SAS data set, which is ITEMS.

<COLUMN name="title">
 <XPATH>/rss/channel/item/title</XPATH>
 <TYPE>character</TYPE>
 <DATATYPE>string</DATATYPE>
 <LENGTH>200</LENGTH>
</COLUMN>
Element containing the attributes for the TITLE variable in the ITEMS data set.

<COLUMN>
   .
   .
   .
</COLUMN>
Subsequent COLUMN elements define other variables for the ITEMS data set, which are URL and DESCRIPTION.

Import the XML Document

The following SAS statements import the XML document RSS.XML and specify the XMLMap file RSS.MAP:

   filename map 'C:\My Documents\xml\rss.map';
   libname rss xml 'C:\My Documents\xml\rss.xml' xmlmap=map;

Verify the Import

Use the DATASETS procedure to verify the import result.

   proc datasets library=rss;

                                -----Directory-----

                      Libref:        RSS
                      Engine:        XML
                      Physical Name: C:\My Documents\xml\rss.xml
                      XMLType:       GENERIC
                      XMLMap:        MAP


                                #  Name     Memtype
                                -------------------
                                1  CHANNEL  DATA
                                2  ITEMS    DATA

View the Data Set Attributes

Use the CONTENTS procedure to view the data set attributes:

   proc contents data=rss.channel;
   run;

   proc contents data=rss.items;
   run;

Here is the PROC CONTENTS output for RSS.CHANNEL and RSS.ITEMS.


Your Turn

Questions or comments? Send electronic mail to XMLEngine@sas.com with your comments.