Importing XML Documents Using an XMLMap |
This example explains how to create and use an XMLMap in order to tell the XML engine how to map XML markup to a SAS data set, variables, and observations.
First, here is the XML document NHL.XML to be imported. Although simply constructed and relatively easy for you to read, it does not import successfully because its XML markup is not in the required physical structure:
<?xml version="1.0" encoding="iso-8859-1" ?> <NHL> <CONFERENCE> Eastern <DIVISION> Southeast <TEAM name="Thrashers" abbrev="ATL" /> <TEAM name="Hurricanes" abbrev="CAR" /> <TEAM name="Panthers" abbrev="FLA" /> <TEAM name="Lightning" abbrev="TB" /> <TEAM name="Capitals" abbrev="WSH" /> </DIVISION> </CONFERENCE> <CONFERENCE> Western <DIVISION> Pacific <TEAM name="Stars" abbrev="DAL" /> <TEAM name="Kings" abbrev="LA" /> <TEAM name="Ducks" abbrev="ANA" /> <TEAM name="Coyotes" abbrev="PHX" /> <TEAM name="Sharks" abbrev="SJ" /> </DIVISION> </CONFERENCE> </NHL>
To successfully import the XML document, an XMLMap is needed. After familiarizing yourself with the data to be imported, you can code the XMLMap syntax so that the data is successfully imported. Here is the XMLMap used to import the XML document, with notations for the data investigation:
<?xml version="1.0" ?> <SXLEMAP version="1.2"> <TABLE name="TEAMS"> 1 <TABLE-PATH syntax="XPATH"> 2 /NHL/CONFERENCE/DIVISION/TEAM </TABLE-PATH> <COLUMN name="NAME"> 3 <PATH> 5 /NHL/CONFERENCE/DIVISION/TEAM@name </PATH> <TYPE>character</TYPE> <DATATYPE>STRING</DATATYPE> <LENGTH>30</LENGTH> </COLUMN> <COLUMN name="ABBREV"> 3 <PATH> 5 /NHL/CONFERENCE/DIVISION/TEAM/@abbrev </PATH> <TYPE>character</TYPE> <DATATYPE>STRING</DATATYPE> <LENGTH>3</LENGTH> </COLUMN> <COLUMN name="CONFERENCE" retain="YES"> 4 <PATH>/NHL/CONFERENCE</PATH> 5 <TYPE>character</TYPE> <DATATYPE>STRING</DATATYPE> <LENGTH>10</LENGTH> </COLUMN> <COLUMN name="DIVISION" retain="YES"> 4 <PATH> 5 /NHL/CONFERENCE/DIVISION </PATH> <TYPE>character</TYPE> <DATATYPE>STRING</DATATYPE> <LENGTH>10</LENGTH> </COLUMN> </TABLE> </SXLEMAP>
The previous XMLMap syntax defines how to translate the XML markup as explained below, using the following data investigation steps:
Locate and identify distinct tables of information.
You want a SAS data set (table) that contains some of the teams of the National Hockey League. Because that is the only information contained in the XML document, you can define a single data set named TEAMS in the XMLMap. (Note that other XML documents might contain more than one table of related information. Importing multiple tables is supported by the XMLMap syntax as shown in Using an XMLMap to Import an XML Document as Multiple SAS Data Sets.)
Identify the SAS data set observation boundary, which translates into a collection of rows with a constant set of columns.
In the XML document, information about individual teams occurs in a <TEAM> tag located with <CONFERENCE> and <DIVISION> enclosures. You want a new observation generated each time a TEAM element is read.
Collect column definitions for each table.
For this XML document, the data content form is mixed. Some data occurs as XML PCDATA (for example, CONFERENCE), and other data is contained in attribute-value pairs (for example, NAME). Data types are all string values. The constructed observation will also include the team NAME and ABBREV. A length of 30 characters is sufficient for the NAME, and three characters is enough for the ABBREV field contents.
Add foreign keys or required external context.
You want to include information about the league orientation for the teams. Also, you want to extract CONFERENCE and DIVISION data.
Note: The retain= attribute in the column definition forces retention of processed data values after an observation is written to the output data set. Because the foreign key fields occur outside the observation boundary (that is, they are more sparsely populated in the hierarchical XML data than in the SAS observation), their values for additional rows need to be retained as they are encountered.
Define a location path for each variable definition.
The PATH element identifies a position in the XML document from which to extract data for each column. Element-parsed character data is treated differently than attribute values. There is no conditional selection criteria involved.
The following SAS statements import the XML document NHL.XML and specify the XMLMap named NHL.MAP. The PRINT procedure verifies that the import is successful:
filename NHL 'C:\My Documents\XML\NHL.xml'; filename MAP 'C:\My Documents\XML\NHL.map'; libname NHL xml xmlmap=MAP; proc print data=NHL.TEAMS; run;
PROC PRINT of Data Set NHL.TEAMS
The SAS System 1 Obs NAME ABBREV CONFERENCE DIVISION 1 Thrashers ATL Eastern Southeast 2 Hurricanes CAR Eastern Southeast 3 Panthers FLA Eastern Southeast 4 Lightning TB Eastern Southeast 5 Capitals WSH Eastern Southeast 6 Stars DAL Western Pacific 7 Kings LA Western Pacific 8 Ducks ANA Western Pacific 9 Coyotes PHX Western Pacific 10 Sharks SJ Western Pacific
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.