CDISC Dataset-XML

Overview

CDISC Dataset-XML, which is based on CDISC ODM XML, defines a standard format for transporting tabular data in XML between any two entities. In addition to supporting the transport of data sets as part of a submission to the FDA, Dataset-XML can be used to exchange data between two parties. For example, a CRO can use the Dataset-XML format to transmit SDTM or ADaM data sets to a sponsor organization. Dataset-XML supports SDTM, ADaM, and SEND data sets but can also be used to exchange any other type of tabular data set.

Dataset-XML and Define-XML

The metadata for a data set in a Dataset-XML file must conform to the Define-XML standard. Each Dataset-XML file contains data for a single data set, but a single Define-XML file describes all of the data sets included in the folder. Both Define-XML 1.0 and Define-XML 2.0 are supported for use with Dataset-XML.

Creating Dataset-XML Files from SAS Data Sets: %DATASETXML_WRITE Macro

The %DATASETXML_WRITE macro creates a Dataset-XML file from a SAS data set or from a library of SAS data sets.
Here is an example:
libname srcdata  "&studyRootPath/data";
filename srcmeta "&studyRootPath/sourcexml/define.xml"; 
libname xmldata  "&studyOutputPath/sourcexml";

%datasetxml_write(
  _cstSourceLibrary=srcdata,
  _cstOutputLibrary=xmldata,
  _cstSourceMetadataDefineFileRef=srcmeta,
  _cstCheckLengths=Y,
  _cstIndent=N,
  _cstZip=Y,
  _cstDeleteAfterZip=N
  );
In this example, each Dataset-XML file is compressed into its own ZIP file. The original Dataset-XML files are not deleted after compression.
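The effect of _cstZip=Y combined with _cstDeleteAfterZip=N can be pictured with a short Python sketch. It is not the macro's implementation: the function name zip_each_xml and the assumption that each archive is written next to its source file with a .zip extension are hypothetical and used only for illustration.

```python
import zipfile
from pathlib import Path


def zip_each_xml(folder, delete_after_zip=False):
    """Compress every .xml file in `folder` into its own ZIP archive.

    Hypothetical helper that mirrors the idea of one ZIP file per
    Dataset-XML file; it is not part of any SAS product.
    """
    archives = []
    for xml_path in sorted(Path(folder).glob("*.xml")):
        zip_path = xml_path.with_suffix(".zip")
        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            # Store only the file name, not the full directory path.
            zf.write(xml_path, arcname=xml_path.name)
        if delete_after_zip:
            # Corresponds to the idea behind _cstDeleteAfterZip=Y.
            xml_path.unlink()
        archives.append(zip_path)
    return archives
```

Calling zip_each_xml(folder) with the default delete_after_zip=False leaves the XML files in place beside their archives, which matches the behavior described for the example above.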
Instead of specifying inputs (_cstSourceLibrary and _cstSourceMetadataDefineFileRef) and outputs (_cstOutputLibrary) for the process with the parameters, you can use the more traditional SASReferences data set. These different ways of specifying parameters are demonstrated in two sample programs: create_datasetxml_standalone.sas and create_datasetxml.sas. These sample programs are located here:
sample study library directory/cdisc-datasetxml-1.0.0-1.7/programs
Note: The create_datasetxml_standalone.sas sample program does not use a SASReferences data set and writes reports only in the SAS log file.
The Define-XML file that describes the SAS data sets must contain metadata about all SAS data sets and all variables to convert. The Dataset-XML files by themselves do not have any information about the SAS data sets (name and label) or the SAS variables (name, label, data type, length, and display format). When the Dataset-XML file is converted back to SAS data sets, this information must be provided by the Define-XML file.
Here is an example of a Dataset-XML file:
<?xml version="1.0" encoding="UTF-8"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" 
  xmlns:data="http://www.cdisc.org/ns/Dataset-XML/v1.0"
  ODMVersion="1.3.2" FileType="Snapshot" FileOID="cdisc01.AE"
  PriorFileOID="www.cdisc.org.Studycdisc01-Define-XML_2.0.0" 
  CreationDateTime="2014-06-23T13:18:18"
  data:DatasetXMLVersion="1.0.0">
  <ClinicalData StudyOID="cdisc01" 
    MetaDataVersionOID="MDV.CDISC01.SDTMIG.3.1.2.SDTM.1.2">
    <ItemGroupData ItemGroupOID="IG.AE" data:ItemGroupDataSeq="1">
      ...
      <ItemData ItemOID="IT.AE.AETERM" Value="AGITATED"/>
Here is an example of a Define-XML file:
<ODM ... > 
  
  <Study OID="cdisc01">
    ...
    <MetaDataVersion OID="MDV.CDISC01.SDTMIG.3.1.2.SDTM.1.2"
      Name="Study CDISC01, Data Definitions"
      Description="Study CDISC01, Data Definitions"
      def:DefineVersion="2.0.0" def:StandardName="SDTM-IG"
      def:StandardVersion="3.1.2">
  ...
      <ItemGroupDef OID="IG.AE"
        Domain="AE" Name="AE" Repeating="Yes" IsReferenceData="No"
        SASDatasetName="AE" Purpose="Tabulation"
        def:Structure="One record per adverse event per subject"
        def:Class="EVENTS" def:ArchiveLocationID="LF.AE">
        ...
        <ItemRef ItemOID="IT.AE.AETERM" OrderNumber="6" Mandatory="Yes"/>
        ...
      <ItemDef OID="IT.AE.AETERM" Name="AETERM" DataType="text" Length="25"
        SASFieldName="AETERM">
A Dataset-XML file must satisfy these requirements:
  • The ClinicalData attributes StudyOID and MetaDataVersionOID must have the same values as the corresponding OID attributes in the define.xml document.
  • The ItemGroupOID value must be the same as the OID attribute of the corresponding ItemGroupDef element in the define.xml document.
  • All ItemOID attributes in the ItemData elements must have values identical to the values of the corresponding ItemOID attributes in the ItemRef elements that are child elements of the corresponding ItemGroupDef element in the define.xml document.
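These three requirements can be checked mechanically. The following Python sketch is one possible way to do so and is not part of the SAS macros. The function name check_oids is hypothetical, and the sketch assumes that both files use the ODM v1.3 namespace shown in the examples above (a Define-XML 1.0 file would use a different namespace). It also treats ReferenceData containers the same way as ClinicalData containers.

```python
import xml.etree.ElementTree as ET

# Assumption: both documents use the ODM 1.3 namespace.
ODM = "{http://www.cdisc.org/ns/odm/v1.3}"


def check_oids(dataset_root, define_root):
    """Return a list of OID mismatches between Dataset-XML and Define-XML."""
    problems = []
    # ItemOIDs referenced by each ItemGroupDef, keyed by (study, mdv, group).
    allowed = {}
    for study in define_root.iter(ODM + "Study"):
        for mdv in study.iter(ODM + "MetaDataVersion"):
            for group_def in mdv.iter(ODM + "ItemGroupDef"):
                key = (study.get("OID"), mdv.get("OID"), group_def.get("OID"))
                allowed[key] = {
                    ref.get("ItemOID") for ref in group_def.iter(ODM + "ItemRef")
                }
    for tag in ("ClinicalData", "ReferenceData"):
        for container in dataset_root.iter(ODM + tag):
            for group in container.iter(ODM + "ItemGroupData"):
                key = (
                    container.get("StudyOID"),
                    container.get("MetaDataVersionOID"),
                    group.get("ItemGroupOID"),
                )
                if key not in allowed:
                    problems.append(f"no ItemGroupDef for {key}")
                    continue
                for item in group.iter(ODM + "ItemData"):
                    if item.get("ItemOID") not in allowed[key]:
                        problems.append(f"no ItemRef for {item.get('ItemOID')}")
    return problems
```

An empty list means that every ItemData element could be traced to an ItemRef under the matching Study, MetaDataVersion, and ItemGroupDef.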
Do not try to derive the SAS data set name from an ItemGroup object identifier in the Dataset-XML file (ItemGroupOID="IG.AE"), or the variable name from an item object identifier (ItemOID="IT.AE.AETERM"). There is no requirement concerning the values of these identifiers, so they do not necessarily contain the names.
SAS tables are matched to the ItemGroupDef @SASDatasetName attribute (or, if this value is not specified, @Name), and SAS columns are matched to the ItemDef @SASFieldName attribute (or, if this value is not specified, @Name). SASDatasetName and SASFieldName are optional, but @Name is required, so @Name is always available.
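The matching rule can be illustrated with a small Python sketch that builds a lookup from SAS names to object identifiers. The function name oid_lookup is hypothetical, the ODM v1.3 namespace is assumed, and the sketch only shows the fallback from @SASDatasetName or @SASFieldName to @Name. It does not reproduce how the macro itself reads the metadata.

```python
import xml.etree.ElementTree as ET

# Assumption: the Define-XML document uses the ODM 1.3 namespace.
ODM = "{http://www.cdisc.org/ns/odm/v1.3}"


def oid_lookup(define_root):
    """Map (SAS table, SAS column) to (ItemGroupOID, ItemOID)."""
    # Column name per ItemDef: SASFieldName if present, otherwise Name.
    item_names = {
        item_def.get("OID"): item_def.get("SASFieldName") or item_def.get("Name")
        for item_def in define_root.iter(ODM + "ItemDef")
    }
    lookup = {}
    for group_def in define_root.iter(ODM + "ItemGroupDef"):
        # Table name: SASDatasetName if present, otherwise Name.
        table = group_def.get("SASDatasetName") or group_def.get("Name")
        for ref in group_def.iter(ODM + "ItemRef"):
            column = item_names.get(ref.get("ItemOID"))
            if column is not None:
                lookup[(table, column)] = (group_def.get("OID"), ref.get("ItemOID"))
    return lookup
```

Keying the lookup on the (table, column) pair rather than on the column alone allows an ItemDef that is shared between data sets to appear under every table that references it.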
If the ItemGroup or ItemDef is not found, the XML is generated with this pattern for @ItemGroupOID and @ItemOID:
ItemGroupOID = "IG.<table>"
ItemOID = "IT.<table>.<column>"
Although ItemGroupOID and ItemOID are generated for missing ItemGroups or ItemDefs, it is important to realize that this can lead to problems later when converting Dataset-XML files to SAS data sets. For example, when converting a Dataset-XML file into a SAS data set, ItemGroupOIDs or ItemOIDs that cannot be matched in the corresponding Define-XML file can lead to missing SAS data sets or missing SAS data set variables.
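The generated pattern is simple enough to express in a few lines. The Python function below is only an illustration of the naming pattern described above. The name fallback_oids is hypothetical and is not an identifier from the macro.

```python
def fallback_oids(table, column):
    """Build the OIDs generated for a table/column with no metadata match.

    Follows the documented pattern: IG.<table> and IT.<table>.<column>.
    """
    item_group_oid = f"IG.{table}"
    item_oid = f"IT.{table}.{column}"
    return item_group_oid, item_oid


# Example: the ADAE column AEDECOD with no matching ItemDef.
print(fallback_oids("ADAE", "AEDECOD"))
```

Because a Define-XML file is free to use other identifiers, an OID generated this way is unlikely to match anything in the metadata when the file is read back, which is the source of the problems described above.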
Warnings are written to the SAS log file and the write_results data set in the results folder.
Here is an example of the SAS log file:
WARNING: [CSTLOGMESSAGE.DATASETXML_WRITE] Columns not found in metadata:
         ADAE.AEDECOD ADAE.AETERM
WARNING: [CSTLOGMESSAGE.DATASETXML_WRITE] Missing ItemData/@ItemOID for column=AEDECOD
         has been set to IT.ADAE.AEDECOD
WARNING: [CSTLOGMESSAGE.DATASETXML_WRITE] Missing ItemData/@ItemOID for column=AETERM
         has been set to IT.ADAE.AETERM
The following display shows an example of the write_results data set as created by the create_datasetxml.sas sample program:
Example write_results Data Set
The @IsReferenceData attribute in the Define-XML file determines whether the data set is considered ReferenceData or ClinicalData. Here is an example:
<ReferenceData StudyOID="cdisc01"
    MetaDataVersionOID="MDV.CDISC01.SDTMIG.3.1.2.SDTM.1.2">
    <ItemGroupData ItemGroupOID="IG.TE" data:ItemGroupDataSeq="1">
      <ItemData ItemOID="IT.STUDYID" Value="CDISC01"/>
      <ItemData ItemOID="IT.TE.DOMAIN" Value="TE"/>
      <ItemData ItemOID="IT.TE.ETCD" Value="EOS"/>
      <ItemData ItemOID="IT.TE.ELEMENT" Value="End of Study"/>
      <ItemData ItemOID="IT.TE.TESTRL" Value="Study Termination"/>
      <ItemData ItemOID="IT.TE.TEDUR" Value="P1D"/>
    </ItemGroupData>

  <ClinicalData StudyOID="cdisc01"
    MetaDataVersionOID="MDV.CDISC01.SDTMIG.3.1.2.SDTM.1.2">
    <ItemGroupData ItemGroupOID="IG.AE" data:ItemGroupDataSeq="1">
      <ItemData ItemOID="IT.STUDYID" Value="CDISC01"/>
      <ItemData ItemOID="IT.AE.DOMAIN" Value="AE"/>
      <ItemData ItemOID="IT.USUBJID" Value="CDISC01.100008"/>
      <ItemData ItemOID="IT.AE.AESEQ" Value="1"/>
      <ItemData ItemOID="IT.AE.AESPID" Value="1"/>
      <ItemData ItemOID="IT.AE.AETERM" Value="AGITATED"/>
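The choice between the two container elements can be sketched as follows. This is a hedged illustration in Python rather than the macro's code. The function name container_tag is hypothetical, and treating a missing @IsReferenceData attribute as "No" is an assumption based on the attribute being optional.

```python
def container_tag(is_reference_data):
    """Pick the container element from ItemGroupDef/@IsReferenceData.

    Assumption: a missing attribute (None) is treated like "No".
    """
    return "ReferenceData" if is_reference_data == "Yes" else "ClinicalData"


# The TE data set (IsReferenceData="Yes") versus the AE data set ("No").
print(container_tag("Yes"), container_tag("No"))
```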
The _cstCheckLengths macro parameter enables the %DATASETXML_WRITE macro to determine whether the lengths defined in the metadata are long enough for the character data. This check is important to avoid data truncation problems when importing the Dataset-XML files into SAS data sets with the %DATASETXML_READ macro. Warnings are written to the SAS log file and the write_results data set in the results folder.
Here is an example of the SAS log file:
WARNING: [CSTLOGMESSAGE.DATASETXML_WRITE] Length too short: __ItemGroupOID=IG.ADAE
        __ItemOID=IT.ADAE.AETERM Length=20 _valueLength=24 value=HEARTBURN-LIKE DYSPEPSIA
WARNING: [CSTLOGMESSAGE.DATASETXML_WRITE] Length too short: __ItemGroupOID=IG.ADAE
        __ItemOID=IT.ADAE.AETERM Length=20 _valueLength=25 value=ACID REFLUX (OESOPHAGEAL)
WARNING: [CSTLOGMESSAGE.DATASETXML_WRITE] Length too short: __ItemGroupOID=IG.ADAE
        __ItemOID=IT.ADAE.AEDECOD Length=20 _valueLength=32 value=Gastrooesophageal reflux disease
WARNING: [CSTLOGMESSAGE.DATASETXML_WRITE] Length too short: __ItemGroupOID=IG.ADAE
        __ItemOID=IT.ADAE.AETERM Length=20 _valueLength=25 value=ACID REFLUX (OESOPHAGEAL) 
The following display shows an example of the write_results data set:
Example write_results Data Set
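The idea behind the length check can be shown with a minimal Python sketch. It is not the macro's implementation. The function name find_short_lengths and the data structures (rows as dictionaries, lengths as a dictionary of metadata lengths) are assumptions made for illustration. Note that Python's len counts characters, whereas SAS lengths are measured in bytes, so data in a multi-byte encoding would need len(value.encode(...)) instead.

```python
def find_short_lengths(rows, lengths):
    """Yield (column, defined_length, value_length, value) for long values.

    rows    -- iterable of dicts that map a column name to its value
    lengths -- dict that maps a column name to the Length in the metadata
    """
    for row in rows:
        for column, value in row.items():
            defined = lengths.get(column)
            # Only character values with a defined length are checked.
            if isinstance(value, str) and defined is not None and len(value) > defined:
                yield column, defined, len(value), value


rows = [{"AETERM": "HEARTBURN-LIKE DYSPEPSIA", "AEDECOD": "Dyspepsia"}]
for problem in find_short_lengths(rows, {"AETERM": 20, "AEDECOD": 20}):
    print("Length too short:", problem)
```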
The %DATASETXML_WRITE macro also checks that numeric variables in ADaM data sets that represent date and time information have a DisplayFormat defined in the Define-XML file.

Creating SAS Data Sets from Dataset-XML Files: %DATASETXML_READ Macro

The %DATASETXML_READ macro creates a SAS data set or a library of SAS data sets from Dataset-XML files.
Here is an example:
%datasetxml_read(
  _cstSourceDatasetXMLLibrary=dataxml,
  _cstOutputLibrary=sdtmdata,
  _cstSourceMetadataDefineFileRef=defxml,
  _cstDatetimeLength=64,
  _cstAttachFormats=Y,
  _cstNumObsWrite=10000
  );
Instead of specifying inputs (_cstSourceDatasetXMLLibrary and _cstSourceMetadataDefineFileRef) and outputs (_cstOutputLibrary) for the process with parameters, you can use the more traditional SASReferences data set. These different ways of specifying parameters are demonstrated in two sample programs: create_sas_from_datasetxml_standalone.sas and create_sas_from_datasetxml.sas. These sample programs are located here:
sample study library directory/cdisc-datasetxml-1.0.0-1.7/programs
Note: The create_sas_from_datasetxml_standalone.sas sample program does not use a SASReferences data set and writes reports only in the SAS log file.
The Define-XML file that describes the Dataset-XML files must contain metadata information about all Dataset-XML files and all variables to convert to SAS data sets. The Dataset-XML files by themselves do not have any information about the SAS data sets (name and label) or the SAS variables (name, label, data type, length, and display format).
Character variables that represent date- and time-related information in ADaM or SDTM data conform to the ISO 8601 standard and do not have a length specified in the Define-XML file. The _cstDatetimeLength parameter specifies the length to use for these variables when they are converted to SAS data sets. If the lengths of character variables are too short to hold the data, warnings are written to the SAS log file and the read_results data set in the results folder.
Here is an example of the SAS log file:
WARNING: [CSTLOGMESSAGE.DATASETXML_READ] TRUNCATION occurred: Length=20 too short for
         ItemGroupDataSeq=12 IT.ADAE.AETERM value=HEARTBURN-LIKE DYSPEPSIA (length=24)
WARNING: [CSTLOGMESSAGE.DATASETXML_READ] TRUNCATION occurred: Length=20 too short for
         ItemGroupDataSeq=25 IT.ADAE.AETERM value=HEARTBURN-LIKE DYSPEPSIA (length=24)
WARNING: [CSTLOGMESSAGE.DATASETXML_READ] TRUNCATION occurred: Length=20 too short for
         ItemGroupDataSeq=28 IT.ADAE.AETERM value=ACID REFLUX (OESOPHAGEAL)
WARNING: [CSTLOGMESSAGE.DATASETXML_READ] TRUNCATION occurred: Length=20 too short for
         ItemGroupDataSeq=28 IT.ADAE.AEDECOD value=Gastrooesophageal reflux disease (length=32) 
The following display shows an example of the read_results data set as created by the create_sas_from_datasetxml.sas sample program:
Example read_results Data Set
Inconsistencies between the Dataset-XML file and the Define-XML file, which can lead to issues with matching data to metadata, are written to the SAS log file and the read_results data set in the results folder.
Here is an example of the SAS log file:
WARNING: [CSTLOGMESSAGE.DATASETXML_READ] Items not found in metadata:
         IT.ADAE.AEDECOD IT.ADAE.AETERM  
The following display shows an example of the read_results data set:
Example read_results Data Set
In the following example, the ADAE data set is created without the AETERM and AEDECOD variables, as shown in this PROC COMPARE output:
Dataset                 Created          Modified  NVar    NObs  Label

DATABASE.ADAE  23JUN14:09:01:29  23JUN14:09:01:29    61     106  Adverse Event Analysis Dataset
DATACOMP.ADAE  02JUL14:12:58:14  02JUL14:12:58:14    59     106  Adverse Event Analysis Dataset


Variables Summary

Number of Variables in Common: 59.
Number of Variables in DATABASE.ADAE but not in DATACOMP.ADAE: 2.


Listing of Variables in DATABASE.ADAE but not in DATACOMP.ADAE

Variable  Type  Length  Label

AETERM    Char     200  Reported Term for the Adverse Event
AEDECOD   Char     200  Dictionary-Derived Term
The sample programs create_sas_from_datasetxml.sas and create_sas_from_datasetxml_standalone.sas also contain a call to the %CSTUTILCOMPAREDATASETS macro. This call compares the original SAS data sets to the SAS data sets that were created from the Dataset-XML files.
%cstutilcomparedatasets(
  _cstLibBase=sdtmdat0,   
  _cstLibComp=sdtmdata, 
  _cstCompareLevel=0, 
  _cstCompOptions=%str(criterion=0.00000000000001),
  _cstCompDetail=Y
  );
For every SAS data set that is compared, the macro reports the error code as returned by PROC COMPARE. The following table shows the error codes:
Error Codes Returned by PROC COMPARE
For example, code=40 (8+32) indicates that a format and a label are different. This message is written to the SAS log file:
WARNING: [CSTLOGMESSAGE.CSTUTILCOMPAREDATASETS] Comparing srcdata.adqs and 
         trgdata.adqs  - Differences: FORMAT/LABEL (SysInfo=40)
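Because the return code is a sum of powers of two, it can be decoded bit by bit. The Python sketch below shows the idea. The function name decode_sysinfo is hypothetical, and the bit assignments are the ones given in the SAS documentation for the PROC COMPARE SYSINFO macro variable, so check them against the table above before relying on them.

```python
# Bit values of the PROC COMPARE return code (SYSINFO), per SAS documentation.
SYSINFO_BITS = [
    (1, "DSLABEL"), (2, "DSTYPE"), (4, "INFORMAT"), (8, "FORMAT"),
    (16, "LENGTH"), (32, "LABEL"), (64, "BASEOBS"), (128, "COMPOBS"),
    (256, "BASEBY"), (512, "COMPBY"), (1024, "BASEVAR"), (2048, "COMPVAR"),
    (4096, "VALUE"), (8192, "TYPE"), (16384, "BYVAR"), (32768, "ERROR"),
]


def decode_sysinfo(code):
    """Return the names of all difference flags that are set in `code`."""
    return [name for bit, name in SYSINFO_BITS if code & bit]


# 40 = 8 + 32, so the FORMAT and LABEL flags are set.
print("/".join(decode_sysinfo(40)))
```

A code of 0 decodes to an empty list, which means that PROC COMPARE found no differences.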
When you convert SAS data sets to Dataset-XML and then convert them back to SAS data sets, expect these differences:
  • Date- and time-related columns do not have a length defined in the Define-XML metadata.
    Specify a length to assign in the macro (for example, _cstDatetimeLength=64). The length is used to create the date- and time-related columns but can be different from the original lengths.
  • SAS numeric variables are created with a length of 8 to avoid loss of precision, even when the original length or the length specified in the Define-XML file is less than 8.
  • Character variables (DataType="text") that do not have a length specified in the Define-XML file are created with a length of 200.
  • Small differences in precision can be expected around the machine precision for numeric variables that represent real numbers.
  • Character data that contains leading spaces or trailing spaces loses the leading and trailing spaces.
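The length and whitespace rules in this list can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the macro's logic. The names sas_length and roundtrip_text are hypothetical, "integer" and "float" are assumed to be the numeric data types, and every remaining type without a length is assumed to be date- or time-related.

```python
NUMERIC_TYPES = {"integer", "float"}  # assumption: the numeric Define-XML types
DEFAULT_TEXT_LENGTH = 200


def sas_length(data_type, define_length=None, datetime_length=64):
    """Return the length a variable is expected to have after the round trip."""
    if data_type in NUMERIC_TYPES:
        # Numeric variables are always created with a length of 8.
        return 8
    if define_length is not None:
        return define_length
    if data_type == "text":
        return DEFAULT_TEXT_LENGTH
    # Assumption: any other type without a length is date- or time-related.
    return datetime_length


def roundtrip_text(value):
    """Character data loses its leading and trailing spaces."""
    return value.strip()


print(sas_length("float", 4), sas_length("text"), sas_length("datetime"))
```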
With the _cstCompOptions parameter, you can pass PROC COMPARE options that make the comparison less strict, for example _cstCompOptions=%str(criterion=0.00000000000001). The looser criterion prevents differences close to machine precision from being reported as errors.
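The effect of a comparison criterion can be demonstrated outside SAS with a small Python sketch. It uses a plain relative tolerance and does not reproduce the exact formula that PROC COMPARE applies, which depends on its METHOD= option. The function name differs is hypothetical.

```python
import math


def differs(base, comp, criterion=1e-14):
    """Return True when two numbers differ by more than a relative tolerance."""
    return not math.isclose(base, comp, rel_tol=criterion, abs_tol=0.0)


# 0.1 + 0.2 is not exactly 0.3 in binary floating point ...
print(0.1 + 0.2 == 0.3)
# ... but the difference is far below the 1e-14 criterion.
print(differs(0.1 + 0.2, 0.3))
```

An exact comparison reports the two values as different, while the tolerance-based comparison treats them as equal, which is the kind of noise that the criterion is meant to suppress.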
The following display shows an example of data set differences reported in the read_results data set:
Example read_results Data Set