Instructions for Implementing Duplicate-Data Checking

Overview

The instructions for implementing duplicate-data checking vary depending on how you have implemented processing. The following sections in this appendix provide detailed implementation instructions for various processing scenarios. See the following topics that correspond with how you are implementing processing:
Note: Duplicate-data-checking macros are designed to prevent the same data from being processed into the IT data mart twice. However, there might be times when you need to backload data, that is, to process some data that is in a datetime range for which the permanent control data sets have already recorded machine or system data. (For example, you might need to process data into one or more tables that you did not use earlier, or to process data into one or more tables that you accidentally purged or deleted.) Remember to restore the DUPMODE=FORCE or FORCE=YES setting to its original value after you finish that task.

Implementation for Supplied Adapters That Are MXG Based

Overview

Use the following instructions if you process data with a supplied adapter that is based on MXG code. Currently, these adapters include IBM DCOLLECT, IBM EREP, BMC Mainview IMS, DT Perf Sentry with MXG, IBM SMF, ASG TMON2CIC, ASG TMONDB2, CA TMS, IBM TPF, or IBM VMMON.
Here is an overview of the preparation for duplicate-data checking for this case:
  1. On the Staging Parameters tab of the staging transformation, set the Duplicate checking option to FORCE, DISCARD, or TERMINATE.
  2. Create a member or file that contains a call to the %RMDUPCHK macro.
  3. Review that member or file and modify it, if necessary. The staging transformation automatically inserts calls to the %RMDUPINT, %RMDUPDSN, %RMDUPCHK, and %RMDUPUPD macros.
The following sections describe the preceding steps in detail.

Set the Duplicate Checking Parameter on the Staging Parameters Tab

On the Staging Parameters tab of the staging transformation, set the Duplicate checking option to TERMINATE, DISCARD, or FORCE. Each of these values indicates that you do want to use duplicate-data checking. The particular value that you specify indicates how you want duplicate data to be handled.
TERMINATE
stops processing if duplicate data is encountered.
DISCARD
continues processing while rejecting duplicate data if it is encountered.
FORCE
continues processing and accepts duplicate data if it is encountered.
Note: Use this setting only when loading data that is not duplicate, even though it might be in the same datetime range and for the same machine or system as data that was loaded previously.

Create the Member or File

The MXG.SRCLIB is the primary MXG source library. The MXGUSER.SRCLIB directory is initially empty. It stores members that the user copies from MXSG.SRCLIB for customization.
Note: On z/OS, the MXGUSER.SRCLIB and the MXG.SRCLIB directories are created when SAS IT Resource Management is installed. If you are working with MXG on Windows or UNIX, you must install MXG separately and take note of its installed locations. This document will continue to use the terms MXG.SRCLIB, for the primary MXG source library, and MXGUSER.SRCLIB for the folder where you store your customizations.
The default call to the %RMDUPCHK macro is contained in a different member or file on each operating system. The following list shows how to enable duplicate-data checking for the IBM SMF adapter.
  • On UNIX, copy the dup_smf.sas file from the SAS IT Resource Management misc directory to the MXGUSER.SOURCLIB directory. Keep the same filename.
  • On Windows, copy the dup_smf.sas file from the SAS IT Resource Management sasmisc directory to the MXGUSER.SRCLIB directory. Keep the same filename.
  • On z/OS, copy the CPDUPSMF member from the SAS IT Resource Management CPMISC PDS to the MXGUSER.SRCLIB PDS. Keep the same member name.
    The following table lists the names of the members to copy from the z/OS CPMISC PDS. Files with similar names are in the misc directory on UNIX and the sasmisc directory on Windows.
    Find your adapter and the version of SAS (such as 9.1.3, or later) that you are using. That combination of adapter and version has one or more members. Copy the member or members from the appropriate location to the MXGUSER.SRCLIB PDS or directory. Keep the same member names.
    Each Adapter's CMDUPxxx Members and CPDUPxxx Members
    Adapter
    Members
    IBM DCOLLECT
    CMDUPDCO
    IBM EREP
    CMDUPERP, CMDUPER1, CMDUPER2
    BMC Mainview IMS
    CMDUPIMF
    DT Perf Sentry with MXG
    CMDUPNTS
    IBM SMF
    CPDUPSMF
    ASG TMONDB2
    CMDUPTMD
    ASG TMON2CIC
    CMDUPTM2
    CA TMS
    CMDUPTMS
    IBM TPF
    CMDUPTPF
    IBM VMMON
    CMDUPVMM

Modify the Member or File

The new member (on z/OS) or new file (on UNIX or Windows) contains a call to the %RMDUPCHK macro. The call is similar to this one for IBM SMF:
   %RMDUPCHK(SOURCE=SMF, 
            IDVAR=SYSTEM, 
            TIMESTMP=SMFTIME,
            ENDFILE=ENDOFSMF,
            INT=00:29, 
            SYSTEMS=3,
            RANGES=20,
            KEEP=9
            );
The following list describes the parameters and values that are used in this example:
  • SOURCE=identifier
    a three-character identifier of the data source or adapter. This value must be set to SMF for IBM SMF data. For a list of the SOURCE= names for the supported adapters, see %RMDUPCHK Required Arguments.
  • IDVAR=variable-name
    identifies the SAS variable that is used by MXG to denote the origin of each IBM SMF record. This value must be set to SYSTEM for the IBM SMF adapter, because that is the name of the variable that is used by MXG.
  • TIMESTMP=SAS-timestamp-variable
    identifies the SAS variable that is used by MXG to record the datetime stamp of each IBM SMF record. The variable name must be SMFTIME for the IBM SMF adapter, because that is the name of the timestamp variable that is used by MXG.
  • ENDFILE=variable-name
    identifies the name of the SAS variable that is used by MXG in the INFILE statement to denote the end-of-file condition for the IBM SMF data. The variable name must be ENDOFSMF for the IBM SMF adapter.
  • INT=interval
    represents the maximum interval that is permitted between the datetime stamps on any two consecutive IBM SMF records from the same z/OS system. If the interval between the datetime stamp values exceeds the value of this parameter, then a new range is created. (See the RANGES= parameter for more information.) For IBM SMF data, the default value for the INT= parameter is 29 minutes.
  • SYSTEMS=number-of-systems
    represents an estimate of the maximum number of z/OS systems for which your file contains data. For IBM SMF, the default value indicates a maximum of three systems.
  • RANGES=number-of-ranges
    represents the maximum number of ranges that can occur during this execution of the IBM SMF adapter. A new range is created when the difference between the datetime stamps of two consecutive records exceeds the value of the INT= parameter. This break is referred to as a gap in the data. For IBM SMF, the default value indicates a maximum of 20 ranges. This means that there can be no more than 19 gaps larger than 29 minutes (the value of the INT= parameter).
  • KEEP=number-of-weeks
    represents the maximum number of weeks for which you want to retain control data. A range is aged out (removed) when the end-of-range datetime is older than permitted by the value of this parameter. For IBM SMF, the default value is nine weeks.
For a complete description of these and other parameters for this macro, see %RMDUPCHK.
Review and modify your adapter's default call as follows:
  • Review the values of the INT=, SYSTEMS=, RANGES=, and KEEP= parameters, and change those values to be appropriate for your site.
  • Do not use the TERM= parameter or FORCE= parameter. The default settings are TERM=NO and FORCE=NO. (If the Duplicate checking value on the Staging Parameters tab of the staging transformation is TERMINATE, then termination occurs when duplicate data is encountered and the setting of TERM=YES on %RMDUPCHK is not required. Likewise, if the Duplicate checking value on the Staging Parameters tab of the staging transformation is FORCE, then duplicate data is processed without requiring the setting of FORCE=YES on %RMDUPCHK.)
  • Do not change any of the other parameters or values in your call.

Implementation for Supplied Adapters That Are Not MXG Based

Overview

Use the following instructions if you process data with a supplied adapter that is not based on MXG code and whose data does not reside in a database or in a SAS library. (Currently, the adapters that are not MXG based include HP Perf Agent, HP Reporter, DT Perf Sentry, BMC Perf Mgr, MS SCOM, SAP ERP, SAR, SNMP, VMware vCenter, VMware Data Acquisition, and Web Log.)
Here is an overview of the preparation for duplicate-data checking for this case:
  1. If your data does not reside on a database, on the Staging Parameters tab of the staging transformation, set the Duplicate checking option to FORCE, DISCARD, or TERMINATE.
  2. For any data for the above adapters, create an entry that contains a call to the %RMDUPCHK macro.
  3. For any data for the above adapters, review that member or file and modify it, if necessary. The staging transformation automatically inserts calls to the %RMDUPINT, %RMDUPDSN, %RMDUPCHK, and %RMDUPUPD macros.
The following sections describe the preceding steps in detail.

Set the Duplicate Checking Parameter on the Staging Parameters Tab

For only those adapters whose data does not reside on a database, on the Staging Parameters tab of the staging transformation, set the Duplicate checking option to TERMINATE, DISCARD, or FORCE. Each of these values indicates that you do want to use duplicate-data checking. The particular value that you specify indicates how you want duplicate data to be handled.
TERMINATE
stops processing if duplicate data is encountered.
DISCARD
continues processing while rejecting duplicate data if it is encountered.
FORCE
continues processing and accepts duplicate data if it is encountered.
Note: Use this setting only when loading data that is not duplicate, even though it might be in the same datetime range and for the same machine or system as data that was loaded previously.
Note: The SNMP adapter requires that duplicate checking be turned on. This setting is necessary because neither method of gathering raw data for SNMP (HPNNM and rrdtool) can ensure that only the most recent raw data is being saved. Therefore, invoking the duplicate-data-checking code of SAS IT Resource Management is the only way to determine what is new data and what is duplicate data. For best results, the value for this parameter should always be set to Discard.

Create the Entry

The default call to the %RMDUPCHK macro is contained in the SASHELP library in a catalog entry named SASHELP.ITMS.<adapter-name>_RMDUPCHK.SOURCE.
The staging code looks for the call in the following location:
Source Locations for Each Adapter
ADAPTER
%RMDUPCHK Location
HP Perf Agent
<staging-library>.HPOVPA.RMDUPCHK.SOURCE
HP Reporter
<staging-library>.HPOVREP.RMDUPCHK.SOURCE
DT Perf Sentry
<staging-library>.NTSMF.RMDUPCHK.SOURCE
BMC Perf Mgr
<staging-library>.PATROL.RMDUPCHK.SOURCE
SAP ERP
<staging-library>.SAP.RMDUPCHK.SOURCE
SAR
<staging-library>.SAR.RMDUPCHK.SOURCE
MS SCOM
<staging-library>.SCOM.RMDUPCHK.SOURCE
SNMP
<staging-library>.SNMP.RMDUPCHK.SOURCE
VMware vCenter
<staging-library>.VMWARE.RMDUPCHK.SOURCE
Web Log
<staging-library>.WEBLOG.RMDUPCHK.SOURCE
To copy the default entry in SASHELP to the appropriate location in the staging library for an adapter, perform the following steps:
  1. Start an interactive SAS session on the system where the SAS Workspace Server executes.
  2. Allocate the staging library. You can determine the LIBNAME statement from SAS Management Console or SAS Data Integration Studio. To do so, right-click the staging library and select View LIBNAME. You can use the Copy and Paste functions to copy the LIBNAME statement that is subsequently displayed and use it in your SAS session.
  3. Submit the following code from the SAS Program Editor window:
    PROC CATALOG CAT=SASHELP.ITMS;
         COPY OUT=staging-library.adapter-name;
         
      SELECT adapter-name_RMDUPCHK.SOURCE;
      RUN; 
    QUIT;
    
    PROC CATALOG CAT=staging-library.adapter-name;
         
      CHANGE adapter-name_RMDUPCHK.SOURCE=RMDUPCHK.SOURCE;
         
      RUN;
    QUIT;
    Note: In the preceding code, replace staging-library with the libref submitted above. In addition, replace adapter-name with the adapter mnemonic listed in the %RMDUPCHK Location column of the Source Locations for Each Adapter (for example HPOVPA, HPOVREP, and so on).
Note: If the staging code is executed and there is no such entry in the staging library, then the staging code copies the default entry that is in SASHELP into the appropriate location in the staging library and displays a warning in the SAS log that notifies you of this action.

Modify the Entry

The new catalog entry contains a call to the %RMDUPCHK macro. To see the call, issue the following SAS command from a SAS command line (on z/OS) or from the command field on the SAS GUI (on Windows) or SAS Toolbar (on UNIX):
NOTE staging-library.adapter-name.RMDUPCHK.SOURCE
The command opens a Notepad window where the contents of the source entry are displayed. For example, this is the call to %RMDUPCHK for SAP ERP data:
 
   %RMDUPCHK
            (SOURCE=SAP,
            IDVAR=SYSHOST,
            TIMESTMP=DATETIME,
            ENDFILE=DUPEND,
            INT=00:20,
            RANGES=10,
            SYSTEMS=10,
            KEEP=52 
            );
The following list describes the parameters and values that are used in this example:
  • SOURCE=identifier
    is a three-character identifier of the data source or adapter. This value must be set to SAP for the SAP ERP data. For a list of the SOURCE= names for the supported adapters, see %RMDUPCHK Required Arguments.
  • IDVAR=variable-name
    identifies the SAS variable that is used to denote the origin of each SAP ERP record. The variable name must be SYSHOST for the SAP ERP adapter and must not be changed.
  • TIMESTMP=SAS-timestamp-variable
    identifies the SAS variable that is used to record the datetime stamp of each SAP ERP record. The variable name must be DATETIME for the SAP ERP adapter and must not be changed.
  • ENDFILE=variable-name
    identifies the name of the SAS variable that is used in the INFILE statement to denote the end-of-file condition for the SAP ERP data. The variable name must be DUPEND for the SAP ERP adapter and must not be changed.
  • INT=interval
    represents the maximum interval that is permitted between the datetime stamps on any two consecutive SAP ERP records from the same machine and host combination. If the interval between the timestamp values exceeds the value of this parameter, then a new time range is created. (See the RANGE= parameter for more information.) In this example, the specified value is 20 minutes.
  • RANGES=number-of-ranges
    represents the maximum number of ranges that can occur during this execution of the SAP ERP adapter. A new range is created when the difference between the datetime stamps of two consecutive records exceeds the value of the INT= parameter. This break is referred to as a gap in the data. For the SAP ERP adapter, the default value indicates a maximum of 10 ranges. This means there can be no more than 9 gaps greater than 20 minutes (the value of the INT= parameter).
  • SYSTEMS=number-of-systems
    represents an estimate of the maximum number of machine and host combinations for which you expect to process data. The default value indicates a maximum of 10 systems.
  • KEEP=number-of-weeks
    represents the maximum number of weeks for which you want to retain control data. A range is aged out (removed) when the end-of-range datetime is older than permitted by the value of this parameter. For SAP ERP data, the default value is 52 weeks.
For a complete description of these and other parameters for this macro, see %RMDUPCHK.
Review and modify your adapter's default call as follows:
  • Review the values of the INT=, SYSTEMS=, RANGES=, and KEEP= parameters, and change those values to be appropriate for your site.
  • Do not use the TERM= parameter or FORCE= parameter. The default settings are TERM=NO and FORCE=NO. (If the Duplicate checking value on the Staging Parameters tab of the staging transformation is TERMINATE, then termination occurs when duplicate data is encountered and the setting of TERM=YES on %RMDUPCHK is not required. Likewise, if the Duplicate checking value on the Staging Parameters tab of the staging transformation is FORCE, then duplicate data is processed without requiring the setting of FORCE=YES on %RMDUPCHK.)
  • Do not change any of the other parameters or values in your call.
  • When you are finished, press the END key or issue the END command from the command line to save the change and exit from the Notepad window. Then, you can exit the SAS session.

Implementation for User-Written Staging Code That Is Based on MXG Code

Overview

Use these instructions if you process data with user-written staging code that is based on MXG code.
Note: The following example demonstrates how to modify the MXG code that processes IBM DCOLLECT data. This example is for illustrative purposes only, and it uses IBM DCOLLECT data as a sample. In actual use, IBM DCOLLECT is a supported adapter, and SAS IT Resource Management modifies the MXG code as needed to process IBM DCOLLECT data. Thus, if you are using IBM DCOLLECT data in a real scenario, no MXG customization is necessary and you can use the instructions in the section Implementation for Supplied Adapters That Are MXG Based.
Here is an overview of the preparation for duplicate-data checking in this case:
  1. Create a member or file for a call to the %RMDUPCHK macro.
  2. In that member or file, add the %RMDUPCHK call and modify the call as necessary.
  3. Insert calls and include a member or file:
    • Insert a call to the %RMDUPINT, macro in the MXG code.
    • Insert a call to the %RMDUPDSN macro in the MXG code.
    • Include into the MXG code the member or file that contains the call to the %RMDUPCHK macro.
    • Add a call to the %RMDUPUPD macro at the end of your code.
The following sections describe the preceding steps in detail.

Create the Member or File

The following list details three hosts and the name of the member or file that is required in MXGUSER.SRCLIB for each. If the appropriate member or file does not already exist in MXGUSER.SRCLIB, create it and specify the name that corresponds with your host system.
z/OS Specifics: Create member $$GENxxx.
UNIX Specifics: Create file gen_xxx.sas.
Windows Specifics: Create file gen_xxx.sas.
Note: The value xxx in the preceding examples represents a three-character identifier for your data source or adapter. You can choose any identifier as long as it is not one of the identifiers on the list of values for the SOURCE= parameter in %RMDUPCHK. For example, a three-character identifier for IBM DCOLLECT might be “DCO.”

Modify the Member or File

In the member or file created or used in the preceding step, add a call to %RMDUPCHK.
Here is an example for IBM DCOLLECT:
   %RMDUPCHK(SOURCE=DCO,IDVAR=DCUSYSID,TIMESTMP=DCUTMSTP,
             ENDFILE=EOF,INT=25:00,SYSTEMS=2,
             RANGES=3,KEEP=9,
             FORCE=NO,TERM=NO
             );
The following list describes the parameters and values that are used in this example:
  • SOURCE=identifier
    specifies a three-character identifier of the data source or adapter. In this example, “DCO” is the value that is used for IBM DCOLLECT data. For a list of the SOURCE= names for the supported adapters, see %RMDUPCHK Required Arguments.
  • IDVAR=variable-name
    specifies the name of the SAS variable that is used by MXG to denote the origin of each record. This example uses the value DCUSYSID, because that is the variable name that is used in VMACDCOL for IBM DCOLLECT data.
  • TIMESTMP=timestamp-variable-name
    specifies the name of the SAS variable that is used by MXG that contains the timestamp of each IBM DCOLLECT record. The variable name must be set to DCUTMSTP for IBM DCOLLECT data, because that is the name used in VMACDCOL. Note that every IBM DCOLLECT record will have the same timestamp because it represents the time when the IDCAMS IBM DCOLLECT facility actually output the records. As shown in this example, you should change the value of the INT= parameter to accommodate this difference.
  • ENDFILE=variable-name
    specifies the SAS variable that represents the end-of-file condition for your source data. For IBM DCOLLECT data, this value must be set to EOF, because that is the name of the SAS variable that is used by MXG in the INFILE statement in VMACDCOL in order to indicate the end-of-file condition.
  • INT=interval
    represents the maximum interval that is allowed between the datetime stamps on any two consecutive IBM DCOLLECT records from the same z/OS system. If the interval between the datetime stamp values exceeds the value of this parameter, then a new range is created. (See the RANGES= parameter for more information.) In this example, the specified value is 25 hours, because the timestamp for each IBM DCOLLECT record is the same. In effect, this enables the daily IBM DCOLLECT facility to run up to one hour late each day.
  • SYSTEMS=number-of-systems
    represents an estimate of the maximum number of z/OS systems for which your file contains data. In this example, the value indicates that data contains values from no more than two systems. Therefore, the DCUSYSID variable (which is specified in the preceding example as the value of the IDVAR= parameter) should contain no more than two values.
  • RANGES=number-of-ranges
    represents the maximum number of ranges that can occur while this data is being read. A new range is created when the difference between the datetime stamps of two consecutive records exceeds the value of the INT= parameter. This break is referred to as a gap in the data. In this example, the specified value is a maximum of three. This means there can be no more than two gaps greater than 25 hours (the value of the INT= parameter).
  • KEEP=number-of-weeks
    represents the maximum number of weeks for which you want to retain control data. A range is aged out (removed) when the end-of-range datetime is older than allowed by the value of this parameter. In this example, the specified number of weeks is nine.
  • FORCE=NO | YES
    FORCE=NO indicates that duplicate data should not be processed into the IT data mart.
  • TERM=NO | YES
    TERM=NO indicates that processing should not stop if duplicate data is encountered.
For a complete description of these and other parameters and appropriate values for each adapter, see %RMDUPCHK.
Review and modify your calls using the following guidelines and recommendations:
  • For the SOURCE=value, specify the three-character identifier that you chose earlier.
  • For the ENDFILE=value, refer to the MXG staging code for your adapter.
  • For the IDVAR=value, refer to the MXG staging code for your adapter.
  • For the TIMESTMP=value, refer to the MXG staging code for your adapter.
  • Review the INT=, SYSTEMS=, RANGES=, and KEEP= values and change them to be appropriate for your site.
  • Review the TERM= and FORCE= values and change them to be appropriate for your purpose. (Note that the combination TERM=YES, FORCE=YES is not meaningful.)

Insert Calls and Include Member or File

This part of the implementation process involves inserting the call to %RMDUPINT, inserting the call to %RMDUPDSN, and including the member or file that contains the call to %RMDUPCHK. These instructions depend on whether you are using the IMACxxxx or IMACKEEP member to specify your data source customizations.
If you are using the IMAC member to specify your data source customizations, perform the following steps:
  1. Insert a call to %RMDUPINT.
    1. In MXG.SRCLIB, in member or file IMACAAAA, there is a table with the names of the other IMAC members or files and a description of what is in each. Find the name of the IMAC member or file for your adapter. For example, the member IMACDCOL is for IBM DCOLLECT data.
    2. If your adapter's IMAC member or file is not already in MXGUSER.SRCLIB, copy that member or file from MXG.SRCLIB to MXGUSER.SRCLIB. Keep the same name.
    3. At the beginning of your adapter's IMAC member or file in MXGUSER.SRCLIB, insert a call to the %RMDUPINT macro. Use the following form: %RMDUPINT ;
      Note: The %RMDUPINT macro does not have any parameters.
  2. Insert a call to %RMDUPDSN.
    In that same IMAC member or file, inside any one (but only one) of the _L macro definitions, insert a call to the %RMDUPDSN macro. Use the following form: %%RMDUPDSN(SOURCE=xxx) where xxx is a three-character identifier for your data source or adapter. You can choose any identifier as long as it is not one of the identifiers that is on the list of values for the SOURCE= parameter in the %RMDUPCHK macro.
    Note: The call has an unusual form: there are two leading percent signs and no trailing semicolon.
    For example, in IMACDCOL, you might change the following code:
        
              MACRO _LDCODSN DCOLDSET %
              MACRO _KDCODSN          %
              MACRO _LDCOCLU DCOLCLUS %
              ...
    The changed version would look like this:
        
              MACRO _LDCODSN %%rmdupdsn(source=DCO) DCOLDSET %
              MACRO _KDCODSN          %
              MACRO _LDCOCLU DCOLCLUS %
              ...
  3. Include the member or file that contains the call to %RMDUPCHK.
    1. In member or file IMACAAAA of MXG.SRCLIB, there is a list of the names of the members or files and a description of what is in each. Find the name of the member or file that decodes data for your adapter. For example, the member or file for IBM DCOLLECT data is named VMACDCOL.
    2. In MXG.SRCLIB, scan through your adapter's VMAC member or file for a reference to an IHDR member or file. For example, for IBM DCOLLECT data in MXG version 14.02 and later, VMACDCOL refers to a member or file that is called IHDRDCOL.
    3. If the VMAC member or file does refer to an IHDR member or file, include the member or file that contains the call to %RMDUPCHK in the IHDR member or file.
      If the member or file with the IHDR name is not already in MXGUSER.SRCLIB, copy that member or file from MXG.SRCLIB to MXGUSER.SRCLIB. Keep the same name.
      At the beginning of that member or file in MXGUSER.SRCLIB (typically, the member or file is empty), insert the following code:
      z/OS Specifics:
      %INCLUDE '<codeFocus>fully-qualified-PDS</codeFocus>($$GENxxx)';
      UNIX Specifics:
      %INCLUDE '<codeFocus>full-path-directory</codeFocus>/gen_xxx.sas'; 
      Remember that directory names and filenames are case sensitive on UNIX.
      Windows Specifics:
      %INCLUDE full-path-directory\gen_xxx.sas';
      For fully-qualified-PDS or full-path-directory, substitute the location of MXGUSER.SRCLIB. The member or filename refers to the member or file that you created in the preceding steps. The xxx refers to the three-character identifier that you chose in the preceding steps.
      Note: The %INCLUDEs in the preceding examples use absolute locations. If you prefer to use relative locations, use the following code.
      z/OS Specifics:
      %INCLUDE SOURCLIB($$GENxxx);
      UNIX Specifics:
      %INCLUDE SOURCLIB(gen_xxx);
      Windows Specifics:
      %INCLUDE SOURCLIB(gen_xxx);
    4. If the VMAC member or file does not refer to an IHDR member or file, include the member or file that contains the call to %RMDUPCHK in every one of the adapter's EX members or files.
      • Identify all of the EX members or files for your adapter. To find the member names, scan the adapter's VMAC member or file in MXG.SRCLIB and look for %INCLUDE statements that refer to members of SOURCLIB that have the prefix EX.
      • For each EX member or file, if a member or file with the same name does not already exist in MXGUSER.SRCLIB, copy the EX member or file from MXG.SRCLIB into MXGUSER.SRCLIB. Keep the same name.
      • In each of these EX members or files in MXGUSER.SRCLIB, there is an OUTPUT statement. In front of the OUTPUT statement, on a new line, insert the following line of code:
        z/OS Specifics:
        %INCLUDE
        'fully-qualified-PDS($$GENxxx)';
        UNIX Specifics:
        %INCLUDE
        'full-path-directory/gen_xxx.sas'; 
        Remember that directory names and filenames are case sensitive on UNIX.
        Windows Specifics:
        %INCLUDE
        'full-path-directory\gen_xxx.sas';
      • For fully-qualified-PDS and full-path-directory, substitute the location of MXGUSER.SRCLIB. The member name or filename refers to the member or file that you created in the preceding steps. The xxx refers to the three-character identifier that you chose in the preceding steps.
        The %INCLUDEs in the preceding examples use absolute locations. If you prefer to use relative locations, use the following code:
        z/OS Specifics:
        %INCLUDE SOURCLIB($$GENxxx);
        UNIX Specifics:
        %INCLUDE SOURCLIB(gen_xxx);
        Windows Specifics:
        %INCLUDE SOURCLIB(gen_xxx);
      Note: Most data sources have several OUTPUT statements (one OUTPUT statement per EX member) and thus the potential for calling %RMDUPCHK several times. The duplicate-data-checking macros have verification routines to ensure that %RMDUPCHK executes only once and checks the data only once.
  4. After the end of the MXG code, add a call to the %RMDUPUPD macro. Use the following form: %RMDUPUPD;
    Note: The %RMDUPUPD macro does not have any parameters.
If you are using the IMACKEEP member to specify your data source customizations, perform the following steps:
  1. Insert a call to the %RMDUPINT macro.
    1. If the IMACKEEP member or file is not already in MXGUSER.SRCLIB, copy it from MXG.SRCLIB to MXGUSER.SRCLIB. Keep the same name.
    2. In the IMACKEEP member or file in MXGUSER.SRCLIB, insert a call to the %RMDUPINT macro immediately after the %INCLUDE statement for IMACOLDV. Use the following form: %RMDUPINT;
      Note that the %RMDUPINT macro does not have any parameters.
  2. Insert a call to the %RMDUPDSN macro.
    1. In the IMACKEEP member or file in MXGUSER.SRCLIB, there is a group of _W macro definitions for your adapter. Copy any one (but only one) of these _W macro definitions and insert it immediately after the call to the %RMDUPINT macro.
    2. Inside that copy, insert a call to the %RMDUPDSN macro. Use the following form:%%RMDUPDSN(SOURCE=xxx)where xxx is the three-character identifier that you chose in the preceding steps. Note that the call has an unusual form: there are two leading percent signs and no trailing semicolon. For example, for IBM DCOLLECT, you might copy the following macro definition:
      MACRO _WDCODSN DCOLDSET%
      When you insert the call, the copy will look like this:
      MACRO _WDCODSN %%RMDUPDSN(SOURCE=DCO) DCOLDSET %
  3. Include the member or file that contains the call to the %RMDUPCHK macro.
    1. In MXG.SRCLIB, in member or file IMACAAAA, there is a list of the names of the members or files and a description of what is in each. Find the name of the member or file that decodes data for your adapter. For example, the member or filename for IBM DCOLLECT data is VMACDCOL.
    2. Scan through your adapter's VMAC member or file for a reference to an IHDR member or file. For example, for IBM DCOLLECT data in MXG version 14.02 and later, VMACDCOL refers to a member or file that is called IHDRDCOL.
    3. If the VMAC member or file does refer to an IHDR member or file, include the member or file that contains the %RMDUPCHK call into the IHDR member or file.
      • If the member or file with the IHDR name is not already in MXGUSER.SRCLIB, copy that member or file from MXG.SRCLIB to MXGUSER.SRCLIB. Keep the same name.
      • At the beginning of that member or file in MXGUSER.SRCLIB (typically, the member or file is empty), insert the following line of code:
        z/OS Specifics:
        %INCLUDE 'fully-qualified-PDS($$GENxxx)';
        UNIX Specifics:
        %INCLUDE 'full-path-directory/gen_xxx.sas'; 
        Remember that directory names and filenames are case sensitive on UNIX.
        Windows Specifics:
        %INCLUDE 'full-path-directory\gen_xxx.sas';
      • For fully-qualified-PDS or full-path-directory, substitute the location of MXGUSER.SRCLIB. The member name or filename refers to the member or file that you created in the preceding steps. The xxx refers to the three-character identifier that you chose in the preceding steps.
        Note: The %INCLUDEs in the preceding examples use absolute locations. If you prefer to use relative locations, use the following code:
        z/OS Specifics:
        %INCLUDE SOURCLIB($$GENxxx);
        UNIX Specifics:
        %INCLUDE SOURCLIB(gen_xxx);
        Windows Specifics:
        %INCLUDE SOURCLIB(gen_xxx);
    4. If the VMAC member or file does not refer to an IHDR member or file, include the member or file that contains the %RMDUPCHK call into _E macro definitions in the IMACKEEP member or file.
      In your adapter's VMAC member or file in MXGUSER.SRCLIB, there is a group of _E macro definitions. Copy all of these _E macro definitions from the VMAC member or file to the beginning of the IMACKEEP member or file.
      Inside each of the _E macro definitions in the IMACKEEP member or file, include the member or file that contains the call to %RMDUPCHK. Use the following form:
      z/OS Specifics:
      %%INCLUDE 'fully-qualified-PDS($$GENxxx)';
      UNIX Specifics:
      %%INCLUDE 'full-path-directory/gen_xxx.sas'; 
      Remember that directory names and filenames are case sensitive on UNIX.
      Windows Specifics:
      %%INCLUDE 'full-path-directory>\gen_xxx.sas';
      To specify the values for fully-qualified-PDS and full-path-directory, substitute the location of MXGUSER.SRCLIB. The member name or filename refers to the member or file that you created in the preceding steps. The xxx refers to the three-character identifier that you chose in the preceding steps.
      Note: The %INCLUDEs in the preceding examples use absolute locations. If you prefer to use relative locations, use the following code:
      z/OS Specifics:
      %%INCLUDE SOURCLIB($$GENxxx);
      UNIX Specifics:
          %%INCLUDE SOURCLIB(gen_xxx);
      Windows Specifics:
          %%INCLUDE SOURCLIB(gen_xxx);
      For example, you might want to use relative locations instead of the absolute locations in the following code:
          MACRO _EDCODSN
                %%INCLUDE SOURCLIB(EXDCODSN); 
                %
      The revised code that uses relative locations would look like this:
          MACRO _EDCODSN
                %%INCLUDE SOURCLIB($$GENDCO); 
                %%INCLUDE SOURCLIB(EXDCODSN);
                %
      Note: Although most data sources have several _E macros and, therefore, several %RMDUPCHK calls, the duplicate-data-checking macros use verification routines to ensure that the macro executes only once and checks the data only once.

Implementation for User-written Staging Code That Is Not Based on MXG Code

Overview

Use these instructions if you process data with user-written staging code that is not based on MXG code. Here is an overview of the preparation for duplicate-data checking in this case:
  1. Create a member or file for a call to the %RMDUPCHK macro.
  2. In that member or file, add the %RMDUPCHK call and modify the call as necessary.
  3. Insert a call to the %RMDUPINT macro in your staging code.
  4. Insert a call to the %RMDUPDSN macro in your staging code.
  5. Include in your staging code the member or file that contains the call to the %RMDUPCHK macro and add a call to the %RMDUPUPD macro at the end of your code.
The following sections describe, in detail, the steps that you perform.

Create the Member or File

MXG has a convenient storage place for the member or file.
If you have MXG, use the following list to determine how to specify the name of the member or file that is required in MXGUSER.SRCLIB for the host where you will process the data. If the appropriate member or file does not already exist in MXGUSER.SRCLIB, create it and use the name that corresponds with your host system.
z/OS Specifics: Create member $$GENxxx.
UNIX Specifics: Create file gen_xxx.sas.
Windows Specifics: Create file gen_xxx.sas.
If you do not have MXG, then use a storage location that you name. Use the following list to determine that action that you must perform:
z/OS Specifics: Create a PDS and, in it, create a member named $$GENxxx.
UNIX Specifics: Create a directory and, in it, create a file named gen_xxx.sas.
Windows Specifics: Create a directory and, in it, create a file named gen_xxx.sas.
The xxx noted in the preceding examples is a three-character identifier for your data source or adapter. You can choose any identifier as long as it is not one of the identifiers on the list of values for the SOURCE= parameter in %RMDUPCHK.

Modify the Member or File

In this member or file, if there is no call to the %RMDUPCHK macro, add a %RMDUPCHK call that is similar to the following code:
%RMDUPCHK(
          SOURCE=ABC,
          IDVAR=SYSTEMID,
          TIMESTMP=DTIME,
          ENDFILE=LAST,
          INT=00:10,
          SYSTEMS=4,
          RANGES=4,
          KEEP=53)
          ;
Note: If you are running on z/OS, you will find some additional %RMDUPCHK examples in the SAS IT Resource Management PDS that is named CPMISC.
The following list describes the parameters and values that are used in this example:
  • SOURCE=identifier
    specifies a three-character identifier for your data source or adapter. This example uses the value ABC.
  • IDVAR=variable-name
    specifies the name of the SAS variable that identifies the system or machine from which the data originated. This example uses the value SYSTEMID.
  • TIMESTMP=timestamp-variable-name
    specifies the name of the SAS variable that contains the timestamp for each input record. In this example, the value is DTIME.
  • ENDFILE=variable-name
    identifies the SAS variable that represents the end-of-file condition for your INFILE statement. In this example, the value is LAST.
  • INT=interval
    represents the maximum interval that is allowed between the datetime stamps on any two consecutive input records from the same system or machine. If the interval between the datetime stamp values exceeds the value of this parameter, then a new range is created. (See the RANGES= parameter for more information.) In this example, a range can contain gaps no larger than 10 minutes.
  • SYSTEMS=number-of-systems
    represents an estimate of the maximum number of systems or machines that collected data for your input file. In this example, the value indicates that data is from no more than four systems. Therefore, the SYSTEMID variable (which is specified in the preceding example as the value of the IDVAR= parameter) should contain no more than four values.
  • RANGES=number-of-ranges
    represents the maximum number of ranges that can occur while the data is being read. A new range is created when the difference between the datetime stamps of two consecutive records exceeds the value of the INT= parameter. This break is referred to as a gap in the data. In this example, the specified value is four. This means that there can be no more than three gaps greater than 10 minutes (the value of the INT= parameter).
  • KEEP=number-of-weeks
    represents the maximum number of weeks for which you want to retain control data. A range is aged out (removed) when the end-of-range datetime is older than permitted by the value of this parameter. In this example, the specified value is 53 weeks.
For a complete description of these and other parameters for this macro, see %RMDUPCHK.
Review and modify your calls using the following guidelines and recommendations:
  • For the SOURCE= value, specify the three-character identifier that you chose earlier.
  • For the ENDFILE= value, refer to your staging code.
  • For the IDVAR= value, refer to your staging code.
  • For the TIMESTMP= value, refer to your staging code.
  • Review the INT=, SYSTEMS=, RANGES=, and KEEP= values and change them to be appropriate for your site.
  • Review the TERM= and FORCE= values and change them to be appropriate for your purpose.
    Note: The combination TERM=YES, FORCE=YES is not meaningful.

Insert a Call to %RMDUPINT

In your staging code, insert a call to the %RMDUPINT macro in front of the DATA statement that initiates the DATA step that reads your raw data. Use the following form:
%RMDUPINT;
Note: The %RMDUPINT macro does not have any parameters.

Insert a Call to %RMDUPDSN

For the call, use the following form:
%RMDUPDSN(SOURCE=xxx)
In this form, xxx is the three-character source name that you chose in the preceding steps for this data source.
Note: The call has an unusual form: there is no trailing semicolon.
The call to the %RMDUPDSN macro must be inserted between the word DATA and the semicolon character (;) in the DATA statement that initiates the DATA step that reads the raw data and writes the staged data. For example, here is a DATA step that reads raw data and writes the staged data to GENLIB.STAGED:
             DATA GENLIB.STAGED;
             ...
             OUTPUT GENLIB.STAGED;
You would change that call as follows:
             DATA GENLIB.STAGED %rmdupdsn(source=xxx);
             ...
             OUTPUT GENLIB.STAGED;
Note: Any OUTPUT statement to the data set or view to which the data is being staged (in this case, GENLIB.STAGED) must explicitly specify the data set or view. Therefore, do not use the OUTPUT statement in this form:
OUTPUT ;
Instead, use the OUTPUT statement in this form:
             OUTPUT libref.data-set-or-view-name;

Include the Member or File That Contains the Call to %RMDUPCHK

Place the %INCLUDE statement in your staging code after the TIMESTMP and IDVAR variables have been read in from your raw data and before any logic that would execute an OUTPUT statement. Within that range, place the %INCLUDE statement as early in your code as possible to avoid any unnecessary processing of data that is discovered to be duplicate data.
Point the %INCLUDE statement to the member or file that you created in the preceding steps. If you want to specify absolute locations, insert the following line in your staging code and use the following form:
z/OS Specifics:
      %INCLUDE 'fully-qualified-PDS($$GENxxx)';
UNIX Specifics:
      %INCLUDE 'full-path-directory/gen_xxx.sas';
Note: Remember that directory names and filenames are case sensitive on UNIX.
Windows Specifics:
      %INCLUDE 'full-path-directory\gen_xxx.sas';
For fully-qualified-PDS or full-path-directory, substitute the location of MXGUSER.SRCLIB. The xxx refers to the three-character identifier that you chose in the preceding steps.
If you want to specify relative locations, use the following form:
  • If you stored the %RMDUPCHK call in MXGUSER.SRCLIB, use the following code:
    z/OS Specifics:
     %INCLUDE SOURCLIB($$GENxxx);
    UNIX Specifics:
    %INCLUDE SOURCLIB(gen_xxx);
    Operating Environment Information:
    %INCLUDE SOURCLIB(gen_xxx);
  • If you did not store the %RMDUPCHK call in MXGUSER.SRCLIB, create a fileref for your storage location by using the SAS FILENAME statement:
           FILENAME myfref 'fully-qualified-PDS-name-or-full-path-directory-name';
    In the statement, myfref is the name that you give to this fileref. After you create the fileref, use the following code:
    z/OS Specifics:
    %INCLUDE myfref ($$GENxxx)
    UNIX Specifics:
    %INCLUDE myfref (gen_xxx)
    Windows Specifics:
    %INCLUDE myfref (gen_xxx)
Note: Do not add the .sas extension on UNIX or Windows. The extension will be appended automatically on UNIX and Windows.

Implementation for SAP ERP, MS SCOM, HP Reporter, and VMware Adapters

The staging code for the SAP ERP, MS SCOM, HP Reporter, and VMware adapters is designed to read in only those rows in the input database that were not previously staged. Even though the staged table is overwritten each time a staging job is executed, a record is kept of the data that was read so that only new data is read when the staging job is run.
To accomplish this, the staging code creates control data sets in the staging library for that adapter. For all adapters, two control data sets are created for each staged table that is processed. These data sets are named xxxDAY and xxxCNTRL. These control data sets determine the raw data observations that should be staged. If the control data sets do not exist when the staging job is executed, then all of the data in the input database is staged.
The staging code for the SAP ERP, MS SCOM, HP Reporter, and VMware adapters always uses the control data sets so that only new data is staged. Therefore, the duplicate-data checking option is active by default for these adapters. All subsequent staging jobs that specify that adapter's staging library location use that library's control data sets.
Note: You can turn off the duplicate-data checking option by modifying the generated SAS code. For an example that shows how to do so, see Example 3: Using Macro Variables to Subset Data for HP Reporter, MS SCOM, and VMware Adapters.