Duplicate-Data Checking Macros

%RMDUPCHK

%RMDUPCHK Overview

%RMDUPCHK checks for duplicate data and deletes it. It also builds up record counts of incoming and deleted data and datetime ranges for each system or machine. These record counts are stored in the control data set. If the control data set indicates that a gap was detected in the data, a report is generated.
The control data set is stored in the same library as the staged tables. This data set is created and managed by the %RMDUPxxx macros. (Users do not usually access this library.)
Note: For information about how to set up the %RMDUPCHK macro, see Working with Duplicate-Data Checking Macros. For information about how control data sets work, see Control Data Sets for Duplicate-Data Checking.

%RMDUPCHK Syntax

%RMDUPCHK(
ENDFILE=variable-name
,IDVAR=variable-name
,SOURCE=identifier
,TIMESTMP=timestamp-variable-name
<,FORCE=YES | NO>
<,INT=interval>
<,KEEP=number-of-weeks>
<,REPORT=YES | NO>
<,TERM=YES | NO>
);

%RMDUPCHK Required Arguments

ENDFILE=variable-name

specifies the name of the SAS variable that is used as the END= keyword for the SAS INFILE statement that reads the raw data.

IDVAR=variable-name

specifies the name of a SAS character variable that identifies the system or machine that generated the input data. The name of this variable can be no more than 32 characters in length.

SOURCE=identifier

specifies a unique three-character code that identifies the type of data.

TIMESTMP=timestamp-variable-name

specifies the name of the SAS variable that contains the datetime stamp that uniquely identifies the time of the event or interval being recorded.

The SOURCE entries for the supported adapters are listed in the following table.

Source Names for Each Adapter

Adapter                    Value for the SOURCE Parameter for %RMDUPCHK
Amazon CloudWatch          (Multiple values are needed so that %RMDUPCHK can be invoked with each value.)
ASG TMON2CIC               TM2
ASG TMONDB2                TMD
ASG TMONDB2 V5             TM5
BMC Mainview               IMF
BMC Perf Mgr               PAT
CA TMS                     TMS
Comma Separated Values     CSV
DT Perf Sentry             NTS
DT Perf Sentry with MXG    NTS
Ganglia                    GAN
HP Perf Agent              (Multiple values are needed so that %RMDUPCHK can be invoked with each value.)
HP Reporter                (Multiple values are needed so that %RMDUPCHK can be invoked with each value.)
IBM DCOLLECT               DCO
IBM EREP                   ERP
IBM IMS                    IMS
IBM SMF                    SMF
IBM TPF                    TPF
IBM VMMON                  VMM
MS SCOM                    SCO
RRDtool                    RRD
SAP ERP                    BAT, SAP, and others. (Multiple values are needed so that %RMDUPCHK can be invoked with each value.)
SAR                        SAR
SAS EV                     (Multiple values are needed so that %RMDUPCHK can be invoked with each table source value.)
SNMP                       SNM
VMware vCenter             (Multiple values are needed so that %RMDUPCHK can be invoked with each value.)
Web Log                    WWW

%RMDUPCHK Options

FORCE=YES | NO

specifies whether input data that is detected as a duplicate should still be processed.

  • FORCE=YES indicates that, if a duplicate is detected, the duplicate data should be processed.
  • FORCE=NO indicates that duplicate data should not be processed.
The default value for this option is NO.

INT=interval

represents the maximum time gap (or interval) that is to be allowed between the timestamps on any two consecutive records from the same system or machine. If the interval between the timestamp values exceeds the value of this option, then an observation with the new time range is created in the control data set. This is referred to as a gap in the data.

The value for this option must be provided in the format hh:mm, where hh represents hours and mm represents minutes. For example, to specify an interval of 14 minutes, use INT=0:14. To specify an interval of 1 hour and 29 minutes, use INT=1:29.
The default value for this option is 0:29, or 29 minutes.
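
The gap test that INT= describes can be illustrated with a short DATA step. This is a minimal sketch and not the code that %RMDUPCHK runs: the table names work.sample and work.gaps, the machine name SRV1, and the variables max_gap, elapsed, and gap are all hypothetical. It converts the hh:mm value to seconds and flags any record that follows the previous record from the same machine by more than that interval.

```sas
/* Hypothetical sample data: one machine, three records */
data work.sample;
   input machine $ datetime :datetime18.;
   format datetime datetime18.;
   datalines;
SRV1 01JAN2015:00:00:00
SRV1 01JAN2015:00:15:00
SRV1 01JAN2015:01:00:00
;
run;

/* Illustration only: flag records that follow a gap longer than 0:29 */
data work.gaps;
   set work.sample;
   by machine;
   max_gap = input('0:29', time5.);   /* hh:mm converted to seconds */
   elapsed = dif(datetime);           /* seconds since the previous record */
   gap = (not first.machine) and (elapsed > max_gap);
run;
```

In this sample, only the third record is flagged, because it arrives 45 minutes after the second one, which exceeds the 29-minute interval.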

KEEP=number-of-weeks

specifies the number of weeks for which control data should be kept. Because this value represents the number of Sundays between two dates, a value of 2 results in a maximum retention period of 20 days.

The default value for this option is 2.
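
The 20-day figure follows from counting Sundays. As a rough check, the sketch below uses the SAS INTCK function with the 'week' interval, which counts the Sunday boundaries between two dates. The variable names are hypothetical, and the assumption that a record is kept while the count does not exceed the KEEP= value is inferred from the description above and is not confirmed.

```sas
/* Illustration only: count Sunday boundaries after a record dated on a Sunday */
data _null_;
   record_date = '04JAN2015'd;        /* 04JAN2015 is a Sunday */
   do days_later = 19 to 21;
      sundays = intck('week', record_date, record_date + days_later);
      put days_later= sundays=;       /* writes the counts to the SAS log */
   end;
run;
```

The count stays at 2 through day 20 and becomes 3 on day 21, so under that assumption a record that is dated on a Sunday would be retained for at most 20 days when KEEP=2.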

REPORT=YES | NO

specifies whether to display the duplicate-data checking messages in the SAS log or to save the messages in an audit table. If set to YES, all the messages from duplicate-data checking appear in the SAS log. If set to NO, the messages are saved in an audit table that is stored in the staging library. The name of the audit table is sourceAUDIT (where source is the 3-character data source code).

The default value of this option is YES.
Note: If you are monitoring very high numbers of resources, setting this option to NO can be beneficial. Eliminating the report reduces CPU consumption, shortens elapsed time, and makes the SAS log more manageable.

TERM=YES | NO

controls whether SAS terminates if any duplicate input data is detected.

The default value of this option is NO.

%RMDUPCHK Notes

The Adapter Setup wizard prompts the user to specify how to handle duplicate records. Valid entries for the mode of duplicate-data checking are Discard, Force, or Terminate.
  • Discard: Duplicate-data-checking macros are executed. FORCE=NO and TERM=NO are implied.
  • Force: Duplicate-data-checking macros are executed. FORCE=YES and TERM=NO are implied.
  • Terminate: Duplicate-data-checking macros are executed. FORCE=NO and TERM=YES are implied.
You can change the mode of duplicate-data-checking for a table on the Properties dialog box for that table.
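
For hand-coded staging code, the three modes can be expressed with explicit FORCE= and TERM= values. The fragment below is a sketch that reuses the argument values from the %RMDUPCHK example in this section (_eof, machine, nts, datetime). The three calls are alternatives, so only one of them would be coded for a given source.

```sas
/* Discard: duplicate data is not processed, and SAS continues */
%rmdupchk(endfile=_eof, idvar=machine, source=nts, timestmp=datetime,
          force=no, term=no);

/* Force: duplicate data is processed anyway */
%rmdupchk(endfile=_eof, idvar=machine, source=nts, timestmp=datetime,
          force=yes, term=no);

/* Terminate: SAS terminates if duplicate data is detected */
%rmdupchk(endfile=_eof, idvar=machine, source=nts, timestmp=datetime,
          force=no, term=yes);
```

Because FORCE=NO and TERM=NO are the documented defaults, the Discard call is equivalent to omitting both options.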

%RMDUPCHK Example

The following example provides duplicate checking for the data that is input from the NTSMF adapter:
%rmdupchk(
          endfile=_eof,
          idvar=machine,
          int=00:18,
          keep=52,
          report=yes,
          source=nts,
          timestmp=datetime
          );  

%RMDUPDSN

%RMDUPDSN Overview

%RMDUPDSN generates the name of the data duplication control data set. This is a temporary SAS data set that will contain datetime ranges for the data that is being processed when duplicate-data checking is enabled. This information is also used by other duplicate-data-checking macros, such as %RMDUPCHK and %RMDUPUPD.
For supplied adapters, the %RMDUPDSN macro is automatically submitted in the staging code when duplicate-data checking is enabled. For user-written adapters, the duplicate-data checking is not automatically enabled. For information about how to enable duplicate-data checking in the user-written staging code, see Duplicate Checking.

%RMDUPDSN Syntax

%RMDUPDSN(
SOURCE=identifier );

%RMDUPDSN Required Arguments

SOURCE=identifier

specifies a unique three-character code that identifies the type of data. It should be the same as the value that was coded for the SOURCE= parameter of the %RMDUPCHK macro.

%RMDUPDSN Notes

This macro executes only one time. It creates a global macro variable, &RMDUPDSN, that contains the name of the data duplication control data set. The global macro variable is resolved by the %RMDUPDSN macro.

%RMDUPDSN Example

This example shows the creation of a control data set, called cpnts.dsn, that is used to detect data duplication.
%rmdupdsn(
          source=nts
         );  

%RMDUPINT

%RMDUPINT Overview

%RMDUPINT sets up the macro definitions that are used by the other duplicate-data-checking macros. These macro definitions generate the name of the data duplication control data set.

%RMDUPINT Syntax

%RMDUPINT;

%RMDUPINT Notes

This macro requires no parameters. It contains the macro definitions and naming conventions for the control data set. The %RMDUPINT macro also defines the &RMDUPDSN macro variable that contains the data set name that is used in the %RMDUPDSN macro.
For supplied adapters, the %RMDUPINT macro is automatically submitted in the staging code when duplicate-data checking is enabled. For user-written adapters, the duplicate-data checking is not automatically enabled. To enable duplicate-data checking, in the user-written staging code, specify the %RMDUPINT macro in front of the staging code.
For information about how to enable duplicate-data checking in the user-written staging code, see Duplicate Checking.

%RMDUPINT Example

%rmdupint;

%RMDUPUPD

%RMDUPUPD Overview

%RMDUPUPD updates the permanent control data sets with information from a temporary control data set.

%RMDUPUPD Syntax

%RMDUPUPD;

%RMDUPUPD Notes

This macro requires no parameters. It reads the temporary control data set that was built by the %RMDUPCHK macro. It splits the contents into individual control data sets, depending on the number of staging transformations that contributed to the control data set.
The individual control data sets are then merged with their corresponding permanent data sets. If, during the merging process, the time interval between any two records is greater than the allowed time interval, then both records are output. (The allowed time interval is the value that was specified by the INT= option of the %RMDUPCHK macro.)
%RMDUPUPD subsequently generates a report that informs the user of the possibility of missing data. Records that relate to data that is older than the KEEP= value are deleted.
This macro is executed automatically by the staging transformations of all the supplied adapters. For user-written staging transformations, this macro must be coded to execute after the staged tables have been populated.
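
For user-written staging code, the ordering of the four macros might look like the following sketch. Only the ordering of %RMDUPINT, the staging step, and %RMDUPUPD comes from the descriptions in this section. The fileref rawdata, the table work.staged, and the input layout are hypothetical. The placement of %RMDUPDSN before the staging step and of %RMDUPCHK inside the staging DATA step are assumptions, so check them against Working with Duplicate-Data Checking Macros before using this pattern.

```sas
/* Sketch only: rawdata, work.staged, and the INPUT layout are hypothetical */
%rmdupint;                          /* set up the macro definitions first */
%rmdupdsn(source=nts);              /* assumption: name the control data set next */

data work.staged;
   infile rawdata end=_eof;         /* END= variable matches ENDFILE= below */
   input machine $ datetime :datetime18.;
   /* Assumption: %RMDUPCHK is invoked inside the staging DATA step */
   %rmdupchk(endfile=_eof, idvar=machine, source=nts, timestmp=datetime);
run;

%rmdupupd;                          /* runs after the staged table is populated */
```

The same SOURCE= value is passed to %RMDUPDSN and %RMDUPCHK, as the %RMDUPDSN argument description requires.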

%RMDUPUPD Example

%rmdupupd;