Control Data Sets for Duplicate-Data Checking

Types of Control Data Sets

Control data sets (temporary, intermediate, and permanent) are the basis on which duplicate data is rejected. This section describes how each type of control data set is created, used, updated, and stored.
  • Temporary control data sets
    When input data is processed, the %RMDUPCHK macro creates a temporary control data set, WORK._DUPCNTL, which stores information about the raw data from one or more adapters. Specifically, for each machine or system that generated data, the temporary control data set stores the datetime ranges and record counts in the raw data. If raw data from more than one data source is processed in the same job, then information is appended to the temporary control data set with each execution of the %RMDUPCHK macro.
  • Intermediate control data sets
    When all the input data has been read and stored in the temporary control data set, the %RMDUPUPD macro writes the data from the temporary control data set, WORK._DUPCNTL, to separate intermediate data sets, which are located in the staging library for that staging job.
    Note: The first time a staging table is added to a staging transformation, a library is created to contain the control data information for all the staged tables that are created by that staging transformation. The library is called adapter-name Staging nnnn, where nnnn is a random number that ensures that the library name is unique within the IT data mart. For example, a library name might be DT Perf Sentry Staging 8926. (This library also contains other types of data.) If all the staged tables that are created by that staging transformation are subsequently deleted, and one or more new staged tables are added to the transformation, then a new library is created for the new staged tables. The first library (the one that was created for the original staged tables) is not used again. It is not automatically deleted, but you can delete it.
    If an intermediate data set already exists, then it is reused and the new data overwrites the old data. Otherwise, the data set is created and the new data is written to it.
  • Permanent control data sets
    The data in the intermediate control data sets is then merged by %RMDUPUPD into the corresponding permanent control data sets, which are stored and maintained in the staging library for that staging job. One permanent control data set, named sourceCNTRL, can exist for each adapter. Each data set contains information about that adapter's machines or systems, datetime ranges, and record counts.
    If a sourceCNTRL data set exists, the new data is merged into it. Otherwise, the data set is created and the new data is written to that data set.
    The permanent sourceAUDIT data set contains one observation for each ID column (such as MACHINE) and the number of new and potentially duplicate records that were processed.
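
The three stages described above can be sketched in Python. This is only an illustration of the data flow, not the logic of the %RMDUPCHK or %RMDUPUPD macros. The names summarize, write_intermediate, merge_permanent, XYZ_INTERMEDIATE, and XYZCNTRL are hypothetical. The sketch assumes one range per machine and assumes that a merge extends the range and adds the record counts; the actual merge rules are not documented in this section.

```python
from datetime import datetime


def summarize(records):
    """Temporary stage: per machine, track the datetime range and record count."""
    summary = {}
    for machine, stamp in records:
        entry = summary.setdefault(
            machine, {"start": stamp, "end": stamp, "count": 0}
        )
        entry["start"] = min(entry["start"], stamp)
        entry["end"] = max(entry["end"], stamp)
        entry["count"] += 1
    return summary


def write_intermediate(library, name, summary):
    """Intermediate stage: create the data set, or overwrite it if it exists."""
    library[name] = {machine: dict(entry) for machine, entry in summary.items()}


def merge_permanent(library, name, intermediate):
    """Permanent stage: create the data set, or merge into it if it exists."""
    permanent = library.setdefault(name, {})
    for machine, new in intermediate.items():
        old = permanent.get(machine)
        if old is None:
            permanent[machine] = dict(new)
        else:
            # Assumed merge rule: widen the range and accumulate the count.
            old["start"] = min(old["start"], new["start"])
            old["end"] = max(old["end"], new["end"])
            old["count"] += new["count"]


def run_staging_job(library, records):
    """One staging job: temporary, then intermediate, then permanent."""
    summary = summarize(records)
    write_intermediate(library, "XYZ_INTERMEDIATE", summary)  # hypothetical name
    merge_permanent(library, "XYZCNTRL", library["XYZ_INTERMEDIATE"])


# A dictionary stands in for the staging library.
staging_library = {}
run_staging_job(staging_library, [
    ("hostA", datetime(2024, 1, 1, 0, 0)),
    ("hostA", datetime(2024, 1, 1, 0, 15)),
    ("hostB", datetime(2024, 1, 1, 0, 0)),
])
run_staging_job(staging_library, [("hostA", datetime(2024, 1, 1, 0, 30))])
```

After the second job, the intermediate data set holds only the second job's data (it was overwritten), whereas the permanent data set still holds both machines, with the hostA range extended.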

Content of Permanent Control Data Sets

The content of the permanent control data sets is based on intervals and ranges in your data. The INT= parameter flags datetime gaps. When you specify the INT= parameter in %RMDUPCHK, you define the maximum gap allowable between records in the same range. If the gap between datetimes for two consecutive records exceeds this value, then a new range is created.
A range is deleted when the end-of-range datetime value is older than the number of weeks that are specified in the KEEP= parameter in %RMDUPCHK. However, if your data is continuous, then you have only one range, and its control information is never deleted because the end-of-range datetime value is constantly extended by new datetime information.
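
The retention rule can be illustrated with a minimal Python sketch. The function prune_ranges is hypothetical and is not part of SAS IT Resource Management. The sketch assumes that the age of a range is measured from the current datetime to its end-of-range value, and it represents each range as a (start, end) tuple.

```python
from datetime import datetime, timedelta


def prune_ranges(ranges, keep_weeks, now):
    """Keep only ranges whose end datetime is within keep_weeks of now."""
    cutoff = now - timedelta(weeks=keep_weeks)
    # Only the end of the range matters; the start can be arbitrarily old.
    return [(start, end) for start, end in ranges if end >= cutoff]


now = datetime(2024, 3, 1)
ranges = [
    # Continuous data: started long ago, but the end keeps being extended.
    (datetime(2023, 1, 1), datetime(2024, 2, 28)),
    # Isolated range that ended more than two weeks before now.
    (datetime(2024, 1, 1), datetime(2024, 1, 10)),
]
kept = prune_ranges(ranges, keep_weeks=2, now=now)
```

In this example only the first range is kept. Because the test uses the end of the range, a continuously extended range never ages out, however old its starting datetime is.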
Here are the ways that ranges are used with continuous and non-continuous data:
  • If your data is continuous and does not have any datetime gaps that exceed the value of the INT= parameter, then your data always updates the same range. In this case, the permanent control data set contains one range for each unique value of the variable that is specified by the IDVAR= parameter. The values of that variable are typically the machine or system names from which the raw data originated. (Only the CSV, RRDtool, and user-written adapters require you to enter this parameter.)
  • If your data is not continuous, then the permanent control data set contains multiple ranges for each unique value of the variable that is specified by the IDVAR= parameter. Each range is prefixed with a value of the variable that is specified by the IDVAR= parameter.
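
The way gaps turn into ranges can be sketched in Python. This sketch is an assumption-laden illustration, not the macro's implementation: build_ranges is a hypothetical function, max_gap stands in for the INT= value (whose exact format is not shown in this section), and the first element of each record stands in for the IDVAR= variable.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_ranges(records, max_gap):
    """Group datetimes by ID value and split them into ranges at large gaps.

    Returns {id_value: [(start, end, record_count), ...]}.
    """
    by_id = defaultdict(list)
    for id_value, stamp in records:
        by_id[id_value].append(stamp)

    ranges = {}
    for id_value, stamps in by_id.items():
        stamps.sort()
        start, end, count = stamps[0], stamps[0], 1
        result = []
        for stamp in stamps[1:]:
            if stamp - end > max_gap:
                # Gap exceeds the allowed maximum: close this range, open a new one.
                result.append((start, end, count))
                start, end, count = stamp, stamp, 1
            else:
                # Gap is within the maximum: extend the current range.
                end = stamp
                count += 1
        result.append((start, end, count))
        ranges[id_value] = result
    return ranges


day = datetime(2024, 1, 1)
records = [
    # hostA: continuous 15-minute data, so one range.
    ("hostA", day), ("hostA", day + timedelta(minutes=15)),
    ("hostA", day + timedelta(minutes=30)),
    # hostB: a gap of almost three hours, so two ranges.
    ("hostB", day), ("hostB", day + timedelta(minutes=15)),
    ("hostB", day + timedelta(hours=3)),
]
ranges = build_ranges(records, max_gap=timedelta(minutes=29))
```

Here hostA ends up with a single range and hostB with two, which mirrors the continuous and non-continuous cases described above.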
Note: For the HP Reporter and VMware adapters, one set of control data sets is created for each table by default. To create a single set of control data sets for one of these adapters instead, specify a three-character identifier in the SOURCE= parameter of the %RMDUPCHK macro. SAS IT Resource Management then prefixes that identifier to the names of the control data sets.