How Is Data Checked for Duplicates?

Four macros control the process of duplicate-data checking:
  • %RMDUPINT loads macro definitions that are used by the other duplicate-data checking macros.
  • %RMDUPDSN generates the name (WORK._DUPCNTL) of the temporary SAS data set that will contain datetime ranges for the data that is being processed into the active IT data mart. It also generates the name of the sourceAUDIT data set for duplicate checking.
  • %RMDUPCHK checks for duplicate data by examining timestamps on data being read by the staging code. This macro also writes to the temporary control data set.
  • %RMDUPUPD updates the permanent control data sets with the information in the temporary control data set, by way of intermediate control data sets.
For more information about these duplicate-data checking macros, see Introduction to the Macros in SAS IT Resource Management.
Each of the duplicate-data checking macros performs a specific task. Together, these macros set up and manage duplicate-data checking. The macros are designed to check your data and to prevent duplicate data from being processed into the IT data mart. However, it is sometimes necessary to process data from a datetime range for which a machine's or system's data has already been processed. For example, you might need to process data into a table that you did not use earlier or that you accidentally deleted. In such cases, you can specify that the data is acceptable even though it appears to be duplicate data.
As raw data is read, one of the duplicate-data checking macros reviews the datetime information in each record and stores it in a SAS data set called the temporary control data set. Later, by way of intermediate control data sets, another macro merges the information in the temporary control data set into one or more SAS data sets that are called permanent control data sets.
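As a conceptual illustration of this bookkeeping, the following Python sketch models the temporary and permanent control data as dictionaries of per-machine datetime ranges. This is a hypothetical illustration only: the real control data sets are SAS data sets maintained by the %RMDUP* macros, and all names here are invented.

```python
from datetime import datetime

def build_temporary_control(records):
    """Hypothetical sketch: as (machine, timestamp) records are read,
    accumulate the observed datetime range per machine in a
    'temporary control' structure."""
    temp_control = {}
    for machine, ts in records:
        lo, hi = temp_control.get(machine, (ts, ts))
        temp_control[machine] = (min(lo, ts), max(hi, ts))
    return temp_control

def merge_into_permanent(permanent, temp_control):
    """Later step: merge the temporary ranges into the permanent
    control data, widening each machine's known datetime range."""
    for machine, (lo, hi) in temp_control.items():
        if machine in permanent:
            plo, phi = permanent[machine]
            permanent[machine] = (min(plo, lo), max(phi, hi))
        else:
            permanent[machine] = (lo, hi)
    return permanent
```

The two-step design mirrors the macro flow described above: %RMDUPCHK writes to the temporary control data set during the read, and %RMDUPUPD later folds that information into the permanent control data sets.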
When additional data is processed into the IT data mart, the timestamps of the incoming data are compared with the datetime information in the permanent control data sets to determine whether the new data has already been processed. If it has, the duplicate data is handled in the way that you specify.
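That comparison can be sketched as follows. Again, this is a hypothetical Python illustration of the idea, not the %RMDUPCHK implementation; the real check runs against SAS control data sets, and what happens to flagged records depends on the duplicate-handling option you choose.

```python
from datetime import datetime

def is_duplicate(machine, ts, permanent_control):
    """Return True if ts falls inside the datetime range already
    recorded for this machine. If the machine has no entry yet
    (for example, on the first run), nothing is flagged."""
    if machine not in permanent_control:
        return False
    lo, hi = permanent_control[machine]
    return lo <= ts <= hi

def filter_incoming(records, permanent_control):
    """Split incoming (machine, timestamp) records into new data
    and suspected duplicates."""
    fresh, dups = [], []
    for rec in records:
        machine, ts = rec
        (dups if is_duplicate(machine, ts, permanent_control) else fresh).append(rec)
    return fresh, dups
```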
A duplicate-data report is printed in the SAS log after the data is read. The report describes how many records were read for each machine or system and how many duplicates were found, if any. (If you specified Report = No on the Duplicate Checking page of the Staging Parameters tab of the staging transformation, this information is written to the sourceAUDIT file.)
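The per-machine counts in such a report could be tallied along these lines. This is a hypothetical sketch of the summary logic only; the actual report format and routing (SAS log versus sourceAUDIT) are handled by the staging code.

```python
def duplicate_report(results):
    """results: dict mapping machine/system name -> (records_read,
    duplicates_found). Returns one summary line per machine, in the
    spirit of the duplicate-data report described above."""
    return [f"{machine}: {read} records read, {dups} duplicate(s) found"
            for machine, (read, dups) in sorted(results.items())]
```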
Note: The first time that you run a job with duplicate-data checking enabled, the permanent control data sets have not yet been built, so the %RMDUPCHK macro cannot check the input records against them. Your data is not checked or rejected for duplicates, but the permanent control data sets are created and the datetime information for this data is saved in them. Data is checked only by datetime, although SMF data is also checked by system name. (For example, if you try to add a new record type but you have already read other record types from that adapter for that time period, the new records are not kept.) In this case, the duplicate-data report contains only a limited amount of information about your data.