COMPARE Procedure

Concepts: COMPARE Procedure

Comparisons Using PROC COMPARE

PROC COMPARE first compares the following:
  • data set attributes (set by the data set options TYPE= and LABEL=).
  • variables. PROC COMPARE checks each variable in one data set to determine whether it matches a variable in the other data set.
  • attributes (type, length, labels, formats, and informats) of matching variables.
  • observations. PROC COMPARE checks each observation in one data set to determine whether it matches an observation in the other data set. PROC COMPARE matches observations either by their position in the data sets or by the values of the ID variable.
After making these comparisons, PROC COMPARE compares the values in the parts of the data sets that match. PROC COMPARE compares the data either by the position of observations or by the values of an ID variable.

A Comparison by Position of Observations

The following figure shows two data sets. The data inside the shaded boxes shows the part of the data sets that the procedure compares. Assume that variables with the same names have the same type.
Comparison by the Positions of Observations
When you use PROC COMPARE to compare data set TWO with data set ONE, the procedure compares the first observation in data set ONE with the first observation in data set TWO, the second observation in ONE with the second observation in TWO, and so on. In each observation that it compares, the procedure compares the values of IDNUM, NAME, GENDER, and GPA.
The procedure does not report on the values of the last two observations or the variable YEAR in data set TWO because there is nothing to compare them with in data set ONE.
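A positional comparison like the one described above can be sketched as follows. The data values are hypothetical and are not those shown in the figure; only the variable names (IDNUM, NAME, GENDER, GPA, YEAR) and the data set names ONE and TWO come from the text. TWO is given an extra variable and an extra observation so that the behavior described above can be observed.

```sas
/* Hypothetical data: the values are illustrative only. */
data one;
   input idnum name $ gender $ gpa;
   datalines;
1001 Adams f 3.70
1002 Baker m 3.30
;

data two;
   input idnum name $ gender $ gpa year $;
   datalines;
1001 Adams f 3.70 2
1002 Baker m 3.40 2
1003 Clark f 3.10 3
;

/* No ID statement: observation 1 of ONE is compared with     */
/* observation 1 of TWO, observation 2 with observation 2,    */
/* and so on. YEAR and the third observation of TWO have no   */
/* counterpart in ONE, so their values are not compared.      */
proc compare base=one compare=two;
run;
```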

A Comparison with an ID Variable

In a simple comparison, PROC COMPARE uses the observation number to determine which observations to compare. When you use an ID variable, PROC COMPARE uses the values of the ID variable to determine which observations to compare. ID variables should have unique values and must have the same type in both data sets.
For the two data sets shown in the following figure, assume that IDNUM is an ID variable and that IDNUM has the same type in both data sets. The procedure compares the observations that have the same value for IDNUM. The data inside the shaded boxes shows the part of the data sets that the procedure compares.
Comparison by the Value of the ID Variable
The data sets contain three matching variables: NAME, GENDER, and GPA. They also contain five matching observations: the observations with values of 2998, 9866, 2118, 3847, and 2342 for IDNUM.
Data set TWO contains two observations (IDNUM=7565 and IDNUM=1755) for which data set ONE contains no matching observations. Similarly, no variable in data set ONE matches the variable YEAR in data set TWO.
See Comparing Observations with an ID Variable for an example that uses an ID variable.
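As a minimal sketch of a comparison that uses an ID variable, the following code matches observations on IDNUM. The data values are hypothetical and are not those in the figure. Both data sets are sorted by IDNUM first, on the assumption that the ID statement is used without its NOTSORTED option.

```sas
/* Hypothetical data: the values are illustrative only. */
data one;
   input idnum name $ gender $ gpa;
   datalines;
2002 Baker m 3.30
1001 Adams f 3.70
;

data two;
   input idnum name $ gender $ gpa year $;
   datalines;
3003 Clark f 3.10 3
1001 Adams f 3.70 2
2002 Baker m 3.40 2
;

/* Sort both data sets by the ID variable before comparing. */
proc sort data=one;
   by idnum;
run;

proc sort data=two;
   by idnum;
run;

/* Observations are matched on the value of IDNUM, not on    */
/* their position. IDNUM=3003 has no match in ONE.           */
proc compare base=one compare=two;
   id idnum;
run;
```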

The Equality Criterion

Using the CRITERION= Option

The COMPARE procedure judges numeric values unequal if the magnitude of their difference, as measured according to the METHOD= option, is greater than the value of the CRITERION= option. PROC COMPARE provides four methods for applying CRITERION=:
  • The EXACT method tests for exact equality.
  • The ABSOLUTE method compares the absolute difference to the value specified by CRITERION=.
  • The RELATIVE method compares the absolute relative difference to the value specified by CRITERION=.
  • The PERCENT method compares the absolute percent difference to the value specified by CRITERION=.
For a numeric variable compared, let x be its value in the base data set and let y be its value in the comparison data set. If both x and y are nonmissing, then the values are judged unequal according to the value of METHOD= and the value of CRITERION= (γ) as follows:
  • If METHOD=EXACT, then the values are unequal if y does not equal x.
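  • If METHOD=ABSOLUTE, then the values are unequal if |y − x| > γ.
  • If METHOD=RELATIVE, then the values are unequal if |y − x| / ((|x| + |y|)/2 + δ) > γ.
    The values are equal if x=y=0.
  • If METHOD=PERCENT, then the values are unequal if 100 × |y − x| / |x| > γ for x ≠ 0,
    or y ≠ 0 for x = 0.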
If x or y is missing, then the comparison depends on the NOMISSING option. If the NOMISSING option is in effect, then a missing value is always judged equal to any value. Otherwise, a missing value is judged equal only to a missing value of the same type (that is, .=., .^=.A, .A=.A, .A^=.B, and so on).
If the value that is specified for CRITERION= is negative, then the actual criterion that is used, γ, is equal to the absolute value of the specified criterion multiplied by a very small number, ε (epsilon), that depends on the numerical precision of the computer. This number ε is defined as the smallest positive floating-point value such that, using machine arithmetic, 1−ε<1<1+ε. Round-off or truncation error in floating-point computations is typically a few orders of magnitude larger than ε. CRITERION=−1000 often provides a reasonable test of the equality of computed results at the machine level of precision.
The value δ added to the denominator in the RELATIVE method is specified in parentheses after the method name: METHOD=RELATIVE(δ). If not specified in METHOD=, then δ defaults to 0. The value of δ can be used to control the behavior of the error measure when both x and y are very close to 0. If δ is not given and x and y are very close to 0, then any error produces a large relative error (in the limit, 2).
Specifying a value for δ avoids this extreme sensitivity of the RELATIVE method for small values. If you specify METHOD=RELATIVE(δ) CRITERION=γ when both x and y are much smaller than δ in absolute value, then the comparison is as if you had specified METHOD=ABSOLUTE CRITERION=δγ. However, when either x or y is much larger than δ in absolute value, the comparison is like METHOD=RELATIVE CRITERION=γ. For moderate values of x and y, METHOD=RELATIVE(δ) CRITERION=γ is, in effect, a compromise between METHOD=ABSOLUTE CRITERION=δγ and METHOD=RELATIVE CRITERION=γ.
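The effect of δ can be illustrated with a small sketch. The data sets BASE_DS and COMP_DS, the variables KEY and X, and the chosen values of δ and γ are all hypothetical. The sketch assumes that the RELATIVE measure is the absolute difference divided by the mean of the absolute values plus δ, that is, |y − x| / ((|x| + |y|)/2 + δ).

```sas
/* Hypothetical data: one pair of ordinary values and one    */
/* pair of values that are both very close to 0.             */
data base_ds;
   input key x;
   datalines;
1 100
2 0.000001
;

data comp_ds;
   input key x;
   datalines;
1 100.001
2 0.000002
;

/* Without delta: for KEY=2 the measure is about 0.67, which */
/* exceeds CRITERION=0.01, so that pair is judged unequal.   */
proc compare base=base_ds compare=comp_ds
             method=relative criterion=0.01;
   id key;
run;

/* With delta=0.001: for KEY=2 the measure drops to roughly  */
/* 0.001, which is below the criterion, so the pair is       */
/* judged equal. KEY=1 is about 0.00001 in both runs.        */
proc compare base=base_ds compare=comp_ds
             method=relative(0.001) criterion=0.01;
   id key;
run;
```

In the second run, the small pair is treated much as it would be under METHOD=ABSOLUTE CRITERION=δγ, while the larger pair is still judged on a relative basis.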
For character variables, if one value is longer than the other, then the shorter value is padded with blanks for the comparison. Nonblank character values are judged equal only if they agree at each character. If the NOMISSING option is in effect, then blank character values are judged equal to anything.

Definition of Difference and Percent Difference

In the reports of value comparisons and in the OUT= data set, PROC COMPARE displays difference and percent difference values for the numbers compared. These quantities are defined using the value from the base data set as the reference value. For a numeric variable compared, let x be its value in the base data set and let y be its value in the comparison data set. If x and y are both nonmissing, then the difference and percent difference are defined as follows:
  • Difference = y − x
  • Percent Difference = (y − x) / x × 100 for x ≠ 0
  • Percent Difference = missing for x = 0
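One way to inspect these quantities is to write them to an OUT= data set. The following is a sketch under stated assumptions: the data sets BASE_DS and COMP_DS, the variables KEY and X, and the output data set name DIFFS are hypothetical. The OUTDIF and OUTPERCENT options request difference and percent-difference observations, and NOPRINT suppresses the printed report.

```sas
/* Hypothetical data: the second base value is 0, so its     */
/* percent difference has no reference value.                */
data base_ds;
   input key x;
   datalines;
1 50
2 0
;

data comp_ds;
   input key x;
   datalines;
1 55
2 3
;

/* Write difference and percent-difference observations to   */
/* DIFFS. For KEY=1, the definitions give a difference of    */
/* 55 - 50 = 5 and a percent difference of 5/50*100 = 10.    */
/* For KEY=2, the difference is 3 and the percent difference */
/* is expected to be missing because the base value is 0.    */
proc compare base=base_ds compare=comp_ds
             out=diffs outdif outpercent noprint;
   id key;
run;

proc print data=diffs;
run;
```

In the OUT= data set, the _TYPE_ variable distinguishes the difference observations from the percent-difference observations.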

How PROC COMPARE Handles Variable Formats

PROC COMPARE compares unformatted values. If you have two matching variables that are formatted differently, then PROC COMPARE lists the formats of the variables.