COMPARE Procedure

Results: COMPARE Procedure

Results Reporting

PROC COMPARE reports the results of its comparisons in the following ways:
  • the SAS log
  • return codes stored in the automatic macro variable SYSINFO
  • procedure output
  • output data sets

SAS Log

When you use the WARNING, PRINTALL, or ERROR option, PROC COMPARE writes a description of the differences to the SAS log.

Macro Return Codes (SYSINFO)

PROC COMPARE stores a return code in the automatic macro variable SYSINFO. The value of the return code provides information about the result of the comparison. By checking the value of SYSINFO after PROC COMPARE has run and before any other step begins, SAS macros can use the results of a PROC COMPARE step to determine what action to take or what parts of a SAS program to execute.
The following table is a key for interpreting the SYSINFO return code from PROC COMPARE. For each of the conditions listed, the associated value is added to the return code if the condition is true. Thus, the SYSINFO return code is the sum of the codes listed in the following table for the applicable conditions:
Macro Return Codes
Bit  Condition  Code   Hexadecimal  Description
1    DSLABEL    1      0001X        Data set labels differ
2    DSTYPE     2      0002X        Data set types differ
3    INFORMAT   4      0004X        Variable has different informat
4    FORMAT     8      0008X        Variable has different format
5    LENGTH     16     0010X        Variable has different length
6    LABEL      32     0020X        Variable has different label
7    BASEOBS    64     0040X        Base data set has observation not in comparison
8    COMPOBS    128    0080X        Comparison data set has observation not in base
9    BASEBY     256    0100X        Base data set has BY group not in comparison
10   COMPBY     512    0200X        Comparison data set has BY group not in base
11   BASEVAR    1024   0400X        Base data set has variable not in comparison
12   COMPVAR    2048   0800X        Comparison data set has variable not in base
13   VALUE      4096   1000X        A value comparison was unequal
14   TYPE       8192   2000X        Conflicting variable types
15   BYVAR      16384  4000X        BY variables do not match
16   ERROR      32768  8000X        Fatal error: comparison not done
These codes are ordered and scaled to enable a simple check of the degree to which the data sets differ. For example, if you want to check that two data sets contain the same variables, observations, and values, but you do not care about differences in labels, formats, and so on, then use the following statements:
proc compare base=SAS-data-set
             compare=SAS-data-set;
run;

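%if &sysinfo >= 64 %then
   %do;
      /* Replace this message with your own error handling. */
      %put ERROR: PROC COMPARE found differences (SYSINFO=&sysinfo).;
   %end;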
You can examine individual bits in the SYSINFO value by using DATA step bit-testing features to check for specific conditions. For example, to check for the presence of observations in the base data set that are not in the comparison data set, use the following statements:
proc compare base=SAS-data-set
             compare=SAS-data-set;
run;

%let rc=&sysinfo;
data _null_;
   if &rc='1......'b then
      put 'Observations in Base but not in Comparison Data Set';
run;
PROC COMPARE must run before you check SYSINFO, and you must capture the SYSINFO value before another SAS step starts, because every SAS step resets SYSINFO.
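Because the return code is a sum of powers of two, each condition can be tested independently with a bitwise AND. For example, a return code of 4128 is 4096 + 32, which means that bit 13 (VALUE) and bit 6 (LABEL) are set. The following is a minimal sketch of that decoding, not part of PROC COMPARE itself: the macro name DECODE_COMPARE_RC is hypothetical, and the sketch assumes the PROCLIB.ONE and PROCLIB.TWO data sets used elsewhere in this section.

```sas
/* Capture the return code immediately after the PROC COMPARE step. */
proc compare base=proclib.one compare=proclib.two noprint;
run;
%let rc=&sysinfo;

/* Hypothetical helper: report each condition whose bit is set in RC. */
%macro decode_compare_rc(rc);
   %local i name names;
   /* Condition names in bit order, taken from the Macro Return Codes table. */
   %let names=DSLABEL DSTYPE INFORMAT FORMAT LENGTH LABEL BASEOBS COMPOBS;
   %let names=&names BASEBY COMPBY BASEVAR COMPVAR VALUE TYPE BYVAR ERROR;
   %if &rc = 0 %then %put NOTE: No differences found (SYSINFO=0).;
   %do i=1 %to 16;
      %let name=%scan(&names, &i);
      /* The code for bit i is 2**(i-1); BAND is nonzero when that bit is set. */
      %if %sysfunc(band(&rc, %eval(2**(&i-1)))) %then
         %put NOTE: SYSINFO bit &i (&name) is set.;
   %end;
%mend decode_compare_rc;

%decode_compare_rc(&rc)
```

Copying SYSINFO into the macro variable RC first matters here: macro statements such as %LET and %PUT do not reset SYSINFO, but any later DATA or PROC step would.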

Procedure Output

Procedure Output Overview

The following sections show and describe the default output of the two data sets shown in Overview: COMPARE Procedure. Because PROC COMPARE produces lengthy output, the output is presented in seven pieces.
options nodate pageno=1 linesize=80 pagesize=60;
proc compare base=proclib.one compare=proclib.two;
run;

Data Set Summary

This report lists the attributes of the data sets that are being compared. These attributes include the following:
  • the data set names
  • the data set types, if any
  • the data set labels, if any
  • the dates created and last modified
  • the number of variables in each data set
  • the number of observations in each data set

Variables Summary

This report compares the variables in the two data sets. The first part of the report lists the following:
  • the number of variables the data sets have in common
  • the number of variables in the base data set that are not in the comparison data set and vice versa
  • the number of variables in both data sets that have different types
  • the number of variables that differ on other attributes (length, label, format, or informat)
  • the number of BY, ID, VAR, and WITH variables specified for the comparison
The second part of the report lists matching variables with different attributes and shows how the attributes differ. (The COMPARE procedure omits variable labels if the line size is too small for them.)
The following output shows the Data Set Summary and the Variables Summary.
Partial Output Showing the Data Set Summary and Variables Summary

Observation Summary

This report provides information about observations in the base and comparison data sets. First, the report identifies the first and last observation in each data set, the first and last matching observations, and the first and last different observations. Then, the report lists the following:
  • the number of observations that the data sets have in common
  • the number of observations in the base data set that are not in the comparison data set and vice versa
  • the total number of observations in each data set
  • the number of matching observations for which PROC COMPARE judged some variables unequal
  • the number of matching observations for which PROC COMPARE judged all variables equal
The following output shows the Observation Summary.
Partial Output Showing the Observation Summary

Values Comparison Summary

This report first lists the following:
  • the number of variables compared with all observations equal
  • the number of variables compared with some observations unequal
  • the number of variables with differences involving missing values, if any
  • the total number of values judged unequal
  • the maximum difference measure between unequal values for all pairs of matching variables (for differences not involving missing values)
In addition, for the variables for which some matching observations have unequal values, the report lists the following:
  • the name of the variable
  • other variable attributes
  • the number of times PROC COMPARE judged the variable unequal
  • the maximum difference measure found between values (for differences not involving missing values)
  • the number of differences caused by comparison with missing values, if any
The following output shows the Values Comparison Summary.
Partial Output Showing the Values Comparison Summary

Value Comparison Results

This report consists of a table for each pair of matching variables judged unequal at one or more observations. When comparing character values, PROC COMPARE displays only the first 20 characters. When you use the TRANSPOSE option, it displays only the first 12 characters. Each table shows the following:
  • the number of the observation or, if you use the ID statement, the values of the ID variables
  • the value of the variable in the base data set
  • the value of the variable in the comparison data set
  • the difference between these two values (numeric variables only)
  • the percent difference between these two values (numeric variables only)
The following output shows the Value Comparison Results for Variables.
Partial Output Showing the Value Comparison Results for Variables
You can suppress the value comparison results with the NOVALUES option. If you use both the NOVALUES and TRANSPOSE options, then PROC COMPARE lists for each observation the names of the variables with values judged unequal but does not display the values and differences.

Table of Summary Statistics

If you use the STATS, ALLSTATS, or PRINTALL option, then the Value Comparison Results for Variables section contains summary statistics for the numeric variables that are being compared. The STATS option generates these statistics for only the numeric variables whose values are judged unequal. The ALLSTATS and PRINTALL options generate these statistics for all numeric variables, even if all values are judged equal.
Note: In all cases PROC COMPARE calculates the summary statistics based on all matching observations that do not contain missing values, not just on those containing unequal values.
The output shows the following summary statistics for base data set values, comparison data set values, differences, and percent differences:
N
the number of nonmissing values.
MEAN
the mean, or average, of the values.
STD
the standard deviation.
MAX
the maximum value.
MIN
the minimum value.
MISSDIFF
the number of missing values in either a base or compare data set.
STDERR
the standard error of the mean.
T
the T ratio (MEAN/STDERR).
PROB>|T|
the probability of a greater absolute T value if the true population mean is 0.
NDIF
the number of matching observations judged unequal, and the percent of the matching observations that were judged unequal.
DIFMEANS
the difference between the mean of the base values and the mean of the comparison values. This line contains three numbers. The first is the difference expressed as a percentage of the base values mean. The second is the difference expressed as a percentage of the comparison values mean. The third is the difference in the two means (the comparison mean minus the base mean).
R
the correlation of the base and comparison values for matching observations that are nonmissing in both data sets.
RSQ
the square of the correlation of the base and comparison values for matching observations that are nonmissing in both data sets.
The following output is from the ALLSTATS option using the two data sets shown in “Overview”:
options nodate pageno=1
   linesize=80 pagesize=60;
proc compare base=proclib.one
     compare=proclib.two allstats;
   title 'Comparing Two Data Sets: Default Report';
run;
Partial Output Showing the Value Comparison Results for Variables (ALLSTATS)
Note: If you use a wide line size with PRINTALL, then PROC COMPARE prints the value comparison result for character variables next to the result for numeric variables. In that case, PROC COMPARE calculates only NDIF for the character variables.

Comparison Results for Observations (Using the TRANSPOSE Option)

The TRANSPOSE option prints the comparison results by observation instead of by variable. The comparison results precede the observation summary report. By default, the source of the values for each row of the table is indicated by the following label:
 _OBS_1=number-1  _OBS_2=number-2
where number-1 is the number of the observation in the base data set for which the value of the variable is shown, and number-2 is the number of the observation in the comparison data set.
The following output shows the differences in PROCLIB.ONE and PROCLIB.TWO by observation instead of by variable.
options nodate pageno=1
   linesize=80 pagesize=60;
proc compare base=proclib.one
     compare=proclib.two transpose;
   title 'Comparing Two Data Sets: Default Report';
run;
Partial Output Showing the Comparison Results for Observations
If you use an ID statement, then the identifying label has the following form:
ID-1=ID-value-1 ... ID-n=ID-value-n
where ID is the name of an ID variable and ID-value is the value of the ID variable.
Note: When you use the TRANSPOSE option, PROC COMPARE prints only the first 12 characters of the value.

ODS Table Names

The COMPARE procedure assigns a name to each table that it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. For more information, see SAS Output Delivery System: User's Guide.
ODS Tables Produced by the COMPARE Procedure
CompareDatasets
Description: Information about the data set or data sets.
Generated: By default, unless the NOSUMMARY or NOVALUES option is specified.
CompareDetails (Comparison Results for Observations)
Description: A listing of observations that the base data set and the compare data set do not have in common.
Generated: If the PRINTALL option is specified.
CompareDetails (ID variable notes and warnings)
Description: A listing of notes and warnings concerning duplicate ID variable values.
Generated: If the ID statement is specified and duplicate ID variable values exist in either data set.
CompareDifferences
Description: A report of variable value differences.
Generated: By default, unless the NOVALUES option is specified.
CompareSummary
Description: Summary report of observations, values, and variables with unequal values.
Generated: By default.
CompareVariables
Description: A listing of differences in variable types or attributes between the base data set and the compare data set.
Generated: By default, unless the variables are identical or the NOSUMMARY option is specified.
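As an illustration of how these table names can be used, the following sketch first restricts the printed output to one table and then routes another table to a data set. It assumes the PROCLIB.ONE and PROCLIB.TWO data sets used elsewhere in this section; the output data set name WORK.VALUE_DIFFS is hypothetical.

```sas
/* Print only the summary table and skip the other default tables. */
ods select CompareSummary;
proc compare base=proclib.one compare=proclib.two;
run;

/* Also write the value-differences table to a data set.        */
/* WORK.VALUE_DIFFS is a hypothetical name chosen for the demo. */
ods output CompareDifferences=work.value_diffs;
proc compare base=proclib.one compare=proclib.two;
run;
```

Each ODS SELECT or ODS OUTPUT statement here applies to the procedure step that follows it, which is why it is repeated before each PROC COMPARE step.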

Output Data Set (OUT=)

By default, the OUT= data set contains an observation for each pair of matching observations. The OUT= data set contains the following variables from the data sets you are comparing:
  • all variables named in the BY statement
  • all variables named in the ID statement
  • all matching variables or, if you use the VAR statement, all variables listed in the VAR statement
In addition, the data set contains two variables created by PROC COMPARE to identify the source of the values for the matching variables: _TYPE_ and _OBS_.
_TYPE_
is a character variable of length 8. Its value indicates the source of the values for the matching (or VAR) variables in that observation. (For ID and BY variables, which are not compared, the values are the values from the original data sets.) _TYPE_ has the label Type of Observation. The four possible values of this variable are as follows:
BASE
the values in this observation are from an observation in the base data set. PROC COMPARE writes this type of observation to the OUT= data set when you specify the OUTBASE option.
COMPARE
the values in this observation are from an observation in the comparison data set. PROC COMPARE writes this type of observation to the OUT= data set when you specify the OUTCOMP option.
DIF
the values in this observation are the differences between the values in the base and comparison data sets. For character variables, PROC COMPARE uses a period (.) to represent equal characters and an X to represent unequal characters. PROC COMPARE writes this type of observation to the OUT= data set by default. However, if you request any other type of observation with the OUTBASE, OUTCOMP, or OUTPERCENT option, then you must specify the OUTDIF option to generate observations of this type in the OUT= data set.
PERCENT
the values in this observation are the percent differences between the values in the base and comparison data sets. For character variables the values in observations of type PERCENT are the same as the values in observations of type DIF.
_OBS_
is a numeric variable that contains a number further identifying the source of the OUT= observations.
For observations with _TYPE_ equal to BASE, _OBS_ is the number of the observation in the base data set from which the values of the VAR variables were copied. Similarly, for observations with _TYPE_ equal to COMPARE, _OBS_ is the number of the observation in the comparison data set from which the values of the VAR variables were copied.
For observations with _TYPE_ equal to DIF or PERCENT, _OBS_ is a sequence number that counts the matching observations in the BY group.
_OBS_ has the label Observation Number.
The COMPARE procedure takes variable names and attributes for the OUT= data set from the base data set except for the lengths of ID and VAR variables, for which it uses the longer length regardless of which data set that length is from. This behavior has two important repercussions:
  • If you use the VAR and WITH statements, then the names of the variables in the OUT= data set come from the VAR statement. Thus, observations with _TYPE_ equal to BASE contain the values of the VAR variables, while observations with _TYPE_ equal to COMPARE contain the values of the WITH variables.
  • If you include a variable more than once in the VAR statement in order to compare it with more than one variable, then PROC COMPARE can include only the first comparison in the OUT= data set because each variable must have a unique name. Other comparisons produce warning messages.
For an example of the OUT= option, see Comparing Values of Observations Using an Output Data Set (OUT=).
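The interaction of the OUTBASE, OUTCOMP, and OUTDIF options can be sketched as follows. This is an illustrative example rather than the referenced one: it assumes the PROCLIB.ONE and PROCLIB.TWO data sets used elsewhere in this section, and WORK.DIFFS is a hypothetical output name.

```sas
/* Request BASE, COMPARE, and DIF observations together.           */
/* OUTDIF must be named explicitly because OUTBASE and OUTCOMP are */
/* also specified. OUTNOEQUAL drops observations judged equal.     */
proc compare base=proclib.one compare=proclib.two
             out=work.diffs outbase outcomp outdif outnoequal
             noprint;
run;

/* _TYPE_ and _OBS_ identify the source of each observation. */
proc print data=work.diffs noobs;
run;
```

With NOPRINT the procedure output is suppressed, so the OUT= data set is the only report of the comparison.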

Output Statistics Data Set (OUTSTATS=)

When you use the OUTSTATS= option, PROC COMPARE calculates the same summary statistics as the ALLSTATS option for each pair of numeric variables compared (see Table of Summary Statistics). The OUTSTATS= data set contains an observation for each summary statistic for each pair of variables. The data set also contains the BY variables used in the comparison and several variables created by PROC COMPARE:
_VAR_
is a character variable that contains the name of the variable from the base data set for which the statistic in the observation was calculated.
_WITH_
is a character variable that contains the name of the variable from the comparison data set for which the statistic in the observation was calculated. The _WITH_ variable is not included in the OUTSTATS= data set unless you use the WITH statement.
_TYPE_
is a character variable that contains the name of the statistic contained in the observation. Values of the _TYPE_ variable are N, MEAN, STD, MIN, MAX, STDERR, T, PROBT, NDIF, DIFMEANS, R, and RSQ.
_BASE_
is a numeric variable that contains the value of the statistic calculated from the values of the variable named by _VAR_ in the observations in the base data set with matching observations in the comparison data set.
_COMP_
is a numeric variable that contains the value of the statistic calculated from the values of the variable named by the _VAR_ variable (or by the _WITH_ variable if you use the WITH statement) in the observations in the comparison data set with matching observations in the base data set.
_DIF_
is a numeric variable that contains the value of the statistic calculated from the differences of the values of the variable named by the _VAR_ variable in the base data set and the matching variable (named by the _VAR_ or _WITH_ variable) in the comparison data set.
_PCTDIF_
is a numeric variable that contains the value of the statistic calculated from the percent differences of the values of the variable named by the _VAR_ variable in the base data set and the matching variable (named by the _VAR_ or _WITH_ variable) in the comparison data set.
Note: For both types of output data sets, PROC COMPARE assigns one of the following data set labels:
Comparison of base-SAS-data-set with comparison-SAS-data-set

Comparison of variables in base-SAS-data-set
Labels are limited to 40 characters.
See Creating an Output Data Set of Statistics (OUTSTATS=) for an example of an OUTSTATS= data set.
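As a brief sketch of how the OUTSTATS= variables described above can be used, the following statements write the statistics to a data set and then list one statistic per compared variable. The example assumes the PROCLIB.ONE and PROCLIB.TWO data sets used elsewhere in this section; WORK.DIFFSTAT is a hypothetical output name.

```sas
/* Write one observation per statistic per pair of numeric variables. */
proc compare base=proclib.one compare=proclib.two
             outstats=work.diffstat noprint;
run;

/* List only the NDIF observations for each compared variable. */
proc print data=work.diffstat noobs;
   where _type_ = 'NDIF';
   var _var_ _type_ _base_ _comp_ _dif_ _pctdif_;
run;
```

The _WITH_ variable is left out of the VAR statement because it exists only when the WITH statement is used.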