The COMPARE Procedure
Results Reporting
PROC COMPARE reports the results of its comparisons in the following ways:
the SAS log
return codes stored in the automatic macro variable SYSINFO
procedure output
output data sets.
SAS Log
When you use the WARNING, PRINTALL, or ERROR option, PROC COMPARE writes a description of the differences to the SAS log.
Macro Return Codes (SYSINFO)
PROC COMPARE stores a return code in the automatic macro variable SYSINFO. The value of the return code provides information about the result of the comparison. By checking the value of SYSINFO after PROC COMPARE has run and before any other step begins, SAS macros can use the results of a PROC COMPARE step to determine what action to take or what parts of a SAS program to execute.
The following table is a key for interpreting the SYSINFO return code from PROC COMPARE. For each of the conditions listed, the associated value is added to the return code if the condition is true. Thus, the SYSINFO return code is the sum of the codes listed in the following table for the applicable conditions:
Bit | Condition | Code | Hex | Description |
---|---|---|---|---|
1 | DSLABEL | 1 | 0001X | Data set labels differ |
2 | DSTYPE | 2 | 0002X | Data set types differ |
3 | INFORMAT | 4 | 0004X | Variable has different informat |
4 | FORMAT | 8 | 0008X | Variable has different format |
5 | LENGTH | 16 | 0010X | Variable has different length |
6 | LABEL | 32 | 0020X | Variable has different label |
7 | BASEOBS | 64 | 0040X | Base data set has observation not in comparison |
8 | COMPOBS | 128 | 0080X | Comparison data set has observation not in base |
9 | BASEBY | 256 | 0100X | Base data set has BY group not in comparison |
10 | COMPBY | 512 | 0200X | Comparison data set has BY group not in base |
11 | BASEVAR | 1024 | 0400X | Base data set has variable not in comparison |
12 | COMPVAR | 2048 | 0800X | Comparison data set has variable not in base |
13 | VALUE | 4096 | 1000X | A value comparison was unequal |
14 | TYPE | 8192 | 2000X | Conflicting variable types |
15 | BYVAR | 16384 | 4000X | BY variables do not match |
16 | ERROR | 32768 | 8000X | Fatal error: comparison not done |
These codes are ordered and scaled to enable a simple check of the degree to which the data sets differ. For example, if you want to check that two data sets contain the same variables, observations, and values, but you do not care about differences in labels, formats, and so on, then use the following statements:
    proc compare base=SAS-data-set
                 compare=SAS-data-set;
    run;
    %if &sysinfo >= 64 %then
       %do;
          handle error;
       %end;
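The threshold of 64 works because the six attribute codes below it (1 through 32) sum to 63, so a return code of 64 or more always means that at least one observation, BY group, variable, value, or type condition was set. The following Python sketch only illustrates that arithmetic and is not SAS code; the helper name `differs_beyond_attributes` is hypothetical.

```python
# Codes from the SYSINFO table for conditions that concern attributes only
# (data set label, data set type, informat, format, length, label).
ATTRIBUTE_CODES = [1, 2, 4, 8, 16, 32]


def differs_beyond_attributes(sysinfo):
    """Mirror of the `&sysinfo >= 64` test (hypothetical helper name)."""
    return sysinfo >= 64


# Largest return code that attribute-only differences can produce.
max_attribute_only = sum(ATTRIBUTE_CODES)
print(max_attribute_only)                     # 63

print(differs_beyond_attributes(8 + 32))      # FORMAT + LABEL only -> False
print(differs_beyond_attributes(8 + 4096))    # FORMAT + VALUE      -> True
```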
You can examine individual bits in the SYSINFO value by using DATA step bit-testing features to check for specific conditions. For example, to check for the presence of observations in the base data set that are not in the comparison data set, use the following statements:
    proc compare base=SAS-data-set
                 compare=SAS-data-set;
    run;
    %let rc=&sysinfo;
    data _null_;
       if &rc='1......'b then
          put 'Observations in Base but not in Comparison Data Set';
    run;
PROC COMPARE must run before you check SYSINFO, and you must obtain the SYSINFO value before another SAS step starts, because every SAS step resets SYSINFO.
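The same decoding can be sketched outside SAS. The Python example below is only an illustration of the arithmetic, not part of PROC COMPARE: the names `SYSINFO_CONDITIONS`, `decode_sysinfo`, and `has_condition`, and the sample return code 4232, are assumptions made for this sketch. It splits a return code into the condition names from the SYSINFO table and tests a single condition with a bitwise AND.

```python
# (code, condition) pairs copied from the SYSINFO return code table.
SYSINFO_CONDITIONS = [
    (1, "DSLABEL"), (2, "DSTYPE"), (4, "INFORMAT"), (8, "FORMAT"),
    (16, "LENGTH"), (32, "LABEL"), (64, "BASEOBS"), (128, "COMPOBS"),
    (256, "BASEBY"), (512, "COMPBY"), (1024, "BASEVAR"), (2048, "COMPVAR"),
    (4096, "VALUE"), (8192, "TYPE"), (16384, "BYVAR"), (32768, "ERROR"),
]


def decode_sysinfo(rc):
    """Return the names of all conditions whose code is set in rc."""
    return [name for code, name in SYSINFO_CONDITIONS if rc & code]


def has_condition(rc, name):
    """Test one named condition, like the DATA step bit test for BASEOBS."""
    code = {n: c for c, n in SYSINFO_CONDITIONS}[name]
    return (rc & code) != 0


# Hypothetical return code: FORMAT (8) + COMPOBS (128) + VALUE (4096).
rc = 8 + 128 + 4096
print(rc)                            # 4232
print(decode_sysinfo(rc))            # ['FORMAT', 'COMPOBS', 'VALUE']
print(has_condition(rc, "BASEOBS"))  # False
print(has_condition(rc, "COMPOBS"))  # True
```

Because each condition owns exactly one bit, the sum in the return code can always be split back into its parts without ambiguity.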
Procedure Output
The following sections show and describe the default output of the two data sets shown in Overview: COMPARE Procedure. Because PROC COMPARE produces lengthy output, the output is presented in seven pieces.
The Data Set Summary report lists the attributes of the data sets that are being compared. These attributes include the following:
the data set names
the data set types, if any
the data set labels, if any
the dates created and last modified
the number of variables in each data set
the number of observations in each data set.
The following output shows the Data Set Summary.
                          COMPARE Procedure
             Comparison of PROCLIB.ONE with PROCLIB.TWO
                           (Method=EXACT)

                          Data Set Summary

    Dataset               Created          Modified  NVar    NObs  Label

    PROCLIB.ONE  11SEP97:15:11:07  11SEP97:15:11:09     5       4  First Data Set
    PROCLIB.TWO  11SEP97:15:11:10  11SEP97:15:11:10     6       5  Second Data Set
The Variables Summary report compares the variables in the two data sets. The first part of the report lists the following:
the number of variables the data sets have in common
the number of variables in the base data set that are not in the comparison data set and vice versa
the number of variables in both data sets that have different types
the number of variables that differ on other attributes (length, label, format, or informat)
the number of BY, ID, VAR, and WITH variables specified for the comparison.
The second part of the report lists matching variables with different attributes and shows how the attributes differ. (The COMPARE procedure omits variable labels if the line size is too small for them.)
The following output shows the Variables Summary.
                          Variables Summary

    Number of Variables in Common: 5.
    Number of Variables in PROCLIB.TWO but not in PROCLIB.ONE: 1.
    Number of Variables with Conflicting Types: 1.
    Number of Variables with Differing Attributes: 3.

         Listing of Common Variables with Conflicting Types

    Variable  Dataset      Type  Length

    student   PROCLIB.ONE  Num        8
              PROCLIB.TWO  Char       8

        Listing of Common Variables with Differing Attributes

    Variable  Dataset      Type  Length  Format  Label

    year      PROCLIB.ONE  Char       8          Year of Birth
              PROCLIB.TWO  Char       8
    state     PROCLIB.ONE  Char       8
              PROCLIB.TWO  Char       8          Home State
    gr1       PROCLIB.ONE  Num        8  4.1
              PROCLIB.TWO  Num        8  5.2
The Observation Summary report provides information about observations in the base and comparison data sets. First, the report identifies the first and last observation in each data set, the first and last matching observations, and the first and last different observations. Then, the report lists the following:
the number of observations that the data sets have in common
the number of observations in the base data set that are not in the comparison data set and vice versa
the total number of observations in each data set
the number of matching observations for which PROC COMPARE judged some variables unequal
the number of matching observations for which PROC COMPARE judged all variables equal.
The following output shows the Observation Summary.
                         Observation Summary

    Observation      Base  Compare

    First Obs           1        1
    First Unequal       1        1
    Last Unequal        4        4
    Last Match          4        4
    Last Obs            .        5

    Number of Observations in Common: 4.
    Number of Observations in PROCLIB.TWO but not in PROCLIB.ONE: 1.
    Total Number of Observations Read from PROCLIB.ONE: 4.
    Total Number of Observations Read from PROCLIB.TWO: 5.

    Number of Observations with Some Compared Variables Unequal: 4.
    Number of Observations with All Compared Variables Equal: 0.
The Values Comparison Summary report first lists the following:
the number of variables compared with all observations equal
the number of variables compared with some observations unequal
the number of variables with differences involving missing values, if any
the total number of values judged unequal
the maximum difference measure between unequal values for all pairs of matching variables (for differences not involving missing values).
In addition, for the variables for which some matching observations have unequal values, the report lists
the name of the variable
other variable attributes
the number of times PROC COMPARE judged the variable unequal
the maximum difference measure found between values (for differences not involving missing values)
the number of differences caused by comparison with missing values, if any.
The following output shows the Values Comparison Summary.
                      Values Comparison Summary

    Number of Variables Compared with All Observations Equal: 1.
    Number of Variables Compared with Some Observations Unequal: 3.
    Total Number of Values which Compare Unequal: 6.
    Maximum Difference: 20.

                    Variables with Unequal Values

    Variable  Type  Len  Compare Label  Ndif   MaxDif

    state     CHAR    8  Home State        2
    gr1       NUM     8                    2    1.000
    gr2       NUM     8                    2   20.000
The Value Comparison Results report consists of a table for each pair of matching variables judged unequal at one or more observations. When comparing character values, PROC COMPARE displays only the first 20 characters. When you use the TRANSPOSE option, it displays only the first 12 characters. Each table shows the following:
the number of the observation or, if you use the ID statement, the values of the ID variables
the value of the variable in the base data set
the value of the variable in the comparison data set
the difference between these two values (numeric variables only)
the percent difference between these two values (numeric variables only).
The following output shows the Value Comparison Results for Variables.
              Value Comparison Results for Variables

    __________________________________________________________
               ||  Home State
               ||  Base Value           Compare Value
          Obs  ||  state                 state
     ________  ||  ________              ________
               ||
            2  ||  MD                    MA
            4  ||  MA                    MD
    __________________________________________________________

    __________________________________________________________
               ||       Base    Compare
          Obs  ||        gr1        gr1      Diff.     % Diff
     ________  ||  _________  _________  _________  _________
               ||
            1  ||       85.0      84.00    -1.0000    -1.1765
            3  ||       78.0      79.00     1.0000     1.2821
    __________________________________________________________

    __________________________________________________________
               ||       Base    Compare
          Obs  ||        gr2        gr2      Diff.     % Diff
     ________  ||  _________  _________  _________  _________
               ||
            3  ||    72.0000    73.0000     1.0000     1.3889
            4  ||    94.0000    74.0000   -20.0000   -21.2766
    __________________________________________________________
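The Diff. and % Diff columns in this listing are consistent with taking the difference as the comparison value minus the base value and expressing it as a percentage of the base value. The Python sketch below reproduces the gr1 and gr2 rows under that assumption; it is an illustration only, and the function names `diff` and `percent_diff` are hypothetical.

```python
def diff(base, compare):
    """Difference as shown in the Diff. column: compare minus base."""
    return compare - base


def percent_diff(base, compare):
    """Difference expressed as a percentage of the base value."""
    return 100.0 * (compare - base) / base


# (variable, observation, base value, comparison value) from the listing.
rows = [
    ("gr1", 1, 85.0, 84.0),
    ("gr1", 3, 78.0, 79.0),
    ("gr2", 3, 72.0, 73.0),
    ("gr2", 4, 94.0, 74.0),
]

for name, obs, base, compare in rows:
    print(name, obs, f"{diff(base, compare):.4f}",
          f"{percent_diff(base, compare):.4f}")
# gr1 1 -1.0000 -1.1765
# gr1 3 1.0000 1.2821
# gr2 3 1.0000 1.3889
# gr2 4 -20.0000 -21.2766
```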
You can suppress the value comparison results with the NOVALUES option. If you use both the NOVALUES and TRANSPOSE options, then PROC COMPARE lists for each observation the names of the variables with values judged unequal but does not display the values and differences.
If you use the STATS, ALLSTATS, or PRINTALL option, then the Value Comparison Results for Variables section contains summary statistics for the numeric variables that are being compared. The STATS option generates these statistics for only the numeric variables whose values are judged unequal. The ALLSTATS and PRINTALL options generate these statistics for all numeric variables, even if all values are judged equal.
Note: In all cases PROC COMPARE calculates the summary statistics based on all matching observations that do not contain missing values, not just on those containing unequal values.
With these options, the output reports the following summary statistics for base data set values, comparison data set values, differences, and percent differences:
the number of nonmissing values
the mean, or average, of the values
the standard deviation
the maximum value
the minimum value
the number of missing values in either a base or compare data set
the standard error of the mean
the T ratio (MEAN/STDERR)
the probability of a greater absolute T value if the true population mean is 0.
the number of matching observations judged unequal, and the percent of the matching observations that were judged unequal.
the difference between the mean of the base values and the mean of the comparison values. This line contains three numbers. The first is the difference in the two means expressed as a percentage of the base values mean. The second is the difference in the two means expressed as a percentage of the comparison values mean. The third is the difference in the two means (the comparison mean minus the base mean).
the correlation of the base and comparison values for matching observations that are nonmissing in both data sets.
the square of the correlation of the base and comparison values for matching observations that are nonmissing in both data sets.
The following output is from the ALLSTATS option using the two data sets shown in "Overview":
              Value Comparison Results for Variables

    __________________________________________________________
               ||       Base    Compare
          Obs  ||        gr1        gr1      Diff.     % Diff
     ________  ||  _________  _________  _________  _________
               ||
            1  ||       85.0      84.00    -1.0000    -1.1765
            3  ||       78.0      79.00     1.0000     1.2821
     ________  ||  _________  _________  _________  _________
               ||
            N  ||          4          4          4          4
         Mean  ||    85.5000    85.5000          0     0.0264
          Std  ||     5.8023     5.4467     0.8165     1.0042
          Max  ||    92.0000    92.0000     1.0000     1.2821
          Min  ||    78.0000    79.0000    -1.0000    -1.1765
       StdErr  ||     2.9011     2.7234     0.4082     0.5021
            t  ||    29.4711    31.3951     0.0000     0.0526
     Prob>|t|  ||     <.0001     <.0001     1.0000     0.9614
               ||
         Ndif  ||          2    50.000%
     DifMeans  ||     0.000%     0.000%          0
       r, rsq  ||      0.991      0.983
    __________________________________________________________

    __________________________________________________________
               ||       Base    Compare
          Obs  ||        gr2        gr2      Diff.     % Diff
     ________  ||  _________  _________  _________  _________
               ||
            3  ||    72.0000    73.0000     1.0000     1.3889
            4  ||    94.0000    74.0000   -20.0000   -21.2766
     ________  ||  _________  _________  _________  _________
               ||
            N  ||          4          4          4          4
         Mean  ||    86.2500    81.5000    -4.7500    -4.9719
          Std  ||     9.9457     9.4692    10.1776    10.8895
          Max  ||    94.0000    92.0000     1.0000     1.3889
          Min  ||    72.0000    73.0000   -20.0000   -21.2766
       StdErr  ||     4.9728     4.7346     5.0888     5.4447
            t  ||    17.3442    17.2136    -0.9334    -0.9132
     Prob>|t|  ||     0.0004     0.0004     0.4195     0.4285
               ||
         Ndif  ||          2    50.000%
     DifMeans  ||    -5.507%    -5.828%    -4.7500
       r, rsq  ||      0.451      0.204
    __________________________________________________________
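Several rows of this listing can be cross-checked from the others. The Python sketch below starts from the rounded N, Mean, and Std values printed for gr2 and recomputes StdErr, t, and the three DifMeans numbers. It assumes the usual definition of the standard error (standard deviation divided by the square root of N), which agrees with the listing; the function name `derived_stats` is hypothetical, and small differences in the last digit can occur because the inputs are already rounded.

```python
import math


def derived_stats(n, base_mean, base_std, comp_mean):
    """Recompute StdErr, t, and DifMeans from printed summary values."""
    stderr = base_std / math.sqrt(n)          # standard error of the base mean
    t_ratio = base_mean / stderr              # T ratio (MEAN/STDERR)
    dif_means = comp_mean - base_mean         # comparison mean minus base mean
    return {
        "stderr": stderr,
        "t": t_ratio,
        "pct_of_base": 100.0 * dif_means / base_mean,
        "pct_of_comp": 100.0 * dif_means / comp_mean,
        "dif_means": dif_means,
    }


# Values printed for gr2 in the ALLSTATS listing.
gr2 = derived_stats(n=4, base_mean=86.25, base_std=9.9457, comp_mean=81.5)

print(f"{gr2['t']:.4f}")             # 17.3442
print(f"{gr2['pct_of_base']:.3f}%")  # -5.507%
print(f"{gr2['pct_of_comp']:.3f}%")  # -5.828%
print(f"{gr2['dif_means']:.4f}")     # -4.7500
```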
Note: If you use a wide line size with PRINTALL, then PROC COMPARE prints the value comparison result for character variables next to the result for numeric variables. In that case, PROC COMPARE calculates only NDIF for the character variables.
The TRANSPOSE option prints the comparison results by observation instead of by variable. The comparison results precede the observation summary report. By default, the source of the values for each row of the table is indicated by the following label:
_OBS_1=number-1 _OBS_2=number-2
where number-1 is the number of the observation in the base data set for which the value of the variable is shown, and number-2 is the number of the observation in the comparison data set.
The following output shows the differences in PROCLIB.ONE and PROCLIB.TWO by observation instead of by variable.
                Comparison Results for Observations

    _OBS_1=1 _OBS_2=1:
      Variable  Base Value     Compare       Diff.      % Diff
      gr1             85.0       84.00   -1.000000   -1.176471

    _OBS_1=2 _OBS_2=2:
      Variable  Base Value            Compare
      state     MD                    MA

    _OBS_1=3 _OBS_2=3:
      Variable  Base Value     Compare       Diff.      % Diff
      gr1             78.0       79.00    1.000000    1.282051
      gr2        72.000000   73.000000    1.000000    1.388889

    _OBS_1=4 _OBS_2=4:
      Variable  Base Value     Compare       Diff.      % Diff
      gr2        94.000000   74.000000  -20.000000  -21.276596
      state     MA                    MD
If you use an ID statement, then the identifying label has the following form:
ID-1=ID-value-1 ... ID-n=ID-value-n
where ID is the name of an ID variable and ID-value is the value of the ID variable.
Note: When you use the TRANSPOSE option, PROC COMPARE prints only the first 12 characters of the value.
ODS Table Names
The COMPARE procedure assigns a name to each table that it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. For more information, see SAS Output Delivery System: User's Guide.
Table Name | Description | Generated... |
---|---|---|
CompareDatasets | Information about the data set or data sets | by default, unless NOSUMMARY or NOVALUES option is specified |
CompareDetails (Comparison Results for Observations) | A listing of observations that the base data set and the compare data set do not have in common | if PRINTALL option is specified |
CompareDetails (ID variable notes and warnings) | A listing of notes and warnings concerning duplicate ID variable values | if ID statement is specified and duplicate ID variable values exist in either data set |
CompareDifferences | A report of variable value differences | by default unless NOVALUES option is specified |
CompareSummary | Summary report of observations, values, and variables with unequal values | by default |
CompareVariables | A listing of differences in variable types or attributes between the base data set and the compare data set | by default, unless the variables are identical or the NOSUMMARY option is specified |
Output Data Set (OUT=)
By default, the OUT= data set contains an observation for each pair of matching observations. The OUT= data set contains the following variables from the data sets you are comparing:
all variables named in the BY statement
all variables named in the ID statement
all matching variables or, if you use the VAR statement, all variables listed in the VAR statement.
In addition, the data set contains two variables created by PROC COMPARE to identify the source of the values for the matching variables: _TYPE_ and _OBS_.
_TYPE_ is a character variable of length 8. Its value indicates the source of the values for the matching (or VAR) variables in that observation. (For ID and BY variables, which are not compared, the values are the values from the original data sets.) _TYPE_ has the label Type of Observation. The four possible values of this variable are as follows:
BASE: The values in this observation are from an observation in the base data set. PROC COMPARE writes this type of observation to the OUT= data set when you specify the OUTBASE option.
COMPARE: The values in this observation are from an observation in the comparison data set. PROC COMPARE writes this type of observation to the OUT= data set when you specify the OUTCOMP option.
DIF: The values in this observation are the differences between the values in the base and comparison data sets. For character variables, PROC COMPARE uses a period (.) to represent equal characters and an X to represent unequal characters. PROC COMPARE writes this type of observation to the OUT= data set by default. However, if you request any other type of observation with the OUTBASE, OUTCOMP, or OUTPERCENT option, then you must specify the OUTDIF option to generate observations of this type in the OUT= data set.
PERCENT: The values in this observation are the percent differences between the values in the base and comparison data sets. For character variables, the values in observations of type PERCENT are the same as the values in observations of type DIF.
_OBS_ is a numeric variable that contains a number further identifying the source of the OUT= observations.
For observations with _TYPE_ equal to BASE, _OBS_ is the number of the observation in the base data set from which the values of the VAR variables were copied. Similarly, for observations with _TYPE_ equal to COMPARE, _OBS_ is the number of the observation in the comparison data set from which the values of the VAR variables were copied.
For observations with _TYPE_ equal to DIF or PERCENT, _OBS_ is a sequence number that counts the matching observations in the BY group.
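The period and X notation used for character variables in DIF observations can be pictured with a short Python sketch. This is an illustration only: the function name `char_dif_mask` is hypothetical, and the sketch assumes that both values are blank-padded to the variable length before they are compared position by position.

```python
def char_dif_mask(base, compare, length):
    """Build a '.'/'X' mask: '.' where characters match, 'X' where they differ.

    Assumes both values are blank-padded (or cut) to `length` characters.
    """
    b = base.ljust(length)[:length]
    c = compare.ljust(length)[:length]
    return "".join("." if x == y else "X" for x, y in zip(b, c))


# state in observation 2 of the sample data: MD in base, MA in comparison,
# stored in a character variable of length 8.
print(char_dif_mask("MD", "MA", 8))  # .X......
print(char_dif_mask("NC", "NC", 8))  # ........
```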
The COMPARE procedure takes variable names and attributes for the OUT= data set from the base data set except for the lengths of ID and VAR variables, for which it uses the longer length regardless of which data set that length is from. This behavior has two important repercussions:
If you use the VAR and WITH statements, then the names of the variables in the OUT= data set come from the VAR statement. Thus, observations with _TYPE_ equal to BASE contain the values of the VAR variables, while observations with _TYPE_ equal to COMPARE contain the values of the WITH variables.
If you include a variable more than once in the VAR statement in order to compare it with more than one variable, then PROC COMPARE can include only the first comparison in the OUT= data set because each variable must have a unique name. Other comparisons produce warning messages.
Output Statistics Data Set (OUTSTATS=)
When you use the OUTSTATS= option, PROC COMPARE calculates the same summary statistics as the ALLSTATS option for each pair of numeric variables compared (see Table of Summary Statistics). The OUTSTATS= data set contains an observation for each summary statistic for each pair of variables. The data set also contains the BY variables used in the comparison and several variables created by PROC COMPARE:
_VAR_ is a character variable that contains the name of the variable from the base data set for which the statistic in the observation was calculated.
_WITH_ is a character variable that contains the name of the variable from the comparison data set for which the statistic in the observation was calculated. The _WITH_ variable is not included in the OUTSTATS= data set unless you use the WITH statement.
_TYPE_ is a character variable that contains the name of the statistic contained in the observation. Values of the _TYPE_ variable are N, MEAN, STD, MIN, MAX, STDERR, T, PROBT, NDIF, DIFMEANS, and R, RSQ.
_BASE_ is a numeric variable that contains the value of the statistic calculated from the values of the variable named by _VAR_ in the observations in the base data set with matching observations in the comparison data set.
_COMP_ is a numeric variable that contains the value of the statistic calculated from the values of the variable named by the _VAR_ variable (or by the _WITH_ variable if you use the WITH statement) in the observations in the comparison data set with matching observations in the base data set.
_DIF_ is a numeric variable that contains the value of the statistic calculated from the differences of the values of the variable named by the _VAR_ variable in the base data set and the matching variable (named by the _VAR_ or _WITH_ variable) in the comparison data set.
_PCTDIF_ is a numeric variable that contains the value of the statistic calculated from the percent differences of the values of the variable named by the _VAR_ variable in the base data set and the matching variable (named by the _VAR_ or _WITH_ variable) in the comparison data set.
Note: For both types of output data sets, PROC COMPARE assigns one of the following data set labels:
    Comparison of base-SAS-data-set with comparison-SAS-data-set
    Comparison of variables in base-SAS-data-set
Labels are limited to 40 characters.
See Creating an Output Data Set of Statistics (OUTSTATS=) for an example of an OUTSTATS= data set.
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.