Sample 41018: Finding near matches for case-control data using the FIND_NEIGHBORS macro

This macro is obsolete. For current functionality see the PSMATCH procedure in SAS/STAT® software.

Contents:

Purpose / Requirements / Usage / Details

PURPOSE:

The %FIND_NEIGHBORS macro finds pairs of observations that are as similar as possible so that the pairs can be used in a case-control analysis. At least one numeric variable is required. Categorical variables can be used in addition to restrict which observations can be paired. Base SAS® and SAS/STAT® software must be installed.

REQUIREMENTS:

Base SAS and SAS/STAT software are required.

USAGE:

Follow the instructions in the Downloads tab of this sample to save the %FIND_NEIGHBORS macro definition. Before invoking the %FIND_NEIGHBORS macro, specify the following %INCLUDE statement in your SAS program or in the SAS editor window to define the macro and make it available for use. In the %INCLUDE statement, replace the text that is within quotes with the location of the %FIND_NEIGHBORS macro definition file that you save on your system.

%include "<location of your file containing the FIND_NEIGHBORS macro>";

Following that statement, you can invoke the %FIND_NEIGHBORS macro using this syntax:

%FIND_NEIGHBORS(<list of macro arguments separated by commas>)

For an example, see the Full Code tab.

HOW TO SET UP THE DATA AND INVOKE THE MACRO

To use the macro, you must create an input data set that has these properties:

A single data set contains both the cases and the controls. Use the INPUT_DATA_SET= option to specify the data set name.
A numeric variable identifies which observations are cases and which observations are controls. Specify this numeric variable using the CASE_VARIABLE= option. By default, CASE_VARIABLE=CASE. Use the CASE_VALUE= option to specify the CASE_VARIABLE value that indicates which observations are cases. By default, CASE_VALUE=1.
An ID variable has a unique value for each observation. Specify its name using the ID_VARIABLE= option. This variable must be a character variable.
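If the cases and controls are stored in separate data sets, they must be combined before the macro is called. The following is a minimal sketch under stated assumptions: the data sets CASES and CONTROLS, the output name COMBINED, and the numeric variable AGE are hypothetical, and neither input data set is assumed to already contain a variable named ID or CASE.

```sas
/* Hypothetical inputs: WORK.CASES and WORK.CONTROLS with the same variables. */
data combined;
   length id $ 8;
   set cases(in=in_case) controls;
   case = in_case;          /* 1 for observations read from CASES, 0 otherwise */
   id = put(_n_, z8.);      /* unique character ID built from the row number   */
run;

/* Assumes the %INCLUDE statement shown earlier has already defined the macro. */
%find_neighbors(
     input_data_set=combined,
     numeric_variables=age,
     id_variable=id);
```

Because the defaults CASE_VARIABLE=CASE and CASE_VALUE=1 match the variable created here, those options are not specified in the call.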

Use the NUMERIC_VARIABLES= option to list the numeric variable(s) that will be used to compute the Euclidean distance between observations. Specifying NUMERIC_VARIABLES= is required.

If there are variables that define categories that you want used for matching, then list the variables with the CATEGORICAL_VARIABLES= option. By default, no categorical variables are used.

By default, the macro will automatically standardize the NUMERIC_VARIABLES within each grouping. For example, if you are running the macro to find matches that are adjusted for categorical variables such as US state and gender (CATEGORICAL_VARIABLES=STATE GENDER), then the macro will standardize within each of the 100 (50x2) groupings. By default STANDARDIZE=Y. If you want to standardize the data in some other way prior to running the macro, then turn off the macro standardization by specifying STANDARDIZE=N. Some form of standardization is recommended. For details, see the chapters that document the clustering procedures in SAS/STAT User's Guide.
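As a sketch of the STANDARDIZE=N path, the following code standardizes AGE with PROC STDIZE (part of SAS/STAT) before calling the macro. The data set names MY_DATA and MY_DATA_STD are illustrative. Standardizing across all observations, rather than within each grouping as the macro does by default, is an assumption made only to keep the example short.

```sas
/* Replace AGE with (AGE - mean) / standard deviation over the whole data set. */
proc stdize data=my_data out=my_data_std method=std;
   var age;
run;

/* Turn off the macro's own standardization because the data are already scaled. */
%find_neighbors(
     input_data_set=my_data_std,
     numeric_variables=age,
     categorical_variables=gender race,
     id_variable=id,
     standardize=N);
```

To standardize within groupings yourself, one option is to sort by the categorical variables and add a BY statement to the PROC STDIZE step.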

The macro creates an output data set whose name is the name of the INPUT_DATA_SET= data set with '_2' appended (for example, MY_DATA becomes MY_DATA_2). The output data set contains two new variables:

_MATCHED_PAIR_ID_: if two observations have the same value of _MATCHED_PAIR_ID_, then they are considered a case-control match. A 0 indicates that the observation is a case but no matching control was found. A missing value indicates that the observation is a control that is not matched to a case.
_DISTANCE_: this is the Euclidean distance (calculated from the standardized variables unless STANDARDIZE=N is specified) between the two observations that have the same positive value of the _MATCHED_PAIR_ID_ variable. The macro does not impose a maximum distance. Researchers apply their own criterion for the largest distance that they consider acceptable, if such a value exists.
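One way to work with these two variables is to keep only the matched pairs within a chosen distance and place each case next to its control. The following is a sketch, assuming the output data set is MY_DATA_2, the defaults CASE_VARIABLE=CASE and CASE_VALUE=1 were used with controls coded 0, and the ID variable is ID with a length of 8. The cutoff of 1.9 and the names MATCHED, PAIRS, CASE_ID, and CONTROL_ID are illustrative.

```sas
%let max_distance = 1.9;   /* illustrative cutoff, not a recommendation */

/* Keep matched observations within the cutoff; put the case first in each pair. */
proc sort data=my_data_2 out=matched;
   where _matched_pair_id_ ge 1 and _distance_ le &max_distance;
   by _matched_pair_id_ descending case;
run;

/* Collapse each pair to one row: case ID, control ID, and their distance. */
data pairs(keep=_matched_pair_id_ case_id control_id _distance_);
   set matched;
   by _matched_pair_id_;
   length case_id control_id $ 8;
   retain case_id;
   if first._matched_pair_id_ then case_id = id;   /* case sorts first */
   if last._matched_pair_id_ then do;
      control_id = id;
      output;
   end;
run;
```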

USAGE TIPS

Use the DEBUG= option to control use of system options NOTES, MPRINT, and SYMBOLGEN. By default, DEBUG=NONOTES NOMPRINT NOSYMBOLGEN is used. The macro will turn the NOTES back on when it is finished.
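For troubleshooting, the log can be made more verbose. The following call is a sketch that assumes DEBUG= accepts the positive forms of the same three system options that appear in its default value; this sample states only the default.

```sas
/* Show notes and the macro-generated code in the log, but not macro variable resolution. */
%find_neighbors(
     input_data_set=my_data,
     numeric_variables=age,
     id_variable=id,
     debug=notes mprint nosymbolgen);
```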

The macro does very little error-checking. The data that is used as input must be set up as described, or the macro might not work properly.

Types of Matching

1:1 matching - this is the matching that the macro performs.
Random matching - if you want only to randomly select from the controls without considering any sort of distance, then follow these steps:
1. create a numeric variable in the data set and set it equal to a constant
2. sort the data set in random order
3. run the macro and list the created variable on the NUMERIC_VARIABLES= option. Within each grouping, the case will be matched with the first control. If the data set is sorted in random order, then the matching will be random.
1:M matching - The macro does not do this kind of matching. However, you might get satisfactory results if you duplicate each case in the data set so that each case appears M times (each copy still needs a unique ID value) and then run the macro. Afterward, subset the data set that the macro creates: remove the duplicates and assign the same _MATCHED_PAIR_ID_ value to each group of observations that you want to have matched together. This sample does not include an example.
N:M matching - The macro does not do this kind of matching. However, experimenting with the 1:M scenario and then generalizing or subsetting the results might give satisfactory results. This sample does not include an example.
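The duplication step suggested for 1:M matching could look like the following sketch. It is not part of the sample and has not been validated as a matching strategy. It assumes a data set MY_DATA with the variables ID, CASE, GENDER, RACE, and AGE; the macro variable M and the names MY_DATA_DUP, ID_DUP, and COPY are hypothetical.

```sas
%let m = 2;   /* illustrative number of controls wanted per case */

data my_data_dup;
   set my_data;
   length id_dup $ 12;
   if case = 1 then do copy = 1 to &m;
      id_dup = catx('_', id, copy);   /* for example, "01 a_1" and "01 a_2" */
      output;
   end;
   else do;
      copy = .;
      id_dup = id;                    /* controls keep their original ID */
      output;
   end;
run;

%find_neighbors(
     input_data_set=my_data_dup,
     numeric_variables=age,
     categorical_variables=gender race,
     id_variable=id_dup);
```

Because the original ID variable is kept, the copies of each case can be regrouped by ID afterward when you assign a common _MATCHED_PAIR_ID_ value.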

MISSING VALUES

Do not allow missing values for the CATEGORICAL_VARIABLES. Replace missing character-variable values with non-blank values before using the macro.
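A simple way to meet this requirement is to recode missing categories before the macro runs. The following sketch uses a small hypothetical data set HAVE with a character variable STATE and a numeric variable GENDER; the replacement codes 'ZZ' and 9 are arbitrary and are assumed not to occur in the real data.

```sas
data have;
   input id $ case state $ gender age;
   datalines;
a 1 NC 1 40
b 0 .  1 42
c 0 NC . 39
;

data have_clean;
   set have;
   if missing(state)  then state  = 'ZZ';   /* non-blank placeholder for character values */
   if missing(gender) then gender = 9;      /* placeholder code for numeric categories    */
run;
```

Recoding in this way treats "missing" as a category of its own, so such observations can be paired only with each other. Deleting those observations before running the macro is the alternative.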

Observations that have missing values for the NUMERIC_VARIABLES will be treated as described in The FASTCLUS Procedure (SAS/STAT User's Guide).

DETAILS:

Definitions

CASE - an observation for which a near-match observation is desired.

CONTROL - a non-case observation.

GROUPING - a set of values of the categorical variable(s). If there are no categorical variables specified, then the entire data set is a grouping.

The algorithm works as follows.

************* beginning of algorithm ******************

Step 1) a CASE is selected. The selection is done according to the order in the data set.

Step 2) if there are categorical variables specified, consider only the controls that have the same categorical values as the CASE. If no controls qualify, then assign the case a value of 0 for the variable _MATCHED_PAIR_ID_, do not consider it again, and go back to Step 1 to get the next CASE. If there are no categorical variables specified, then all controls are considered.

Step 3) compute the Euclidean distance between the case and each control based on the values of the standardized numeric variables. The distance is determined by standardizing the variables within the grouping of observations being considered, and then running PROC FASTCLUS using the case as the seed. For details see SAS/STAT User's Guide, the FASTCLUS Procedure.

Step 4) select a control that has the smallest Euclidean distance from the case.

Step 5) assign a non-zero value to the variable _MATCHED_PAIR_ID_. Remove the control and case from further consideration.

Step 6) go to Step 1 and repeat the process until there are no more cases to select.

Step 7) create an output data set that is a copy of the INPUT_DATA_SET= data set along with two new variables, _MATCHED_PAIR_ID_ and _DISTANCE_.

*********** end of algorithm *****************
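The greedy selection in Steps 1 through 6 can be sketched in a single DATA step. This is an illustration only and is not the macro's code: it assumes one numeric variable X, no categorical variables, no standardization, and at most 100 controls, and it computes the distance directly instead of calling PROC FASTCLUS. All data set and variable names are hypothetical.

```sas
data demo;
   input id $ case x;
   datalines;
c1 1 4
c2 1 6
k1 0 1
k2 0 5
;

data cases controls;
   set demo;
   if case = 1 then output cases;
   else output controls;
run;

data greedy_pairs(keep=case_id case_x control_id control_x distance);
   array ctl_x[100] _temporary_;          /* control values                 */
   array ctl_id[100] $ 8 _temporary_;     /* control IDs                    */
   array used[100] _temporary_;           /* 1 after a control is matched   */
   if _n_ = 1 then do i = 1 to n_ctl;     /* load all controls once         */
      set controls(keep=id x rename=(id=cid x=cx)) nobs=n_ctl point=i;
      ctl_x[i] = cx;
      ctl_id[i] = cid;
      used[i] = 0;
   end;
   set cases(keep=id x rename=(id=case_id x=case_x));   /* Step 1: next case in data order */
   best = .;
   distance = .;
   do i = 1 to n_ctl;                     /* Steps 3-4: nearest unused control */
      if used[i] = 0 then do;
         d = abs(case_x - ctl_x[i]);      /* Euclidean distance in one dimension */
         if missing(distance) or d < distance then do;
            distance = d;
            best = i;
         end;
      end;
   end;
   if not missing(best) then do;          /* Step 5: record the pair, retire the control */
      used[best] = 1;
      control_id = ctl_id[best];
      control_x = ctl_x[best];
      output;
   end;
run;

proc print data=greedy_pairs noobs;
run;
```

With these four observations, the case with X=4 is processed first and takes the control with X=5, which leaves the control with X=1 for the case with X=6. Ties go to the control that appears first in the data set.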

Note that for a given grouping, Step 1 selects the cases in the order that they appear in the data set. If there is only one case for a given set of controls, then the macro will always find a closest control. If there are multiple cases for a given set of controls, then it is possible that a different ordering of the cases will result in different matches. For example, suppose that a data set has two cases with X values 4,6 and has two controls with X values 1,5. If the order of the cases in the data set is 4 followed by 6, then 4 will match with 5, and 6 will match with 1. However, if the order of the cases in the data set is 6 followed by 4, then 6 will match with 5, and 4 will match with 1.
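The order dependence can be checked by running the macro on the same four observations with the cases in each order. This sketch assumes the macro has already been defined with %INCLUDE; the data set names ORDER_A and ORDER_B and the ID values are hypothetical.

```sas
/* Cases in the order 4, 6 */
data order_a;
   input id $ case x;
   datalines;
c4 1 4
c6 1 6
k1 0 1
k5 0 5
;

/* Same observations with the two cases in the opposite order */
data order_b;
   input id $ case x;
   datalines;
c6 1 6
c4 1 4
k1 0 1
k5 0 5
;

%find_neighbors(input_data_set=order_a, numeric_variables=x, id_variable=id);
%find_neighbors(input_data_set=order_b, numeric_variables=x, id_variable=id);

/* Compare which control shares a _MATCHED_PAIR_ID_ with each case. */
proc print data=order_a_2 noobs;
run;
proc print data=order_b_2 noobs;
run;
```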

These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.

The following example assumes that the FIND_NEIGHBORS.SAS file was saved in the following location:

C:\temp\find_neighbors.sas

The PROC LOGISTIC code is a hypothetical illustration only.

Results are given in the Output tab.

%include 'C:\temp\find_neighbors.sas';

options nodate pageno=1;

data my_data;
     input ID $ && CASE GENDER AGE RACE;
     cards;
01 a      1        1       10      1
02 b      1        2       20      2
03 c      1        1       20      3
04 d      1        2       32      1
05 e      1        1       33      2
06 f      1        2       53      3
07 g      1        1       54      1
08 h      1        2       88      2
09 i      1        1       90      3
10 j      1        2       11      1
11 k      0        1        9      2
12 l      0        2       18      3
13 m      0        1       19      1
14 n      0        2       31      2
15 p      0        1       34      3
16 q      0        2       52      1
17 r      0        1      100      2
18 s      0        2       85      3
19 t      0        1       10      1
20 u      0        2       12      2
21 v      0        1       33      3
22 w      0        2       43      1
23 x      0        1       33      2
24 y      0        2       31      3
25 z      0        1       75      1
26 a      1        2       50      4
;

title1 'example 1: finding matches based on AGE within GENDER and RACE';
%find_neighbors(
     input_data_set=my_data,
     numeric_variables=age,
     categorical_variables=gender race,
     id_variable=id);

proc sort data=my_data_2;
     by _matched_pair_id_;
     run;
title2 'all data';
proc print data=my_data_2 uniform noobs;
     run;

title2 'observations for which a match was found';
title3 '_matched_pair_id_ >= 1';
proc print data=my_data_2(where=(_matched_pair_id_ ge 1))
         uniform noobs;
     run;

title2 'matches with _distance_ <= 1.9';
proc logistic data=my_data_2 descending;
     strata _matched_pair_id_;
     model case = age;
     where _distance_ le 1.9;
     ods select ModelInfo ResponseProfile;
     run;



title1 'example 2: finding matches at random within GENDER and RACE';
data temp;
  set my_data;
  constant=1;
  s1=ranuni(34343);
  s2=ranuni(22211);
  run;
proc sort data=temp out=sorted;
  by s1;
  run;
%find_neighbors(
    input_data_set=sorted,
    numeric_variables=constant,
    categorical_variables=gender race,
    id_variable=id);
proc sort data=sorted_2;
  by _matched_pair_id_;
  run;
title2 'all data';
proc print data=sorted_2 uniform noobs;
    run;


title1 'example 3: random matches with no categorical variables';
proc sort data=temp out=sorted;
       by s1;
       run;
%find_neighbors(
   input_data_set=sorted,
    numeric_variables=constant,
    id_variable=id);
proc sort data=sorted_2;
  by _matched_pair_id_;
  run;
proc print data=sorted_2 uniform noobs;
  run;


title1 'example 4: random matches with no categorical variables';
title2 'different randomization';
proc sort data=temp out=sorted;
       by s2;
       run;
%find_neighbors(
    input_data_set=sorted,
    numeric_variables=constant,
    id_variable=id);
proc sort data=sorted_2;
  by _matched_pair_id_;
  run;
proc print data=sorted_2 uniform noobs;
  run;

     example 1: finding matches based on AGE within GENDER and RACE    1
                                all data

                                             _matched_
     ID     CASE    GENDER    AGE    RACE     pair_id_    _distance_

    11 k      0        1        9      2          .          .
    12 l      0        2       18      3          .          .
    13 m      0        1       19      1          .          .
    17 r      0        1      100      2          .          .
    18 s      0        2       85      3          .          .
    26 a      1        2       50      4          0          .
    01 a      1        1       10      1          1         0.00000
    19 t      0        1       10      1          1         0.00000
    07 g      1        1       54      1          2         0.71431
    25 z      0        1       75      1          2         0.71431
    05 e      1        1       33      2          3         0.00000
    23 x      0        1       33      2          3         0.00000
    03 c      1        1       20      3          4         0.41721
    21 v      0        1       33      3          4         0.41721
    09 i      1        1       90      3          5         1.79720
    15 p      0        1       34      3          5         1.79720
    04 d      1        2       32      1          6         0.62242
    22 w      0        2       43      1          6         0.62242
    10 j      1        2       11      1          7         2.31993
    16 q      0        2       52      1          7         2.31993
    02 b      1        2       20      2          8         0.23260
    20 u      0        2       12      2          8         0.23260
    08 h      1        2       88      2          9         1.65729
    14 n      0        2       31      2          9         1.65729
    06 f      1        2       53      3         10         0.75067
    24 y      0        2       31      3         10         0.75067
 
     example 1: finding matches based on AGE within GENDER and RACE    2
                observations for which a match was found
                         _matched_pair_id_ >= 1

                                             _matched_
     ID     CASE    GENDER    AGE    RACE     pair_id_    _distance_

    01 a      1        1       10      1          1         0.00000
    19 t      0        1       10      1          1         0.00000
    07 g      1        1       54      1          2         0.71431
    25 z      0        1       75      1          2         0.71431
    05 e      1        1       33      2          3         0.00000
    23 x      0        1       33      2          3         0.00000
    03 c      1        1       20      3          4         0.41721
    21 v      0        1       33      3          4         0.41721
    09 i      1        1       90      3          5         1.79720
    15 p      0        1       34      3          5         1.79720
    04 d      1        2       32      1          6         0.62242
    22 w      0        2       43      1          6         0.62242
    10 j      1        2       11      1          7         2.31993
    16 q      0        2       52      1          7         2.31993
    02 b      1        2       20      2          8         0.23260
    20 u      0        2       12      2          8         0.23260
    08 h      1        2       88      2          9         1.65729
    14 n      0        2       31      2          9         1.65729
    06 f      1        2       53      3         10         0.75067
    24 y      0        2       31      3         10         0.75067
 
     example 1: finding matches based on AGE within GENDER and RACE    3
                     matches with _distance_ <= 1.9

                         The LOGISTIC Procedure

                          Conditional Analysis

                           Model Information

        Data Set                           WORK.MY_DATA_2
        Response Variable                  CASE
        Number of Response Levels          2
        Number of Strata                   10
        Number of Uninformative Strata     1
        Frequency Uninformative            1
        Model                              binary logit
        Optimization Technique             Newton-Raphson ridge


                            Response Profile

                   Ordered                      Total
                     Value         CASE     Frequency

                         1            1            10
                         2            0             9

                     Probability modeled is CASE=1.
 
      example 2: finding matches at random within GENDER and RACE      4
                                all data

                                                   _matched_
 ID  CASE GENDER AGE RACE constant    s1      s2    pair_id_ _distance_

11 k   0     1     9   2      1    0.10785 0.54946      .         .
12 l   0     2    18   3      1    0.42389 0.18915      .         .
13 m   0     1    19   1      1    0.03210 0.38775      .         .
17 r   0     1   100   2      1    0.66846 0.76987      .         .
24 y   0     2    31   3      1    0.54653 0.50766      .         .
26 a   1     2    50   4      1    0.00079 0.33105      0         .
07 g   1     1    54   1      1    0.12082 0.31992      1         0
19 t   0     1    10   1      1    0.57029 0.63518      1         0
01 a   1     1    10   1      1    0.16954 0.75676      2         0
25 z   0     1    75   1      1    0.19005 0.54433      2         0
05 e   1     1    33   2      1    0.07624 0.09508      3         0
23 x   0     1    33   2      1    0.86975 0.73064      3         0
09 i   1     1    90   3      1    0.30667 0.25331      4         0
21 v   0     1    33   3      1    0.00596 0.35512      4         0
03 c   1     1    20   3      1    0.59733 0.89202      5         0
15 p   0     1    34   3      1    0.59736 0.77014      5         0
10 j   1     2    11   1      1    0.03567 0.94771      6         0
22 w   0     2    43   1      1    0.32895 0.40209      6         0
04 d   1     2    32   1      1    0.53374 0.88884      7         0
16 q   0     2    52   1      1    0.12105 0.28501      7         0
08 h   1     2    88   2      1    0.07034 0.53231      8         0
20 u   0     2    12   2      1    0.95277 0.65824      8         0
02 b   1     2    20   2      1    0.16614 0.52768      9         0
14 n   0     2    31   2      1    0.46180 0.89384      9         0
06 f   1     2    53   3      1    0.85767 0.28532     10         0
18 s   0     2    85   3      1    0.70499 0.52421     10         0
 
        example 3: random matches with no categorical variables        5

                                                   _matched_
 ID  CASE GENDER AGE RACE constant    s1      s2    pair_id_ _distance_

17 r   0     1   100   2      1    0.66846 0.76987      .         .
18 s   0     2    85   3      1    0.70499 0.52421      .         .
20 u   0     2    12   2      1    0.95277 0.65824      .         .
23 x   0     1    33   2      1    0.86975 0.73064      .         .
21 v   0     1    33   3      1    0.00596 0.35512      1         0
26 a   1     2    50   4      1    0.00079 0.33105      1         0
10 j   1     2    11   1      1    0.03567 0.94771      2         0
13 m   0     1    19   1      1    0.03210 0.38775      2         0
08 h   1     2    88   2      1    0.07034 0.53231      3         0
11 k   0     1     9   2      1    0.10785 0.54946      3         0
05 e   1     1    33   2      1    0.07624 0.09508      4         0
16 q   0     2    52   1      1    0.12105 0.28501      4         0
07 g   1     1    54   1      1    0.12082 0.31992      5         0
25 z   0     1    75   1      1    0.19005 0.54433      5         0
02 b   1     2    20   2      1    0.16614 0.52768      6         0
22 w   0     2    43   1      1    0.32895 0.40209      6         0
01 a   1     1    10   1      1    0.16954 0.75676      7         0
12 l   0     2    18   3      1    0.42389 0.18915      7         0
09 i   1     1    90   3      1    0.30667 0.25331      8         0
14 n   0     2    31   2      1    0.46180 0.89384      8         0
04 d   1     2    32   1      1    0.53374 0.88884      9         0
24 y   0     2    31   3      1    0.54653 0.50766      9         0
03 c   1     1    20   3      1    0.59733 0.89202     10         0
19 t   0     1    10   1      1    0.57029 0.63518     10         0
06 f   1     2    53   3      1    0.85767 0.28532     11         0
15 p   0     1    34   3      1    0.59736 0.77014     11         0
 
        example 4: random matches with no categorical variables        6
                        different randomization

                                                   _matched_
 ID  CASE GENDER AGE RACE constant    s1      s2    pair_id_ _distance_

14 n   0     2    31   2      1    0.46180 0.89384      .         .
15 p   0     1    34   3      1    0.59736 0.77014      .         .
17 r   0     1   100   2      1    0.66846 0.76987      .         .
23 x   0     1    33   2      1    0.86975 0.73064      .         .
05 e   1     1    33   2      1    0.07624 0.09508      1         0
12 l   0     2    18   3      1    0.42389 0.18915      1         0
09 i   1     1    90   3      1    0.30667 0.25331      2         0
16 q   0     2    52   1      1    0.12105 0.28501      2         0
06 f   1     2    53   3      1    0.85767 0.28532      3         0
21 v   0     1    33   3      1    0.00596 0.35512      3         0
07 g   1     1    54   1      1    0.12082 0.31992      4         0
13 m   0     1    19   1      1    0.03210 0.38775      4         0
22 w   0     2    43   1      1    0.32895 0.40209      5         0
26 a   1     2    50   4      1    0.00079 0.33105      5         0
02 b   1     2    20   2      1    0.16614 0.52768      6         0
24 y   0     2    31   3      1    0.54653 0.50766      6         0
08 h   1     2    88   2      1    0.07034 0.53231      7         0
18 s   0     2    85   3      1    0.70499 0.52421      7         0
01 a   1     1    10   1      1    0.16954 0.75676      8         0
25 z   0     1    75   1      1    0.19005 0.54433      8         0
04 d   1     2    32   1      1    0.53374 0.88884      9         0
11 k   0     1     9   2      1    0.10785 0.54946      9         0
03 c   1     1    20   3      1    0.59733 0.89202     10         0
19 t   0     1    10   1      1    0.57029 0.63518     10         0
10 j   1     2    11   1      1    0.03567 0.94771     11         0
20 u   0     2    12   2      1    0.95277 0.65824     11         0

Date Modified:	2017-12-01 07:51:38
Date Created:	2010-09-24 14:53:36

Product Family	Product	Host	SAS Release Starting	SAS Release Ending
SAS System	SAS/STAT	z/OS	9 TS M0
		All	n/a	n/a
		Microsoft® Windows® for 64-Bit Itanium-based Systems	9 TS M0
		Microsoft Windows Server 2003 Datacenter 64-bit Edition	9 TS M0
		Microsoft Windows Server 2003 Enterprise 64-bit Edition	9 TS M0
		Microsoft Windows 2000 Advanced Server	9 TS M0
		Microsoft Windows 2000 Datacenter Server	9 TS M0
		Microsoft Windows 2000 Server	9 TS M0
		Microsoft Windows 2000 Professional	9 TS M0
		Microsoft Windows NT Workstation	9 TS M0
		Microsoft Windows Server 2003 Datacenter Edition	9 TS M0
		Microsoft Windows Server 2003 Enterprise Edition	9 TS M0
		Microsoft Windows Server 2003 Standard Edition	9 TS M0
		Microsoft Windows XP Professional	9 TS M0
		64-bit Enabled AIX	9 TS M0
		64-bit Enabled HP-UX	9 TS M0
		64-bit Enabled Solaris	9 TS M0
		HP-UX IPF	9 TS M0
		Linux	9 TS M0
		OpenVMS Alpha	9 TS M0
		Tru64 UNIX	9 TS M0
