Contents: Purpose / Requirements / Usage / Details
The %FIND_NEIGHBORS macro finds pairs of observations that are as similar as possible so that the pairs can be used in a case-control analysis. At least one numeric variable is required; categorical variables can be used in addition.
Base SAS and SAS/STAT software are required.
Follow the instructions in the Downloads tab of this sample to save the %FIND_NEIGHBORS macro definition. Before invoking the %FIND_NEIGHBORS macro, specify the following %INCLUDE statement in your SAS program or in the SAS editor window to define the macro and make it available for use. In the %INCLUDE statement, replace the text that is within quotes with the location of the %FIND_NEIGHBORS macro definition file that you save on your system.
%include "<location of your file containing the FIND_NEIGHBORS macro>";
Following that statement, you can invoke the %FIND_NEIGHBORS macro using this syntax:
%FIND_NEIGHBORS(<list of macro arguments separated by commas>)
For an example, see the Full Code tab.
HOW TO SET UP THE DATA AND INVOKE THE MACRO
In order to use the macro, you must create an input dataset that has these properties:
Use the NUMERIC_VARIABLES= option to list the numeric variable(s) that will be used to compute the Euclidean distance between observations. Specifying NUMERIC_VARIABLES= is required.
If there are variables that define categories that you want used for matching, then list the variables with the CATEGORICAL_VARIABLES= option. By default, no categorical variables are used.
By default, the macro will automatically standardize the NUMERIC_VARIABLES within each grouping. For example, if you are running the macro to find matches that are adjusted for categorical variables such as US state and gender (CATEGORICAL_VARIABLES=STATE GENDER), then the macro will standardize within each of the 100 (50x2) groupings. By default STANDARDIZE=Y. If you want to standardize the data in some other way prior to running the macro, then turn off the macro standardization by specifying STANDARDIZE=N. Some form of standardization is recommended. For details, see the chapters that document the clustering procedures in SAS/STAT User's Guide.
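As a hedged illustration of the STANDARDIZE=N route, the following sketch standardizes AGE across the whole data set with PROC STANDARD (rather than within each grouping) and then turns off the macro's own standardization. It assumes the MY_DATA data set from the Full Code tab and that the macro has already been defined with %INCLUDE. The names MY_DATA_COPY, MY_DATA_STD, and AGE_Z are made up for this example.

```sas
/* Copy AGE so that the original values are preserved */
data my_data_copy;
  set my_data;
  age_z = age;
run;

/* Rescale AGE_Z to mean 0 and standard deviation 1 over all observations */
proc standard data=my_data_copy mean=0 std=1 out=my_data_std;
  var age_z;
run;

/* Match on the pre-standardized variable; skip the macro's standardization */
%find_neighbors(
  input_data_set=my_data_std,
  numeric_variables=age_z,
  categorical_variables=gender race,
  id_variable=id,
  standardize=N);
```

Because the input data set is MY_DATA_STD, the output data set in this sketch would be named MY_DATA_STD_2.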
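The output data set that the macro creates has the same name as the INPUT_DATA_SET= data set with '_2' appended to the end. It contains two new variables: _MATCHED_PAIR_ID_ (a non-zero value that is shared by a matched case and control, or 0 for a case that has no qualifying control) and _DISTANCE_ (the Euclidean distance between the matched case and control).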
USAGE TIPS
Use the DEBUG= option to control use of system options NOTES, MPRINT, and SYMBOLGEN. By default, DEBUG=NONOTES NOMPRINT NOSYMBOLGEN is used. The macro will turn the NOTES back on when it is finished.
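A minimal sketch of turning the diagnostic options on for one run follows. It assumes that DEBUG= accepts system option names in the same form as its default value, and that MY_DATA is the data set from the Full Code tab.

```sas
/* Show notes, generated macro code, and macro variable resolution in the log */
%find_neighbors(
  input_data_set=my_data,
  numeric_variables=age,
  categorical_variables=gender race,
  id_variable=id,
  debug=notes mprint symbolgen);

/* Only NOTES is described as being restored by the macro, */
/* so reset the other two options explicitly if desired.   */
options nomprint nosymbolgen;
```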
The macro does very little error-checking. The data that is used as input must be set up as described, or the macro might not work properly.
Types of Matching
MISSING VALUES
Do not allow missing values for the CATEGORICAL_VARIABLES. Replace missing character-variable values with non-blank values before using the macro.
Observations that have missing values for the NUMERIC_VARIABLES will be treated as described in The FASTCLUS Procedure (SAS/STAT User's Guide).
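One possible way to prepare a character categorical variable is sketched below. The data set PATIENTS, the variable STATE, and the replacement level UNKNOWN are hypothetical and chosen only for illustration. As in the Full Code example, cases are assumed to be flagged by a variable named CASE.

```sas
/* Hypothetical input: STATE is blank for the third observation */
data patients;
  input id $ case state $ age;
  datalines;
p1 1 NC 34
p2 0 NC 36
p3 1 .  51
p4 0 VA 50
;
run;

/* Check for missing levels; the MISSING option lists them as a category */
proc freq data=patients;
  tables state / missing;
run;

/* Replace blank values with an explicit non-blank level */
data patients_clean;
  set patients;
  length state_grp $ 7;
  if missing(state) then state_grp = 'UNKNOWN';
  else state_grp = state;
run;

%find_neighbors(
  input_data_set=patients_clean,
  numeric_variables=age,
  categorical_variables=state_grp,
  id_variable=id);
```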
Definitions
CASE - an observation for which a near-match observation is desired.
CONTROL - a non-case observation.
GROUPING - a set of values of the categorical variable(s). If no categorical variables are specified, then the entire data set is a grouping.
The algorithm works as follows.
************* beginning of algorithm ******************
Step 1) a CASE is selected. The selection is done according to the order in the data set.
Step 2) if there are categorical variables specified, consider only the controls that have the same categorical values as the CASE. If no controls qualify, then assign the case a value of 0 for the variable _MATCHED_PAIR_ID_, do not consider it again, go back to Step 1 to get the next CASE. If there are no categorical variables specified, then all controls are considered.
Step 3) compute the Euclidean distance between the case and each control based on the values of the standardized numeric variables. The distance is determined by standardizing the variables within the grouping of observations being considered, and then running PROC FASTCLUS using the case as the seed. For details see SAS/STAT User's Guide, the FASTCLUS Procedure.
Step 4) select a control that has the smallest Euclidean distance from the case.
Step 5) assign a non-zero value to the variable _MATCHED_PAIR_ID_. Remove the control and case from further consideration.
Step 6) go to Step 1 and repeat the process until there are no more cases to select.
Step 7) create an output data set that is a copy of the INPUT_DATA_SET= data set along with two new variables, _MATCHED_PAIR_ID_ and _DISTANCE_.
*********** end of algorithm *****************
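The greedy, order-dependent nature of Steps 1 through 6 can be sketched in a plain DATA step. This is not the macro's code: it handles one numeric variable and no categorical variables, uses the absolute difference as the distance, and skips both the standardization and PROC FASTCLUS. All data set and variable names (TOY, CASES, CONTROLS, PAIRS, and so on) are hypothetical, and the sketch assumes no more than 1000 controls.

```sas
data toy;
  input id $ case x;
  datalines;
c1 1 4
c2 1 6
k1 0 1
k2 0 5
;
run;

/* Separate cases from controls, keeping data set order */
data cases(rename=(id=case_id x=case_x))
     controls(rename=(id=ctl_id x=ctl_x));
  set toy;
  drop case;
  if case = 1 then output cases;
  else output controls;
run;

data pairs(keep=case_id case_x ctl_id ctl_x distance);
  array used[1000] _temporary_;        /* 1 = control already matched */
  set cases;                           /* Step 1: next case, in data set order */
  best = .;
  distance = .;
  do i = 1 to n_ctl;                   /* Steps 3 and 4: scan remaining controls */
    if used[i] ne 1 then do;
      set controls point=i nobs=n_ctl;
      d = abs(case_x - ctl_x);
      if best = . or d < distance then do;
        best = i;
        distance = d;
      end;
    end;
  end;
  if best ne . then do;                /* Step 5: record the pair, retire the control */
    set controls point=best;
    used[best] = 1;
  end;
  else call missing(ctl_id, ctl_x);    /* no control left for this case */
run;

proc print data=pairs noobs;
run;
```

With the four observations shown, the case with X=4 is paired with the control with X=5, which leaves only the control with X=1 for the case with X=6.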
Note that for a given grouping, Step 1 selects the cases in the order in which they appear in the data set. If there is only one case for a given set of controls, then the macro always finds a closest control. If there are multiple cases for a given set of controls, then a different ordering of the cases can result in different matches. For example, suppose that a data set has two cases with X values 4 and 6 and two controls with X values 1 and 5. If the order of the cases in the data set is 4 followed by 6, then 4 matches with 5, and 6 matches with 1. However, if the order of the cases in the data set is 6 followed by 4, then 6 matches with 5, and 4 matches with 1.
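To check the ordering effect with the macro itself, the X example can be set up as two small data sets that differ only in the order of the cases. This is a sketch under assumptions: the macro has already been defined with %INCLUDE, cases are flagged by a variable named CASE as in the Full Code example, and the data set names ORDER_A and ORDER_B are made up here.

```sas
/* Cases in the order 4, 6 */
data order_a;
  input id $ case x;
  datalines;
c1 1 4
c2 1 6
k1 0 1
k2 0 5
;
run;

/* Same observations, cases in the order 6, 4 */
data order_b;
  input id $ case x;
  datalines;
c2 1 6
c1 1 4
k1 0 1
k2 0 5
;
run;

%find_neighbors(input_data_set=order_a, numeric_variables=x, id_variable=id);
%find_neighbors(input_data_set=order_b, numeric_variables=x, id_variable=id);

/* Compare the pairings in the two output data sets */
proc print data=order_a_2 noobs;
run;
proc print data=order_b_2 noobs;
run;
```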
These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.
The following example assumes that the FIND_NEIGHBORS.SAS file was saved in the location that is named in the %INCLUDE statement at the top of the code.
The PROC LOGISTIC code is a hypothetical illustration only.
Results are given in the Output tab.
%include 'C:\temp\find_neighbors.sas';
options nodate pageno=1;
data my_data;
/* ID contains an embedded blank, so it is read with the & modifier. */
/* Two blanks must separate ID from CASE in each data line.          */
input ID $ & CASE GENDER AGE RACE;
cards;
01 a  1 1 10 1
02 b  1 2 20 2
03 c  1 1 20 3
04 d  1 2 32 1
05 e  1 1 33 2
06 f  1 2 53 3
07 g  1 1 54 1
08 h  1 2 88 2
09 i  1 1 90 3
10 j  1 2 11 1
11 k  0 1 9 2
12 l  0 2 18 3
13 m  0 1 19 1
14 n  0 2 31 2
15 p  0 1 34 3
16 q  0 2 52 1
17 r  0 1 100 2
18 s  0 2 85 3
19 t  0 1 10 1
20 u  0 2 12 2
21 v  0 1 33 3
22 w  0 2 43 1
23 x  0 1 33 2
24 y  0 2 31 3
25 z  0 1 75 1
26 a  1 2 50 4
;
title1 'example 1: finding matches based on AGE within GENDER and RACE';
%find_neighbors(
input_data_set=my_data,
numeric_variables=age,
categorical_variables=gender race,
id_variable=id);
proc sort data=my_data_2;
by _matched_pair_id_;
run;
title2 'all data';
proc print data=my_data_2 uniform noobs;
run;
title2 'observations for which a match was found';
title3 '_matched_pair_id_ >= 1';
proc print data=my_data_2(where=(_matched_pair_id_ ge 1))
uniform noobs;
run;
title2 'matches with _distance_ <= 1.9';
proc logistic data=my_data_2 descending;
strata _matched_pair_id_;
model case = age;
where _distance_ le 1.9;
ods select ModelInfo ResponseProfile;
run;
title1 'example 2: finding matches at random within GENDER and RACE';
data temp;
set my_data;
constant=1;
s1=ranuni(34343);
s2=ranuni(22211);
run;
proc sort data=temp out=sorted;
by s1;
run;
%find_neighbors(
input_data_set=sorted,
numeric_variables=constant,
categorical_variables=gender race,
id_variable=id);
proc sort data=sorted_2;
by _matched_pair_id_;
run;
title2 'all data';
proc print data=sorted_2 uniform noobs;
run;
title1 'example 3: random matches with no categorical variables';
proc sort data=temp out=sorted;
by s1;
run;
%find_neighbors(
input_data_set=sorted,
numeric_variables=constant,
id_variable=id);
proc sort data=sorted_2;
by _matched_pair_id_;
run;
proc print data=sorted_2 uniform noobs;
run;
title1 'example 4: random matches with no categorical variables';
title2 'different randomization';
proc sort data=temp out=sorted;
by s2;
run;
%find_neighbors(
input_data_set=sorted,
numeric_variables=constant,
id_variable=id);
proc sort data=sorted_2;
by _matched_pair_id_;
run;
proc print data=sorted_2 uniform noobs;
run;
example 1: finding matches based on AGE within GENDER and RACE          1
all data

                              _matched_
ID    CASE  GENDER  AGE  RACE  pair_id_  _distance_
11 k     0       1    9     2         .           .
12 l     0       2   18     3         .           .
13 m     0       1   19     1         .           .
17 r     0       1  100     2         .           .
18 s     0       2   85     3         .           .
26 a     1       2   50     4         0           .
01 a     1       1   10     1         1     0.00000
19 t     0       1   10     1         1     0.00000
07 g     1       1   54     1         2     0.71431
25 z     0       1   75     1         2     0.71431
05 e     1       1   33     2         3     0.00000
23 x     0       1   33     2         3     0.00000
03 c     1       1   20     3         4     0.41721
21 v     0       1   33     3         4     0.41721
09 i     1       1   90     3         5     1.79720
15 p     0       1   34     3         5     1.79720
04 d     1       2   32     1         6     0.62242
22 w     0       2   43     1         6     0.62242
10 j     1       2   11     1         7     2.31993
16 q     0       2   52     1         7     2.31993
02 b     1       2   20     2         8     0.23260
20 u     0       2   12     2         8     0.23260
08 h     1       2   88     2         9     1.65729
14 n     0       2   31     2         9     1.65729
06 f     1       2   53     3        10     0.75067
24 y     0       2   31     3        10     0.75067

example 1: finding matches based on AGE within GENDER and RACE          2
observations for which a match was found
_matched_pair_id_ >= 1

                              _matched_
ID    CASE  GENDER  AGE  RACE  pair_id_  _distance_
01 a     1       1   10     1         1     0.00000
19 t     0       1   10     1         1     0.00000
07 g     1       1   54     1         2     0.71431
25 z     0       1   75     1         2     0.71431
05 e     1       1   33     2         3     0.00000
23 x     0       1   33     2         3     0.00000
03 c     1       1   20     3         4     0.41721
21 v     0       1   33     3         4     0.41721
09 i     1       1   90     3         5     1.79720
15 p     0       1   34     3         5     1.79720
04 d     1       2   32     1         6     0.62242
22 w     0       2   43     1         6     0.62242
10 j     1       2   11     1         7     2.31993
16 q     0       2   52     1         7     2.31993
02 b     1       2   20     2         8     0.23260
20 u     0       2   12     2         8     0.23260
08 h     1       2   88     2         9     1.65729
14 n     0       2   31     2         9     1.65729
06 f     1       2   53     3        10     0.75067
24 y     0       2   31     3        10     0.75067

example 1: finding matches based on AGE within GENDER and RACE          3
matches with _distance_ <= 1.9

The LOGISTIC Procedure

Conditional Analysis

Model Information

Data Set                          WORK.MY_DATA_2
Response Variable                 CASE
Number of Response Levels         2
Number of Strata                  10
Number of Uninformative Strata    1
Frequency Uninformative           1
Model                             binary logit
Optimization Technique            Newton-Raphson ridge

Response Profile

Ordered              Total
  Value    CASE  Frequency
      1    1            10
      2    0             9

Probability modeled is CASE=1.

example 2: finding matches at random within GENDER and RACE          4
all data

                                                          _matched_
ID    CASE  GENDER  AGE  RACE  constant       s1       s2  pair_id_  _distance_
11 k     0       1    9     2         1  0.10785  0.54946         .           .
12 l     0       2   18     3         1  0.42389  0.18915         .           .
13 m     0       1   19     1         1  0.03210  0.38775         .           .
17 r     0       1  100     2         1  0.66846  0.76987         .           .
24 y     0       2   31     3         1  0.54653  0.50766         .           .
26 a     1       2   50     4         1  0.00079  0.33105         0           .
07 g     1       1   54     1         1  0.12082  0.31992         1           0
19 t     0       1   10     1         1  0.57029  0.63518         1           0
01 a     1       1   10     1         1  0.16954  0.75676         2           0
25 z     0       1   75     1         1  0.19005  0.54433         2           0
05 e     1       1   33     2         1  0.07624  0.09508         3           0
23 x     0       1   33     2         1  0.86975  0.73064         3           0
09 i     1       1   90     3         1  0.30667  0.25331         4           0
21 v     0       1   33     3         1  0.00596  0.35512         4           0
03 c     1       1   20     3         1  0.59733  0.89202         5           0
15 p     0       1   34     3         1  0.59736  0.77014         5           0
10 j     1       2   11     1         1  0.03567  0.94771         6           0
22 w     0       2   43     1         1  0.32895  0.40209         6           0
04 d     1       2   32     1         1  0.53374  0.88884         7           0
16 q     0       2   52     1         1  0.12105  0.28501         7           0
08 h     1       2   88     2         1  0.07034  0.53231         8           0
20 u     0       2   12     2         1  0.95277  0.65824         8           0
02 b     1       2   20     2         1  0.16614  0.52768         9           0
14 n     0       2   31     2         1  0.46180  0.89384         9           0
06 f     1       2   53     3         1  0.85767  0.28532        10           0
18 s     0       2   85     3         1  0.70499  0.52421        10           0

example 3: random matches with no categorical variables          5

                                                          _matched_
ID    CASE  GENDER  AGE  RACE  constant       s1       s2  pair_id_  _distance_
17 r     0       1  100     2         1  0.66846  0.76987         .           .
18 s     0       2   85     3         1  0.70499  0.52421         .           .
20 u     0       2   12     2         1  0.95277  0.65824         .           .
23 x     0       1   33     2         1  0.86975  0.73064         .           .
21 v     0       1   33     3         1  0.00596  0.35512         1           0
26 a     1       2   50     4         1  0.00079  0.33105         1           0
10 j     1       2   11     1         1  0.03567  0.94771         2           0
13 m     0       1   19     1         1  0.03210  0.38775         2           0
08 h     1       2   88     2         1  0.07034  0.53231         3           0
11 k     0       1    9     2         1  0.10785  0.54946         3           0
05 e     1       1   33     2         1  0.07624  0.09508         4           0
16 q     0       2   52     1         1  0.12105  0.28501         4           0
07 g     1       1   54     1         1  0.12082  0.31992         5           0
25 z     0       1   75     1         1  0.19005  0.54433         5           0
02 b     1       2   20     2         1  0.16614  0.52768         6           0
22 w     0       2   43     1         1  0.32895  0.40209         6           0
01 a     1       1   10     1         1  0.16954  0.75676         7           0
12 l     0       2   18     3         1  0.42389  0.18915         7           0
09 i     1       1   90     3         1  0.30667  0.25331         8           0
14 n     0       2   31     2         1  0.46180  0.89384         8           0
04 d     1       2   32     1         1  0.53374  0.88884         9           0
24 y     0       2   31     3         1  0.54653  0.50766         9           0
03 c     1       1   20     3         1  0.59733  0.89202        10           0
19 t     0       1   10     1         1  0.57029  0.63518        10           0
06 f     1       2   53     3         1  0.85767  0.28532        11           0
15 p     0       1   34     3         1  0.59736  0.77014        11           0

example 4: random matches with no categorical variables          6
different randomization

                                                          _matched_
ID    CASE  GENDER  AGE  RACE  constant       s1       s2  pair_id_  _distance_
14 n     0       2   31     2         1  0.46180  0.89384         .           .
15 p     0       1   34     3         1  0.59736  0.77014         .           .
17 r     0       1  100     2         1  0.66846  0.76987         .           .
23 x     0       1   33     2         1  0.86975  0.73064         .           .
05 e     1       1   33     2         1  0.07624  0.09508         1           0
12 l     0       2   18     3         1  0.42389  0.18915         1           0
09 i     1       1   90     3         1  0.30667  0.25331         2           0
16 q     0       2   52     1         1  0.12105  0.28501         2           0
06 f     1       2   53     3         1  0.85767  0.28532         3           0
21 v     0       1   33     3         1  0.00596  0.35512         3           0
07 g     1       1   54     1         1  0.12082  0.31992         4           0
13 m     0       1   19     1         1  0.03210  0.38775         4           0
22 w     0       2   43     1         1  0.32895  0.40209         5           0
26 a     1       2   50     4         1  0.00079  0.33105         5           0
02 b     1       2   20     2         1  0.16614  0.52768         6           0
24 y     0       2   31     3         1  0.54653  0.50766         6           0
08 h     1       2   88     2         1  0.07034  0.53231         7           0
18 s     0       2   85     3         1  0.70499  0.52421         7           0
01 a     1       1   10     1         1  0.16954  0.75676         8           0
25 z     0       1   75     1         1  0.19005  0.54433         8           0
04 d     1       2   32     1         1  0.53374  0.88884         9           0
11 k     0       1    9     2         1  0.10785  0.54946         9           0
03 c     1       1   20     3         1  0.59733  0.89202        10           0
19 t     0       1   10     1         1  0.57029  0.63518        10           0
10 j     1       2   11     1         1  0.03567  0.94771        11           0
20 u     0       2   12     2         1  0.95277  0.65824        11           0
Right-click on the link below to save the %FIND_NEIGHBORS macro definition to a file. Name the file find_neighbors.sas.
find_neighbors.sas
The location where you save the file is the location that you specify in the %INCLUDE statement (see the Usage section under the Details tab).
Type: Sample
Date Modified: 2017-12-01 07:51:38
Date Created: 2010-09-24 14:53:36

Product Family | Product | Host | SAS Release (Starting) | SAS Release (Ending)
SAS System | SAS/STAT | z/OS | 9 TS M0 |
 | | All | n/a | n/a
 | | Microsoft® Windows® for 64-Bit Itanium-based Systems | 9 TS M0 |
 | | Microsoft Windows Server 2003 Datacenter 64-bit Edition | 9 TS M0 |
 | | Microsoft Windows Server 2003 Enterprise 64-bit Edition | 9 TS M0 |
 | | Microsoft Windows 2000 Advanced Server | 9 TS M0 |
 | | Microsoft Windows 2000 Datacenter Server | 9 TS M0 |
 | | Microsoft Windows 2000 Server | 9 TS M0 |
 | | Microsoft Windows 2000 Professional | 9 TS M0 |
 | | Microsoft Windows NT Workstation | 9 TS M0 |
 | | Microsoft Windows Server 2003 Datacenter Edition | 9 TS M0 |
 | | Microsoft Windows Server 2003 Enterprise Edition | 9 TS M0 |
 | | Microsoft Windows Server 2003 Standard Edition | 9 TS M0 |
 | | Microsoft Windows XP Professional | 9 TS M0 |
 | | 64-bit Enabled AIX | 9 TS M0 |
 | | 64-bit Enabled HP-UX | 9 TS M0 |
 | | 64-bit Enabled Solaris | 9 TS M0 |
 | | HP-UX IPF | 9 TS M0 |
 | | Linux | 9 TS M0 |
 | | OpenVMS Alpha | 9 TS M0 |
 | | Tru64 UNIX | 9 TS M0 |