When you use observational data to compare the effects of the levels of some primary variable (such as treatments), the treated groups may differ on the response not because of differences in the treatments' effects (or not only because of them), but because the groups differ on other variables (such as age or gender). In designed studies, such biasing effects can be controlled by randomly assigning treatments to subjects. Randomization balances the effects of the secondary variables across the treatment groups and so prevents this bias.
One way to balance the effects of secondary variables is to create sets of subjects that are matched on these variables — that is, that have similar values on the variables. The task of creating these matched sets can be simplified by use of propensity scores as a proxy for the set of secondary variables.
Propensity scores are the predicted probabilities from a logistic model which models the probabilities of being in the various levels of the predictor of primary interest as a function of a set of secondary variables. For example, suppose you are primarily interested in the effect of smoking on the probability of lung cancer and you want to control for the effects of age, family history, and quality of diet. You obtain propensity scores by fitting a logistic model that estimates the probability of smoking given the secondary variables and extracting the predicted probabilities:
proc logistic;
   class famhistory / param=ref;
   model smoke(event="1") = age famhistory diet;
   output out=preds predprobs=individual;
run;
The variable IP_1 in the PREDS data set, created by the PREDPROBS=INDIVIDUAL option, contains the predicted probability of smoking for each subject. These are the propensity scores. You could then match each smoker to a nonsmoker with a similar propensity score. If the primary predictor has more than two levels, use the LINK=GLOGIT option in the MODEL statement to fit a nominal logistic model. The PREDS data set would then contain one variable for each level, holding the predicted probability of that level.
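For the multilevel case, a minimal sketch might look like the following. The data set name `mydata` and the three-level variable `smokelevel` (for example never, former, current) are assumptions made for illustration and are not part of the original example.

```sas
/* Hypothetical names: mydata and smokelevel are illustrative only */
proc logistic data=mydata;
   class famhistory / param=ref;
   /* LINK=GLOGIT fits a generalized logit (nominal) model */
   model smokelevel = age famhistory diet / link=glogit;
   /* PREDPROBS=INDIVIDUAL writes one IP_<level> variable per response level */
   output out=preds predprobs=individual;
run;
```

Each subject then has a set of predicted probabilities, one per level, rather than a single score.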
Matching subjects on their propensity scores also matches them, approximately, on the secondary variables. As a result, there will be little or no difference between the levels of the primary variable with respect to the secondary variables in the matched data. That is, the matched smokers and nonsmokers in the example above should differ little, if at all, on age, family history, or quality of diet. This allows you to omit the secondary variables from the model for lung cancer.
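One way to check this balance is to compare the secondary variables across smokers and nonsmokers in the matched data. The sketch below assumes a hypothetical data set named `matched` that holds only the matched subjects and still carries the variables from the example above.

```sas
/* matched: hypothetical data set containing only the matched subjects */

/* Continuous secondary variables and the propensity score, by smoking status */
proc means data=matched n mean std;
   class smoke;
   var age diet IP_1;
run;

/* Categorical secondary variable, by smoking status (row percentages only) */
proc freq data=matched;
   tables smoke*famhistory / nopercent nocol;
run;
```

Similar means and row percentages in the two groups suggest that the matching balanced the secondary variables; large differences suggest that tighter matching is needed.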
While there is no procedure or macro available from SAS Institute specifically designed to match observations using propensity scores, several papers presented at SAS Global Forum (formerly SUGI) discuss this, and some of them present macros. You can search for papers on this topic at the SAS Global Forum Online Proceedings site. Also, see the matching macros available from the Mayo Clinic. Note that these macros are not supported by SAS Institute.
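For small data sets, the idea can also be sketched directly. The code below is an illustrative greedy 1:1 nearest-neighbor match with a caliper, not one of the macros mentioned above. It pairs each smoker with a nonsmoker, which is the pairing that balances the secondary variables between smokers and nonsmokers. It assumes that PREDS contains a unique subject identifier named `id` (hypothetical), that `smoke` is numeric 0/1, and that a caliper of 0.05 is acceptable; all three are assumptions.

```sas
%let caliper = 0.05;   /* assumed maximum allowed difference in propensity score */

/* All smoker/nonsmoker pairs within the caliper, closest pairs first */
proc sql;
   create table pairs as
   select t.id as smoker_id,
          c.id as nonsmoker_id,
          abs(t.IP_1 - c.IP_1) as dist
   from preds as t, preds as c
   where t.smoke = 1 and c.smoke = 0
     and calculated dist <= &caliper
   order by dist;
quit;

/* Greedy pass: accept a pair only if neither subject has been used yet */
data matched_pairs;
   if _n_ = 1 then do;
      declare hash used_t();
      used_t.defineKey('smoker_id');
      used_t.defineDone();
      declare hash used_c();
      used_c.defineKey('nonsmoker_id');
      used_c.defineDone();
   end;
   set pairs;
   /* check() returns 0 when the key is already in the hash */
   if used_t.check() ne 0 and used_c.check() ne 0 then do;
      used_t.add();
      used_c.add();
      matchid + 1;      /* running identifier for the matched pair */
      output;
   end;
run;

/* One row per matched subject, tagged with its pair identifier */
proc sql;
   create table matched as
   select p.*, m.matchid
   from preds as p inner join matched_pairs as m
     on p.id = m.smoker_id or p.id = m.nonsmoker_id
   order by m.matchid, p.smoke;
quit;
```

The first step forms every smoker/nonsmoker combination, so this approach is practical only for modest sample sizes. Subjects with no partner inside the caliper are dropped from MATCHED.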
When the response of interest is binary, data matched on propensity scores can be analyzed using conditional logistic regression, generalized estimating equations (GEE), or random effects modeling. See the section on matching in Paul Allison's book, Logistic Regression Using SAS: Theory and Application, Second Edition, which discusses and illustrates the use of propensity scores and the analysis of data matched on them.
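As a rough sketch of the first two approaches, suppose the matched data are in a hypothetical data set `matched` with one row per subject, a binary response `cancer` coded 0/1, and a variable `matchid` identifying each matched set. These names are assumptions, not part of the original example.

```sas
/* Conditional logistic regression: STRATA conditions on each matched set */
proc logistic data=matched;
   strata matchid;
   model cancer(event="1") = smoke;
run;

/* GEE alternative: each matched set is treated as a cluster */
proc genmod data=matched descending;
   class matchid;
   model cancer = smoke / dist=binomial link=logit;
   repeated subject=matchid / type=exch;
run;
```

The conditional model estimates a within-set odds ratio for smoking, while the GEE model estimates a population-averaged effect, so the two estimates need not agree exactly.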
An alternative to matching on the propensity scores and adopting an analysis that accounts for the matching is to include the propensity scores in a logistic model along with the primary predictors of interest. To allow for a possibly nonlinear association between the propensity scores and the response, you can use categories of the propensity scores (such as quantiles) in the model. Alternatively, you can use a spline or loess smooth of the scores, allowing several degrees of freedom, in a generalized additive model fit using PROC GAM.
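Both variants of this covariate-adjustment approach can be sketched as follows. The sketch assumes that the PREDS data set also contains a binary response named `cancer` coded 0/1 (a hypothetical name), and the choices of five groups and four degrees of freedom are arbitrary illustrations.

```sas
/* Variant 1: propensity score quintiles as a categorical covariate */
proc rank data=preds groups=5 out=ranked;
   var IP_1;
   ranks psgroup;       /* psgroup takes the values 0 through 4 */
run;

proc logistic data=ranked;
   class psgroup / param=ref;
   model cancer(event="1") = smoke psgroup;
run;

/* Variant 2: smooth function of the propensity score in an additive model */
proc gam data=preds;
   model cancer(event="1") = param(smoke) spline(IP_1, df=4) / dist=binomial;
run;
```

In both models the coefficient for `smoke` is the estimate of primary interest; the propensity score terms serve only to adjust for the secondary variables.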
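| Product Family | Product | System | SAS Release (Reported) | SAS Release (Fixed*) |
| --- | --- | --- | --- | --- |
| SAS System | SAS/STAT | z/OS | | |
| SAS System | SAS/STAT | OpenVMS VAX | | |
| SAS System | SAS/STAT | Microsoft® Windows® for 64-Bit Itanium-based Systems | | |
| SAS System | SAS/STAT | Microsoft Windows Server 2003 Datacenter 64-bit Edition | | |
| SAS System | SAS/STAT | Microsoft Windows Server 2003 Enterprise 64-bit Edition | | |
| SAS System | SAS/STAT | Microsoft Windows XP 64-bit Edition | | |
| SAS System | SAS/STAT | Microsoft® Windows® for x64 | | |
| SAS System | SAS/STAT | OS/2 | | |
| SAS System | SAS/STAT | Microsoft Windows 95/98 | | |
| SAS System | SAS/STAT | Microsoft Windows 2000 Advanced Server | | |
| SAS System | SAS/STAT | Microsoft Windows 2000 Datacenter Server | | |
| SAS System | SAS/STAT | Microsoft Windows 2000 Server | | |
| SAS System | SAS/STAT | Microsoft Windows 2000 Professional | | |
| SAS System | SAS/STAT | Microsoft Windows NT Workstation | | |
| SAS System | SAS/STAT | Microsoft Windows Server 2003 Datacenter Edition | | |
| SAS System | SAS/STAT | Microsoft Windows Server 2003 Enterprise Edition | | |
| SAS System | SAS/STAT | Microsoft Windows Server 2003 Standard Edition | | |
| SAS System | SAS/STAT | Microsoft Windows XP Professional | | |
| SAS System | SAS/STAT | Windows Millennium Edition (Me) | | |
| SAS System | SAS/STAT | Windows Vista | | |
| SAS System | SAS/STAT | 64-bit Enabled AIX | | |
| SAS System | SAS/STAT | 64-bit Enabled HP-UX | | |
| SAS System | SAS/STAT | 64-bit Enabled Solaris | | |
| SAS System | SAS/STAT | ABI+ for Intel Architecture | | |
| SAS System | SAS/STAT | AIX | | |
| SAS System | SAS/STAT | HP-UX | | |
| SAS System | SAS/STAT | HP-UX IPF | | |
| SAS System | SAS/STAT | IRIX | | |
| SAS System | SAS/STAT | Linux | | |
| SAS System | SAS/STAT | Linux on Itanium | | |
| SAS System | SAS/STAT | OpenVMS Alpha | | |
| SAS System | SAS/STAT | Solaris | | |
| SAS System | SAS/STAT | Solaris for x64 | | |
| SAS System | SAS/STAT | Tru64 UNIX | | |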
| Type: | Usage Note |
| Priority: | |
| Topic: | SAS Reference ==> Procedures ==> LOGISTIC; Analytics ==> Analysis of Variance; Analytics ==> Categorical Data Analysis; Analytics ==> Regression |
| Date Modified: | 2008-01-18 14:10:39 |
| Date Created: | 2008-01-18 14:06:14 |




