The parameter estimates and odds ratio estimates for the covariates, along with their confidence limits, are unaffected by this type of stratified sampling, so no weighting is needed for them. The intercept estimate, however, is affected by the sampling, so any quantity computed from the full set of parameter estimates is incorrect. This includes predicted event probabilities, differences or ratios of event probabilities (the ratio is called the relative risk), and false positive and false negative rates. If you know the probabilities of events and nonevents in the population, you can adjust the intercept either by weighting or by using an offset.
To adjust by weighting, add a variable to your data set that takes the value p1/r1 in event observations, and the value (1-p1)/(1-r1) in nonevent observations, where p1 is the probability of an event in the population and r1 is the proportion of events in your data set. Specify this variable in the WEIGHT statement in PROC LOGISTIC.
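One way to see why these weights are reasonable is to check that they rescale the two response groups so that the weighted proportion of events equals the population value. The short derivation below uses only p1 and r1 as defined above; it is a sanity check rather than a full justification of the method.

```latex
\[
\frac{r_1 \cdot \dfrac{p_1}{r_1}}
     {r_1 \cdot \dfrac{p_1}{r_1} + (1 - r_1) \cdot \dfrac{1 - p_1}{1 - r_1}}
  = \frac{p_1}{p_1 + (1 - p_1)}
  = p_1
\]
```

For instance, with p1 = 0.1 and r1 = 0.5 (illustrative values), event observations get weight 0.1/0.5 = 0.2 and nonevent observations get weight 0.9/0.5 = 1.8.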
Or, to adjust by using an offset, add a variable to your data set defined as log[(r1*(1-p1)) / ((1-r1)*p1)], where log represents the natural logarithm. Specify this variable in the OFFSET= option of the MODEL statement in PROC LOGISTIC.
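The form of the offset can be motivated with a short sketch. Assume that observations are retained with a probability that depends only on the response: s1 for events and s0 for nonevents (these two symbols are introduced here for illustration and do not appear elsewhere in this note). Writing p(x) for the population event probability and p*(x) for the event probability in the sample:

```latex
\[
\operatorname{logit} p^{*}(x)
  = \operatorname{logit} p(x) + \log\frac{s_1}{s_0},
\qquad
\frac{r_1}{1 - r_1} = \frac{s_1\, p_1}{s_0\,(1 - p_1)}
\;\Longrightarrow\;
\log\frac{s_1}{s_0} = \log\frac{r_1 (1 - p_1)}{(1 - r_1)\, p_1}
\]
```

Under that assumption, the sampling shifts only the intercept, and the shift is exactly the offset defined above. For example, with p1 = 0.1 and r1 = 0.5 (illustrative values), the offset is log(9), or about 2.197.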
These two methods are not equivalent. The weighting method seems to perform better when the model is not correctly specified. For details, see Scott and Wild, "Fitting Logistic Models under Case-Control or Choice-Based Sampling," Journal of the Royal Statistical Society B (1986): 48.
The weighting method also has the advantage of directly providing properly adjusted predicted probabilities via the PREDICTED= or PREDPROBS= option in the OUTPUT statement. Prior to SAS 9, the offset method required using PROC SCORE to obtain predicted logits, ignoring the offset, followed by a DATA step to transform the logits to probabilities. This is somewhat inefficient computationally, which could be a problem with large data sets. Beginning with SAS 9, however, you can use the PRIOR= or PRIOREVENT= option in the SCORE statement to supply the population event and nonevent probabilities and obtain predicted probabilities that are equivalent to those produced by the offset method. These methods are illustrated below.
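As a minimal sketch of the SAS 9 approach, the population event probability can be passed directly through the PRIOREVENT= option instead of a PRIOR= data set. The sketch assumes an oversampled data set SUB with binary response Y and predictor X; the value 0.1 and the output data set name SCORED are illustrative assumptions.

```sas
/* Fit the unadjusted model to the oversampled data, then score it   */
/* using the population event probability (assumed here to be 0.1).  */
proc logistic data=sub;
   model y(event="1") = x;
   /* SCORED is a hypothetical output data set name */
   score data=sub priorevent=0.1 out=scored;
run;
```

With response levels 1 and 0, the scored data set should contain the prior-adjusted event probability in a variable named P_1.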
For more information about analyzing data from stratified sampling, see Chapter 6 of the Hosmer and Lemeshow reference or Chapter 7 of the Collett reference.
The following example illustrates obtaining predicted probabilities adjusted for oversampling. Data set FULL is created containing a binary response, Y (with event=1 and nonevent=0), and a predictor, X. The true model from which the data are generated is logit(p) = -3.35 + 2*X, which results in an overall proportion of events of approximately 0.1. A subset of this data set (data set SUB) is obtained by oversampling: all Y=1 observations and approximately one-ninth of the Y=0 observations are retained, resulting in a sample with approximately equal numbers of events and nonevents. The offset (OFF) and weights (W) are computed as discussed above using the known population and sample event probabilities.

An unadjusted logistic regression and offset- and weight-adjusted logistic regressions are run, with the adjusted fits yielding corrected intercepts. Predicted probabilities are computed as discussed above, and the true, unadjusted, offset-adjusted, and weight-adjusted probabilities are plotted. The plot uses a custom ODS graphics template that is created with the TEMPLATE procedure and applied in the final DATA step. ODS graphics require SAS 9.1 or later; a similar plot could be created with PROC GPLOT. All results are accumulated in data set OUT.
data full;
   do i=1 to 1000;
      x=rannor(12342);
      p=1/(1+exp(-(-3.35+2*x)));
      y=ranbin(98435,1,p);
      drop i;
      output;
   end;
run;

data sub;
   set full;
   if y=1 or (y=0 and ranuni(75302)<1/9) then output;
run;

proc freq data=full;
   table y / out=fullpct(where=(y=1) rename=(percent=fullpct));
   title "Response counts in full data set";
run;
Response counts in full data set

| y | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| 0 | 900 | 90.00 | 900 | 90.00 |
| 1 | 100 | 10.00 | 1000 | 100.00 |
proc freq data=sub;
   table y / out=subpct(where=(y=1) rename=(percent=subpct));
   title "Response counts in oversampled, subset data set";
run;
Response counts in oversampled, subset data set

| y | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| 0 | 102 | 50.50 | 102 | 50.50 |
| 1 | 100 | 49.50 | 202 | 100.00 |
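data sub;
   set sub;
   /* Bring in the event percentages from the full and subset data */
   if _n_=1 then set fullpct(keep=fullpct);
   if _n_=1 then set subpct(keep=subpct);
   p1=fullpct/100; r1=subpct/100;
   /* Weight: p1/r1 for events, (1-p1)/(1-r1) for nonevents */
   w=p1/r1; if y=0 then w=(1-p1)/(1-r1);
   /* Offset: natural log of (r1*(1-p1)) / ((1-r1)*p1) */
   off=log( (r1*(1-p1)) / ((1-r1)*p1) );
run;
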
ods select parameterestimates(persist);

proc logistic data=sub;
   model y(event="1")=x;
   output out=out p=pnowt;
   title "True Parameters: -3.35 (intercept), 2 (X)";
   title2 "Unadjusted Model";
run;
True Parameters: -3.35 (intercept), 2 (X)
Unadjusted Model

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -1.2821 | 0.2677 | 22.9342 | <.0001 |
| x | 1 | 2.3149 | 0.3383 | 46.8238 | <.0001 |
proc logistic data=out;
   model y(event="1")=x;
   weight w;
   output out=out p=pwt;
   title2 "Weight-adjusted Model";
run;
True Parameters: -3.35 (intercept), 2 (X)
Weight-adjusted Model

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -3.5247 | 0.4985 | 50.0024 | <.0001 |
| x | 1 | 2.3938 | 0.4839 | 24.4690 | <.0001 |
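proc logistic data=out;
   model y(event="1")=x / offset=off;
   output out=out xbeta=xboff;
   title2 "Offset-adjusted Model";
run;

/* Subtract the offset from the linear predictor, then convert the */
/* resulting logit to a probability                                */
data out;
   set out;
   poff=logistic(xboff-off);
run;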
True Parameters: -3.35 (intercept), 2 (X)
Offset-adjusted Model

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -3.4596 | 0.2677 | 166.9771 | <.0001 |
| x | 1 | 2.3149 | 0.3383 | 46.8238 | <.0001 |
| off | 1 | 1.0000 | 0 | . | . |
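
Because the offset is constant across observations, the offset-adjusted fit keeps the unadjusted slope and standard errors and simply shifts the intercept by the offset. The following sketch checks that relationship by hand, using the counts and the unadjusted intercept reported in the tables of this note; the variable names b0_unadj and b0_adj are hypothetical.

```sas
/* Check: offset-adjusted intercept = unadjusted intercept - offset */
data _null_;
   p1 = 100/1000;        /* population event proportion (data set FULL) */
   r1 = 100/202;         /* sample event proportion (data set SUB)      */
   off = log( (r1*(1-p1)) / ((1-r1)*p1) );
   b0_unadj = -1.2821;   /* intercept reported for the unadjusted model */
   b0_adj = b0_unadj - off;
   put off=8.4 b0_adj=8.4;
run;
```

The offset works out to about 2.177, so the shifted intercept is about -3.46, which agrees with the reported offset-adjusted intercept of -3.4596 up to rounding of the displayed estimates.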
proc freq data=full noprint;
   table y / out=priors(drop=percent rename=(count=_prior_));
run;

proc logistic data=out;
   model y(event="1")=x;
   score data=sub prior=priors out=out2;
   title2 "Unadjusted Model; Prior-adjusted probabilities";
run;

data out;
   merge out out2;
   drop _lev:;
run;
True Parameters: -3.35 (intercept), 2 (X)
Unadjusted Model; Prior-adjusted probabilities

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -1.2821 | 0.2677 | 22.9342 | <.0001 |
| x | 1 | 2.3149 | 0.3383 | 46.8238 | <.0001 |
ods select all;
ods html;
ods graphics on;

proc template;
   define statgraph LOGAdj / store = SASUSER.TEMPLAT;
      notes "plot of various adjustments for oversampling";
      layout overlay;
         series y=p x=x / sort=x name="p"
                legendlabel="True" linecolor=red;
         series y=pnowt x=x / sort=x name="pnowt"
                legendlabel="Unadjusted" linecolor=black;
         series y=pwt x=x / sort=x name="pwt"
                legendlabel="Weighted" linecolor=green;
         series y=poff x=x / sort=x name="poff"
                legendlabel="Offset" linecolor=brown;
         series y=p_1 x=x / sort=x name="p_1"
                legendlabel="Priors" linecolor=yellow;
         discretelegend "p" "pnowt" "pwt" "poff" "p_1" /
                halign=left valign=top across=3 down=3 border=on
                title="Adjustments" titleborder=on;
      endlayout;
   end;
run;

title "Adjustments To Probabilities";
title2;

data _null_;
   set out;
   file print ods=(template='LOGAdj');
   put _ods_;
run;

ods graphics off;
ods html close;

Note that the offset-adjusted probabilities match those from using the PRIOR= option in the SCORE statement. Both the offset and the weight adjustments come closer to the true probabilities than the unadjusted probabilities, which are incorrect because the intercept is unadjusted. The offset adjustment seems slightly better than the weight adjustment in this case, possibly because the correct model is specified. With a larger initial data set (say, 10000 observations) and a correspondingly larger sample (roughly 1000 events and 1000 nonevents), both adjustment methods yield very similar probabilities that are very close to the true probabilities.
| Product Family | Product | System | Reported Release | Fixed Release* |
| SAS System | SAS/STAT | All | n/a |
| Type: | Usage Note |
| Priority: | low |
| Topic: | SAS Reference ==> Procedures ==> LOGISTIC |
| Date Modified: | 2005-04-04 11:28:57 |
| Date Created: | 2002-12-16 10:56:38 |



