Example 54.5 Stratified Sampling

Consider the hypothetical example in Fleiss (1981, pp. 6–7), in which a test is applied to a sample of 1,000 people known to have a disease and to another sample of 1,000 people known not to have the same disease. In the diseased sample, 950 test positive; in the nondiseased sample, only 10 test positive. If the true disease rate in the population is 1 in 100, specifying PEVENT=0.01 results in the correct false positive and false negative rates for the stratified sampling scheme. Omitting the PEVENT= option is equivalent to using the overall sample disease rate (1000/2000 = 0.5) as the value of the PEVENT= option, which ignores the stratified sampling.
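As a side calculation outside SAS, the following Python sketch tabulates the quantities that can be read directly from the two samples. The variable names are illustrative and do not come from the example. It shows that the sensitivity and specificity are each computed within a single stratum, so they do not depend on how many people were sampled from each group, whereas the sample disease rate of 0.5 is fixed by the sampling design rather than by the population.

```python
# Counts from the two samples of 1,000 people each (illustrative names).
true_pos, false_neg = 950, 50    # diseased sample: test positive, test negative
false_pos, true_neg = 10, 990    # nondiseased sample: test positive, test negative

# Each quantity below uses counts from one stratum only.
sensitivity = true_pos / (true_pos + false_neg)    # P(test positive | diseased)
specificity = true_neg / (false_pos + true_neg)    # P(test negative | nondiseased)

# The sample disease rate reflects the 1,000 / 1,000 design, not the population.
n_total = true_pos + false_neg + false_pos + true_neg
sample_rate = (true_pos + false_neg) / n_total

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"sample disease rate={sample_rate:.2f}")
```

Run as written, this prints a sensitivity of 0.95, a specificity of 0.99, and a sample disease rate of 0.50.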

The statements to create the data set and perform the analysis are as follows:

data Screen;
   do Disease='Present','Absent';
      do Test=1,0;
         input Count @@;
         output;
      end;
   end;
   datalines;
950  50
10 990
;

proc logistic data=Screen;
   freq Count;
   model Disease(event='Present')=Test
         / pevent=.5 .01 ctable pprob=.5;
run;

The response variable option EVENT= indicates that Disease='Present' is the event. The CTABLE option is specified to produce a classification table. Specifying PPROB=0.5 indicates a cutoff probability of 0.5. A list of two probabilities, 0.5 and 0.01, is specified for the PEVENT= option; 0.5 corresponds to the overall sample disease rate, and 0.01 corresponds to a true disease rate of 1 in 100.

The classification table is shown in Output 54.5.1.

Output 54.5.1: False Positive and False Negative Rates

The LOGISTIC Procedure

Classification Table
                    Correct        Incorrect                   Percentages
    Prob    Prob            Non-            Non-          Sensi-  Speci-   False   False
   Event   Level   Event   Event   Event   Event Correct  tivity  ficity     POS     NEG
   0.500   0.500     950     990      10      50    97.0    95.0    99.0     1.0     4.8
   0.010   0.500     950     990      10      50    99.0    95.0    99.0    51.0     0.1

In the classification table, the Prob Event column lists the prior event probabilities (the values of the PEVENT= option), and the Prob Level column lists the cutoff values (the settings of the PPROB= option) for predicting whether an observation is an event. The Correct columns list the numbers of subjects that are correctly predicted as events and nonevents, respectively, and the Incorrect columns list the number of nonevents incorrectly predicted as events and the number of events incorrectly predicted as nonevents, respectively. For PEVENT=0.5, the false positive rate is 1% and the false negative rate is 4.8%. These results ignore the fact that the samples were stratified and incorrectly assume that the overall sample proportion of disease (which is 0.5) estimates the true disease rate. For a true disease rate of 0.01, the false positive rate and the false negative rate are 51% and 0.1%, respectively, as shown in the second line of the classification table.
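The percentages in the table can be checked by hand with Bayes' theorem. The Python sketch below is not part of the SAS example. It assumes that the false positive rate is the proportion of predicted events that are actually nonevents, that the false negative rate is the proportion of predicted nonevents that are actually events, and that both are computed as posterior probabilities from the sensitivity (0.95), the specificity (0.99), and the prior event probability. The function name adjusted_rates is hypothetical.

```python
def adjusted_rates(sensitivity, specificity, prevalence):
    """Return (correct, false_pos, false_neg) as proportions.

    Assumed definitions: false_pos = P(nonevent | predicted event),
    false_neg = P(event | predicted nonevent), both weighted by the prior.
    """
    p = prevalence
    # Probability of each prediction under the assumed prior.
    pred_event = sensitivity * p + (1 - specificity) * (1 - p)
    pred_nonevent = (1 - sensitivity) * p + specificity * (1 - p)
    # Overall proportion correctly classified.
    correct = sensitivity * p + specificity * (1 - p)
    false_pos = (1 - specificity) * (1 - p) / pred_event
    false_neg = (1 - sensitivity) * p / pred_nonevent
    return correct, false_pos, false_neg


for prior in (0.5, 0.01):
    correct, false_pos, false_neg = adjusted_rates(0.95, 0.99, prior)
    print(f"prior={prior}: correct={100 * correct:.1f}% "
          f"false pos={100 * false_pos:.1f}% false neg={100 * false_neg:.1f}%")
```

Under these assumptions the sketch gives 97.0%, 1.0%, and 4.8% for a prior of 0.5, and 99.0%, 51.0%, and 0.1% for a prior of 0.01, which agree with the two rows of the classification table. The sensitivity and specificity are unchanged between the rows because neither depends on the prior.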