Usage Note 22561: Testing the equality of two or more proportions from independent samples


You can test the equality of two proportions obtained from independent samples using the Pearson chi-square test. Specify the CHISQ option in the TABLES statement of PROC FREQ to compute this test. This is equivalent to the well-known Z test for comparing two independent proportions.

Suppose you want to compare the proportions responding Yes to a question in independent samples of 100 men and 100 women. Of the men, 30 responded Yes; of the women, 45 responded Yes. The data can be arranged in a 2×2 table:

                     Response
                     Yes  No  Total
       Gender    Men  30  70  100
               Women  45  55  100

It is assumed that each man in the sample of men had the same probability of responding yes. The same is assumed for the sample of women, though of course, the two probabilities may differ.

To conduct the test, begin by entering the data in a DATA step. In the statements below, there is one line of data giving the number of men responding Yes and the total sample size for men. A second line contains the same information for the sample of women. The programming statements create the Response variable with values Yes and No, create the Count variable computing the count for the No level as total minus number responding Yes, and output an observation for each Response level. The PROC PRINT step shows the resulting data set which has one observation for each cell of the table.

      data YesNo;
        input Gender $ NumYes Total;
        Response="Yes"; Count=NumYes;       output;
        Response="No "; Count=Total-NumYes; output;
        datalines;
      Men   30 100
      Women 45 100
      ;
      
      proc print noobs; 
        var Gender Response Count;
        run;

      Gender    Response    Count
      Men       Yes            30
      Men       No             70
      Women     Yes            45
      Women     No             55

You can now use PROC FREQ to display the data in table form and compute the chi-square test comparing the proportions. The WEIGHT statement names the variable that contains the cell counts of the table. In the TABLE statement, the first variable defines the rows of the table and the second variable defines the columns. The ORDER=DATA option, while not necessary to display the table and conduct a proper analysis, ensures that the levels defining the rows and columns appear in the order in which they are encountered when reading the data set. So, Men appears before Women and Yes appears before No. The CHISQ option requests the chi-square test, and the RISKDIFF option requests estimates of the proportions and their difference, which are discussed below.

      proc freq order=data;
        weight Count;
        table Gender * Response / chisq riskdiff;
        run;

In the results, a table of statistics includes the Pearson chi-square test (labeled "Chi-Square"). The small p-value for the test (p=0.0285) indicates that the null hypothesis of equal proportions can be rejected at the 0.05 level, providing evidence that the two proportions differ.

Note that the Pearson test is a test of the independence of the row and column variables — Gender and Response in this example. But this hypothesis can be shown to be equivalent to the hypothesis of equal row (or column) proportions.

      Statistic                       DF      Value      Prob
      Chi-Square                       1     4.8000    0.0285
      Likelihood Ratio Chi-Square      1     4.8247    0.0281
      Continuity Adj. Chi-Square       1     4.1813    0.0409
      Mantel-Haenszel Chi-Square       1     4.7760    0.0289
      Phi Coefficient                       -0.1549
      Contingency Coefficient                0.1531
      Cramer's V                            -0.1549
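
To see where the Pearson statistic comes from, one option is to ask PROC FREQ for the expected cell counts and each cell's contribution to the chi-square. The following is a minimal sketch, assuming the YesNo data set created above is still available; the EXPECTED and CELLCHI2 options are additions for illustration and are not part of the original example.

```sas
/* Sketch: show expected counts and per-cell chi-square contributions */
proc freq data=YesNo order=data;
  weight Count;
  /* EXPECTED prints the expected frequencies under independence.   */
  /* CELLCHI2 prints (observed - expected)**2 / expected per cell.  */
  table Gender * Response / chisq expected cellchi2;
run;
```

Under the null hypothesis the expected counts are 37.5 (Yes) and 62.5 (No) in each row, so the four cell contributions are 1.5, 0.9, 1.5, and 0.9, which sum to the reported chi-square value of 4.8.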

If you are interested in the difference in the probability of responding Yes between men and women, the RISKDIFF option provides an estimate as well as a confidence interval. Since Yes is in Column 1, the "Column 1 Risk Estimates" table provides the desired estimates. The estimated difference in probabilities (Men in Row 1 - Women in Row 2) is -0.15 with 95% confidence limits (-0.2826, -0.0174). Since this interval does not include zero as a plausible value of the difference between the population proportions, the difference is significant at the 0.05 level.

      Column 1 Risk Estimates

                                       (Asymptotic) 95%       (Exact) 95%
                      Risk      ASE    Confidence Limits    Confidence Limits
      Row 1         0.3000   0.0458    0.2102    0.3898     0.2124    0.3998
      Row 2         0.4500   0.0497    0.3525    0.5475     0.3503    0.5527
      Total         0.3750   0.0342    0.3079    0.4421     0.3077    0.4461
      Difference   -0.1500   0.0676   -0.2826   -0.0174

      Difference is (Row 1 - Row 2)
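
The asymptotic limits for the difference can be checked by hand. The DATA step below is a rough sketch, assuming the default limits are the usual Wald limits (estimate plus or minus a normal quantile times the unpooled standard error). The data set name RiskDiff and the variable names are made up for illustration.

```sas
/* Sketch: recompute the Wald limits for the risk difference */
data RiskDiff;
  x1 = 30; n1 = 100;                         /* men: Yes count, sample size   */
  x2 = 45; n2 = 100;                         /* women: Yes count, sample size */
  p1 = x1 / n1;
  p2 = x2 / n2;
  diff  = p1 - p2;                           /* Row 1 - Row 2                 */
  ase   = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2); /* unpooled standard error       */
  zcrit = probit(0.975);                     /* normal quantile, about 1.96   */
  lower = diff - zcrit*ase;
  upper = diff + zcrit*ase;
run;

proc print data=RiskDiff noobs;
  var diff ase lower upper;
  format diff ase lower upper 8.4;
run;
```

If the assumption holds, the printed values should agree with the Difference row of the table above.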

When the table is more complex, involving more variables, a logistic model is often fit to the data. In such models, the odds ratio is often used for comparisons. But differences in probabilities can also be estimated as described in this note.

Exact test and confidence interval

For the chi-square test to be valid, the cell counts must not be too small. The usual rule of thumb is that all cell counts should be at least 5, though this may be a little too stringent. When some cell counts are too small, you can use Fisher's exact test which is also provided by the CHISQ option. The Fisher test, while more conservative, also shows a significant difference between the proportions (p=0.0405).

      Fisher's Exact Test
      Cell (1,1) Frequency (F)        30
      Left-sided Pr <= F          0.0203
      Right-sided Pr >= F         0.9904

      Table Probability (P)       0.0107
      Two-sided Pr <= P           0.0405

An exact confidence interval of the difference in probabilities can also be obtained by using the RISKDIFF(CL=EXACT) option and the EXACT statement.

      proc freq order=data;
        weight Count;
        table Gender * Response / riskdiff(cl=exact);
        exact riskdiff;
        run;

Note that the exact method can require considerable time and memory. It should only be used when the large-sample confidence interval (above) is questionable due to small cell counts. There is no reason to question the validity of the large-sample interval in this table, and this is confirmed by the exact confidence interval being very similar to the large-sample interval.

      Column 1 Risk Estimates

                                       (Asymptotic) 95%       (Exact) 95%
                      Risk      ASE    Confidence Limits    Confidence Limits
      Row 1         0.3000   0.0458    0.2102    0.3898     0.2124    0.3998
      Row 2         0.4500   0.0497    0.3525    0.5475     0.3503    0.5527
      Total         0.3750   0.0342    0.3079    0.4421     0.3077    0.4461
      Difference   -0.1500   0.0676   -0.2826   -0.0174    -0.2889   -0.0066

      Difference is (Row 1 - Row 2)

One-sided test

Notice that the Fisher test results include one-sided tests as well as the two-sided test that corresponds to the Pearson chi-square test. See this note for interpretation of the one-sided Fisher test results. As noted above, the Pearson chi-square statistic is equivalent to the Z test statistic, which is also commonly used to test the equality of independent proportions. In fact, chi-square = Z². Consequently, the p-value for the two-sided Z test is the same as for the chi-square test. And since the p-value for a two-sided test is double the one-sided p-value when the test statistic's distribution is symmetric (as is the normal distribution), you can get the one-sided p-value by simply halving the two-sided p-value, provided the observed difference is in the direction of the alternative hypothesis. So, if your alternative hypothesis was that the proportion of men responding Yes is less than the proportion of women responding Yes, which is the direction observed here, then the p-value for this one-sided test would be 0.0285/2 = 0.0143.
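
The relationship between the Z statistic and the chi-square statistic can be verified directly. The following DATA step is a minimal sketch of the pooled two-sample Z test for this example. The data set name ZTest and its variable names are hypothetical, and the step is an illustration rather than something PROC FREQ produces.

```sas
/* Sketch: pooled Z test for two independent proportions */
data ZTest;
  x1 = 30; n1 = 100;                    /* men: Yes count, sample size     */
  x2 = 45; n2 = 100;                    /* women: Yes count, sample size   */
  p1 = x1 / n1;
  p2 = x2 / n2;
  pooled = (x1 + x2) / (n1 + n2);       /* common proportion under H0      */
  se = sqrt(pooled*(1 - pooled)*(1/n1 + 1/n2));
  z = (p1 - p2) / se;
  chisq   = z**2;                       /* equals the Pearson chi-square   */
  p_two   = 2*(1 - probnorm(abs(z)));   /* two-sided p-value               */
  p_lower = probnorm(z);                /* one-sided: men less than women  */
run;

proc print data=ZTest noobs;
  var z chisq p_two p_lower;
  format z chisq p_two p_lower 8.4;
run;
```

Here Z is about -2.19, its square is 4.8, and the one-sided p-value is half of the two-sided value because Z falls on the side named by the alternative.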

Power and sample size determination

This note discusses assessing power and sample size needs and gives an example.
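
As a rough illustration of the kind of computation involved, the sketch below uses the TWOSAMPLEFREQ statement of PROC POWER (part of SAS/STAT) to compute the power of the Pearson chi-square test. The proportions 0.30 and 0.45 and the group size of 100 are treated here as assumed planning values borrowed from the example, not as recommendations, and this is not necessarily the approach taken in the referenced note.

```sas
/* Sketch: power of the Pearson chi-square test for two proportions */
proc power;
  twosamplefreq test=pchi
    groupproportions = (0.30 0.45)   /* assumed true proportions      */
    npergroup        = 100           /* assumed sample size per group */
    power            = .;            /* missing value: solve for this */
run;
```

To solve for sample size instead, specify a target such as power=0.8 and set npergroup=. so that the missing value marks the quantity to be computed.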

Comparing more than two proportions

You can compare more than two proportions in the same way as above: simply add a line in the DATA step for each proportion. The chi-square test then has more than one degree of freedom, and a small p-value tells you that at least two of the proportions differ. See this note for how to do multiple comparisons to see which proportions differ.
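
A minimal sketch of the three-group case follows. The data set name YesNo3, the Group variable, and all of the counts are hypothetical values invented purely to show the layout; they are not results from any study.

```sas
/* Sketch: compare three proportions (hypothetical data) */
data YesNo3;
  input Group $ NumYes Total;
  Response="Yes"; Count=NumYes;       output;
  Response="No "; Count=Total-NumYes; output;
  datalines;
A 30 100
B 45 100
C 38 100
;

proc freq data=YesNo3 order=data;
  weight Count;
  /* With three rows and two columns the chi-square test has 2 DF */
  table Group * Response / chisq;
run;
```

The only change from the two-group program is the extra data line; the same DATA step logic expands each line into a Yes cell and a No cell.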



Operating System and Release Information

      Product Family    Product     System    SAS Release Reported    SAS Release Fixed*
      SAS System        SAS/STAT    All       n/a
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.