Usage Note 32609: Testing hypotheses or getting confidence intervals for the proportions of a multinomial variable


Suppose each subject in a sample is categorized into one level of a multinomial variable, and you want to test one or more hypotheses concerning the probabilities associated with the levels. You also want confidence intervals for the probability of each level.

This can be done in PROC CATMOD using the RESPONSE and CONTRAST statements. Alternatively, Berry and Hurtado (1994; see footnote 1) presented the MULTINOM SAS/IML module, which gives confidence intervals for linear combinations (such as pairwise differences) of such dependent proportions. A slight modification of MULTINOM also provides a test statistic and associated p-value for each linear combination. MULTINOM is particularly useful when some of the observed counts are zero. In that case, the CATMOD procedure is not useful because the model parameters are not well estimated.

Example 1: Compare the probabilities of levels of a multinomial variable

Using the MULTINOM module

Suppose you take a sample of 100 objects from a population in which objects are of three possible types: 1, 2, or 3. You observe 10 objects of type 1, 18 of type 2, and 72 of type 3, and you want to test for differences among the probabilities of the three types. After you define the MULTINOM module (by submitting all of its code to SAS), the following statement calls MULTINOM and produces confidence intervals and hypothesis tests for each pairwise comparison of the types. The second argument, "I", requests individual (unadjusted) intervals and tests. The third argument, 0.05, is the significance level, so each interval has 95% confidence and each test is conducted at the 0.05 level. The missing value in the last argument requests all pairwise comparisons. You can instead estimate and test particular linear combinations by specifying a matrix, with one linear combination per row, in the last argument.

      run multinom( {10 18 72} , "I" , 0.05 , . );
The MULTINOM module

All Pairwise Differences

95% Individual Large-Sample Confidence Limits For Pi-Pj

I  J  Estimate      LCL      UCL    ChiSq  Pr>ChiSq
1  2    -.0800   -.1825   0.0225   2.3392    0.1262
1  3    -.6200   -.7494   -.4906   88.246    <.0001
2  3    -.5400   -.6929   -.3871   47.929    <.0001
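
The intervals and tests above can be cross-checked by hand. The DATA step below is a minimal sketch, assuming the module uses the usual large-sample (Wald) variance for a difference of two proportions from the same multinomial sample, Var(Pi-Pj) = [Pi + Pj - (Pi-Pj)**2] / n. The data set name PAIRCHECK and all variable names are illustrative and are not part of MULTINOM.

```sas
data paircheck;
  array cnt[3] _temporary_ (10 18 72);    /* observed counts for types 1-3 */
  n     = cnt[1] + cnt[2] + cnt[3];       /* total sample size (100) */
  alpha = 0.05;
  z     = quantile('NORMAL', 1 - alpha/2); /* two-sided normal critical value */
  do i = 1 to 2;
    do j = i + 1 to 3;
      p_i   = cnt[i] / n;
      p_j   = cnt[j] / n;
      est   = p_i - p_j;                          /* estimated difference */
      se    = sqrt((p_i + p_j - est**2) / n);     /* assumed Wald standard error */
      lcl   = est - z * se;
      ucl   = est + z * se;
      chisq = (est / se)**2;                      /* 1-df Wald chi-square */
      p_value = sdf('CHISQ', chisq, 1);           /* upper-tail probability */
      output;
    end;
  end;
  keep i j est lcl ucl chisq p_value;
  format est lcl ucl 8.4 chisq 8.3 p_value pvalue6.4;
run;

proc print data=paircheck noobs;
run;
```

If the assumption about the variance holds, the printed estimates, limits, and chi-square statistics should agree with the MULTINOM output to the displayed precision.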

Setting the second argument to "S" rather than "I" applies the Bonferroni adjustment to the intervals and tests so that they have a simultaneous 95% confidence level:

      run multinom( {10 18 72} , "S" , 0.05 , . );
The MULTINOM module

All Pairwise Differences

95% Simultaneous Large-Sample Confidence Limits For Pi-Pj

I  J  Estimate      LCL      UCL    ChiSq  Pr>ChiSq
1  2    -.0800   -.2052   0.0452   2.3392    0.3785
1  3    -.6200   -.7780   -.4620   88.246    <.0001
2  3    -.5400   -.7267   -.3533   47.929    <.0001

The Bonferroni-adjusted results show that the probability of type 3 differs significantly from the probabilities of both type 1 and type 2 (p<0.0001), but the probabilities of types 1 and 2 are not significantly different (p=0.3785). A confidence interval for each difference of multinomial probabilities is given. With the adjustment, the overall Type I error rate is controlled at 0.05.
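
To see where the simultaneous limits and adjusted p-value come from, the following DATA step is a minimal sketch for the type 1 versus type 2 comparison. It assumes the "S" option divides the significance level by the number of pairwise comparisons (three here) and multiplies each raw p-value by the same number, capped at 1. The data set name BONFCHECK and the variable names are illustrative.

```sas
data bonfcheck;
  n     = 100;                                   /* total sample size */
  k     = 3;                                     /* number of pairwise comparisons */
  alpha = 0.05;
  z_ind = quantile('NORMAL', 1 - alpha/2);       /* individual critical value */
  z_sim = quantile('NORMAL', 1 - alpha/(2*k));   /* Bonferroni critical value */

  p1  = 10 / n;
  p2  = 18 / n;
  est = p1 - p2;                                 /* estimated difference p1-p2 */
  se  = sqrt((p1 + p2 - est**2) / n);            /* large-sample standard error */

  lcl_sim = est - z_sim * se;                    /* simultaneous limits */
  ucl_sim = est + z_sim * se;

  raw_p = sdf('CHISQ', (est / se)**2, 1);        /* unadjusted p-value */
  adj_p = min(1, k * raw_p);                     /* Bonferroni-adjusted p-value */
  format est lcl_sim ucl_sim raw_p adj_p 8.4 z_ind z_sim 8.4;
run;

proc print data=bonfcheck noobs;
  var z_ind z_sim est lcl_sim ucl_sim raw_p adj_p;
run;
```

The larger critical value is what widens the simultaneous intervals relative to the individual ones; the estimates and chi-square statistics are unchanged.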

Using PROC CATMOD

The analysis can also be done with PROC CATMOD as follows. First, a data set is created containing the observed counts for each level of the multinomial variable, Y.

      data a;
        input y count;
        datalines;
      1 10
      2 18
      3 72
      ;

In the CATMOD statements below, the RESPONSE statement defines two response functions, which are two of the pairwise differences: p1-p2 and p1-p3. Only two can be defined because the set of response functions must be independent, and the third pairwise difference, p2-p3, would be redundant (it is the second difference minus the first). This last difference is obtained using the CONTRAST statement. Because the model contains only an intercept for each response function (no predictors are specified), the intercepts estimate the two pairwise differences, and the second intercept minus the first is the p2-p3 difference. The ESTIMATE=PARM option provides an estimate and confidence interval for the p2-p3 difference. The CLPARM option in the MODEL statement requests confidence intervals for the other pairwise differences. The ODS OUTPUT statement writes the parameter and contrast results, particularly the p-values, to data sets.

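      /* Save the parameter estimates and contrast results as data sets */
      ods output estimates=pe contrasts=c;
      proc catmod data=a;
        response 1 -1 0, 1 0 -1;    /* response functions: p1-p2 and p1-p3 */
        weight count;
        model y= / clparm;          /* intercept-only model with confidence limits */
        /* (p1-p3) - (p1-p2) = p2-p3 */
        contrast '2 vs 3' @1 intercept -1 @2 intercept 1 / estimate=parm;
        run; quit;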
Analysis of Weighted Least Squares Estimates

Parameter  Function Number  Estimate  Standard Error  Chi-Square  Pr > ChiSq  95% Confidence Limits
Intercept                1   -0.0800          0.0523        2.34      0.1262  -0.1825        0.0225
                         2   -0.6200          0.0660       88.25      <.0001  -0.7494       -0.4906

Analysis of Contrasts

Contrast  DF  Chi-Square  Pr > ChiSq
2 vs 3     1       47.93      <.0001

Analysis of Contrasts of Weighted Least Squares Estimates

Contrast  Type  Row  Estimate  Standard Error  Chi-Square  Pr > ChiSq  Alpha  Confidence Limits
2 vs 3    PARM    1   -0.5400          0.0780       47.93      <.0001   0.05  -0.6929   -0.3871

Notice that the chi-square statistics and p-values agree with those from the first MULTINOM analysis.
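
The relationship between the response functions and the contrast can be sketched directly in SAS/IML. The code below is an illustration only, not the CATMOD or MULTINOM implementation: it assumes the usual large-sample covariance matrix of multinomial proportions, (diag(p) - p*p`)/n, and the matrix names V, A, F, S, and L are chosen here for illustration.

```sas
proc iml;
  n = 100;
  p = {10, 18, 72} / n;              /* observed proportions (column vector) */
  V = (diag(p) - p * t(p)) / n;      /* assumed covariance matrix of the proportions */

  A = {1 -1  0,
       1  0 -1};                     /* rows match the RESPONSE statement */
  F = A * p;                         /* response functions: p1-p2 and p1-p3 */
  S = A * V * t(A);                  /* covariance matrix of the response functions */

  L = {-1 1};                        /* contrast: second function minus first */
  est   = L * F;                     /* estimate of p2-p3 */
  se    = sqrt(L * S * t(L));        /* standard error of the contrast */
  chisq = (est / se)##2;             /* 1-df Wald chi-square */
  pval  = 1 - probchi(chisq, 1);

  print F, est se chisq pval;
quit;
```

Under these assumptions the contrast estimate, standard error, and chi-square should match the "2 vs 3" row of the CATMOD output.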

Bonferroni adjustment (or other types of adjustment) of the p-values can be done with PROC MULTTEST. First, the parameter and contrast data sets are combined and the raw p-values are placed in a variable named RAW_P, which PROC MULTTEST requires in order to adjust the p-values.

      data pvals;
        set pe c;
        raw_p=ProbChiSq;
        run;
      proc multtest pdata=pvals bon; 
        run;
The Multtest Procedure

p-Values

Test     Raw  Bonferroni
   1  0.1262      0.3785
   2  <.0001      <.0001
   3  <.0001      <.0001

Notice that the adjusted p-values agree with the Bonferroni adjusted p-values in the second MULTINOM analysis.
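
Other adjustments are requested by adding options to the PROC MULTTEST statement. The following sketch is self-contained: instead of using the CATMOD output data sets, it rebuilds the raw p-values from the three chi-square statistics reported above (so they are subject to the rounding of those printed values) and requests the Holm step-down adjustment alongside Bonferroni. The data set name RAWP and the variable TEST are illustrative.

```sas
data rawp;
  input test $ chisq;
  raw_p = sdf('CHISQ', chisq, 1);   /* 1-df upper-tail p-value; RAW_P is the name MULTTEST expects */
  datalines;
1vs2 2.3392
1vs3 88.246
2vs3 47.929
;

proc multtest pdata=rawp bon holm;  /* Bonferroni and Holm adjusted p-values */
run;
```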

Example 2: Confidence intervals for multinomial probabilities

Using the MULTINOM module

Tests and confidence intervals for the multinomial probabilities can be generated using the MULTINOM module by specifying an identity matrix as its final argument. For the example above, this call of MULTINOM provides individual 95% confidence intervals for the three probabilities. Specify "S" as the second argument to obtain Bonferroni-adjusted intervals instead.

      run multinom( {10 18 72} , "I" , 0.05 , {1 0 0, 0 1 0, 0 0 1} );
The MULTINOM module

User-Defined Linear Combinations

95% Individual Large-Sample Confidence Limits

Ci        Estimate     LCL     UCL   ChiSq  Pr>ChiSq
1 0 0       0.1000  0.0412  0.1588  11.111    0.0009
0 1 0       0.1800  0.1047  0.2553  21.951    <.0001
0 0 1       0.7200  0.6320  0.8080  257.14    <.0001
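
For a single probability, the large-sample interval reduces to the familiar Wald form, p ± z*sqrt(p(1-p)/n), and the chi-square statistic tests the hypothesis that the probability is zero. The DATA step below is a minimal sketch of that arithmetic; the data set name PROBCHECK and the variable names are illustrative, and the total of 100 is entered directly rather than computed.

```sas
data probcheck;
  input level count;
  n     = 100;                              /* total of the three counts */
  z     = quantile('NORMAL', 0.975);        /* critical value for 95% limits */
  p     = count / n;                        /* estimated probability */
  se    = sqrt(p * (1 - p) / n);            /* Wald standard error */
  lcl   = p - z * se;
  ucl   = p + z * se;
  chisq = (p / se)**2;                      /* Wald chi-square for H0: p = 0 */
  p_value = sdf('CHISQ', chisq, 1);
  format p se lcl ucl 8.4 chisq 8.3 p_value pvalue6.4;
  datalines;
1 10
2 18
3 72
;

proc print data=probcheck noobs;
  var level p se lcl ucl chisq p_value;
run;
```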

Using PROC CATMOD

The same analysis can be done using PROC CATMOD by modeling the individual probabilities. The RESPONSE statement below selects the first two of the three response level probabilities. You cannot specify all three because they are not independent (they sum to one).

      proc catmod data=a;
        response 1 0 0, 0 1 0;
        weight count;
        model y= / clparm;
        run; quit;
Analysis of Weighted Least Squares Estimates

Parameter  Function Number  Estimate  Standard Error  Chi-Square  Pr > ChiSq  95% Confidence Limits
Intercept                1    0.1000          0.0300       11.11      0.0009  0.0412         0.1588
                         2    0.1800          0.0384       21.95      <.0001  0.1047         0.2553

Since only two of the probabilities are independent in a set of three, only two can be estimated at a time. The following RESPONSE statement picks out the third of the response level probabilities.

      proc catmod data=a;
        response 0 0 1;
        weight count;
        model y= / clparm;
        run; quit;
Analysis of Weighted Least Squares Estimates

Parameter  Estimate  Standard Error  Chi-Square  Pr > ChiSq  95% Confidence Limits
Intercept    0.7200          0.0449      257.14      <.0001  0.6320         0.8080

Example 3: Confidence intervals for multinomial probabilities with multiple populations

When predictors define multiple populations, you need to specify the design matrix in PROC CATMOD. Here is an example with two populations (two levels of predictor A). The design matrix is divided into sets of rows, one set for each population, so the first two rows are for A1 and the second two rows are for A2. Within a population, there is one row for each response function defined in the RESPONSE statement. The same method extends to the use of multiple predictors.

      data a;
        input a y count;
        datalines;
        1 1 10
        1 2 18
        1 3 72
        2 1 10
        2 2 20
        2 3 70
        ;
        
      proc catmod data=a;
        response 1 0 0, 0 1 0;
        weight count;
        population a;
        model y=(1 0 0 0,
                 0 1 0 0,
                 0 0 1 0,
                 0 0 0 1)
                (1="A1,p1",
                 2="A1,p2",
                 3="A2,p1",
                 4="A2,p2") / clparm;
        run; quit;
Analysis of Weighted Least Squares Estimates

Effect  Parameter  Estimate  Standard Error  Chi-Square  Pr > ChiSq  95% Confidence Limits
Model           1    0.1000          0.0300       11.11      0.0009  0.0412         0.1588
                2    0.1800          0.0384       21.95      <.0001  0.1047         0.2553
                3    0.1000          0.0300       11.11      0.0009  0.0412         0.1588
                4    0.2000          0.0400       25.00      <.0001  0.1216         0.2784

As in the previous example, an additional step is needed to estimate the last probability.

      proc catmod data=a;
        response 0 0 1;
        weight count;
        population a;
        model y=(1 0,
                 0 1)
                (1="A1,p3",
                 2="A2,p3") / clparm;
        run; quit;
Analysis of Weighted Least Squares Estimates

Effect  Parameter  Estimate  Standard Error  Chi-Square  Pr > ChiSq  95% Confidence Limits
Model           1    0.7200          0.0449      257.14      <.0001  0.6320         0.8080
                2    0.7000          0.0458      233.33      <.0001  0.6102         0.7898
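
Because the design matrix above assigns a separate parameter to every probability in every population, each estimate is just the within-population sample proportion. The following sketch checks the CATMOD results by computing the proportions and their Wald limits within each level of A. It is an illustration under that assumption, and the data set names POP and POPCHECK and the alias names are hypothetical.

```sas
data pop;
  input a y count;
  datalines;
1 1 10
1 2 18
1 3 72
2 1 10
2 2 20
2 3 70
;

proc sql;
  /* join each cell to its population total, then form p and its standard error */
  create table popcheck as
  select d.a, d.y,
         d.count / t.n as p,
         sqrt(calculated p * (1 - calculated p) / t.n) as se
  from pop as d,
       (select a, sum(count) as n from pop group by a) as t
  where d.a = t.a
  order by d.a, d.y;
quit;

data popcheck;
  set popcheck;
  z     = quantile('NORMAL', 0.975);   /* critical value for 95% limits */
  lcl   = p - z * se;
  ucl   = p + z * se;
  chisq = (p / se)**2;                 /* Wald chi-square for H0: p = 0 */
  format p se lcl ucl 8.4 chisq 8.2;
run;

proc print data=popcheck noobs;
  var a y p se lcl ucl chisq;
run;
```

This approach yields all three probabilities for both populations in one pass, whereas CATMOD requires the separate step shown above for the last probability.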

1 Berry, J.J. and Hurtado, G.I. (1994), "Comparing Non-Independent Proportions," Observations: The Technical Journal for SAS Software Users, 3(4), 21-27.



Operating System and Release Information

Product Family: SAS System
Product: SAS/STAT
SAS Release (Reported / Fixed*): not specified
System:
z/OS
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows XP Professional
Windows Millennium Edition (Me)
Windows Vista
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.