Usage Note 32609: Testing hypotheses or getting confidence intervals for the proportions of a multinomial variable

Suppose each subject in a sample of many subjects is categorized into one level of a multinomial variable and you want to test one or more hypotheses concerning the probabilities associated with the levels. You also want to get confidence intervals for probabilities of each of the levels.

This can be done in PROC CATMOD using the RESPONSE and CONTRAST statements. Alternatively, Berry and Hurtado (1994)¹ presented the MULTINOM SAS/IML module, which gives confidence intervals for linear combinations (such as pairwise differences) of such dependent proportions. A slight modification of MULTINOM also provides a test statistic and associated p-value for each linear combination. MULTINOM is particularly useful when some of the observed counts are zero. In that case, the CATMOD procedure is not useful because the model parameters are not well estimated.

Example 1: Compare the probabilities of levels of a multinomial variable

Using the MULTINOM module

Suppose you take a sample of 100 objects from a population in which objects are of three possible types: 1, 2, or 3. You observe 10 objects of type 1, 18 of type 2, and 72 of type 3. You want to test for differences among the probabilities of the three types. After defining the MULTINOM module (by submitting the module's code to SAS), the following statement calls MULTINOM and produces confidence intervals and hypothesis tests for each pairwise comparison of the types. The first argument is the vector of observed counts. The second argument, "I", requests individual (unadjusted) intervals and tests. The third argument, 0.05, is the significance level, so each interval and test is conducted at the 95% confidence level. The missing value in the last argument requests all pairwise comparisons. You can instead estimate and test particular linear combinations by specifying a matrix, with one linear combination per row, in the last argument.

      run multinom( {10 18 72} , "I" , 0.05 , . );

The MULTINOM module

All Pairwise Differences

95% Individual Large-Sample Confidence Limits For Pi-Pj

I	J	Estimate	LCL	UCL	ChiSq	Pr>ChiSq
1	2	-.0800	-.1825	0.0225	2.3392	0.1262
1	3	-.6200	-.7494	-.4906	88.246	<.0001
2	3	-.5400	-.6929	-.3871	47.929	<.0001
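
The limits and chi-square statistics in this table are consistent with the usual large-sample (Wald) formulas for the difference of two proportions from the same multinomial sample. The worked calculation below checks the first row under that assumption; it is not taken from the module's code, and the notation (estimated proportions \hat p_i = n_i/n, normal quantile 1.96) is introduced here for illustration.

```latex
\begin{align*}
\widehat{\operatorname{Var}}(\hat p_i - \hat p_j)
  &= \frac{\hat p_i + \hat p_j - (\hat p_i - \hat p_j)^2}{n} \\
\text{Types 1 and 2: }\quad
\hat p_1 - \hat p_2 &= 0.10 - 0.18 = -0.08 \\
\widehat{\operatorname{Var}} &= \frac{0.10 + 0.18 - (-0.08)^2}{100} = 0.002736,
  \qquad \mathrm{SE} = \sqrt{0.002736} \approx 0.0523 \\
\text{95\% limits: }\quad
-0.08 \pm 1.96 \times 0.0523 &\approx (-0.1825,\ 0.0225) \\
\chi^2 &= \frac{(-0.08)^2}{0.002736} \approx 2.3392
\end{align*}
```

The covariance between two multinomial proportions is negative (-p_i p_j / n), which is why the variance of the difference is larger than it would be for two independent samples.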

By setting the second argument to "S" rather than "I", the Bonferroni adjustment is applied to the intervals and tests so that they have a simultaneous 95% confidence level:

      run multinom( {10 18 72} , "S" , 0.05 , . );

The MULTINOM module

All Pairwise Differences

95% Simultaneous Large-Sample Confidence Limits For Pi-Pj

I	J	Estimate	LCL	UCL	ChiSq	Pr>ChiSq
1	2	-.0800	-.2052	0.0452	2.3392	0.3785
1	3	-.6200	-.7780	-.4620	88.246	<.0001
2	3	-.5400	-.7267	-.3533	47.929	<.0001

The results show that the probability of type 3 differs significantly from the probabilities of both type 1 and type 2 (p<0.0001), but the probabilities of types 1 and 2 do not differ significantly (Bonferroni-adjusted p=0.3785). A confidence interval for each difference of multinomial probabilities is also given. Because of the Bonferroni adjustment, the overall Type I error rate is controlled at 0.05.
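
As a cross-check on the two tables above, the following SAS/IML sketch recomputes the pairwise differences directly from the counts using large-sample (Wald) formulas. It is an independent illustration, not the MULTINOM module itself, and all matrix names (counts, V, C, and so on) are hypothetical. It assumes that the Bonferroni adjustment divides the significance level by the number of comparisons, which is consistent with the output shown.

```sas
proc iml;
  counts = {10 18 72};              /* observed counts for types 1, 2, 3 */
  n      = sum(counts);
  p      = counts / n;              /* sample proportions (row vector)   */

  /* Estimated covariance matrix of the sample proportions */
  V = (diag(p) - p` * p) / n;

  /* One pairwise difference per row: 1-2, 1-3, 2-3 */
  C = {1 -1  0,
       1  0 -1,
       0  1 -1};
  k = nrow(C);                      /* number of comparisons             */

  est   = C * p`;                   /* estimated differences             */
  se    = sqrt(vecdiag(C * V * C`));
  chisq = (est / se)##2;            /* 1-df Wald chi-square statistics   */
  rawp  = 1 - probchi(chisq, 1);

  alpha = 0.05;
  zI = probit(1 - alpha/2);         /* quantile for individual limits    */
  zS = probit(1 - alpha/(2*k));     /* quantile for Bonferroni limits    */

  lclI = est - zI*se;   uclI = est + zI*se;
  lclS = est - zS*se;   uclS = est + zS*se;
  bonp = (k*rawp) >< 1;             /* Bonferroni p-value, capped at 1   */

  print est se chisq rawp bonp, lclI uclI lclS uclS;
quit;
```

The point estimates, chi-square statistics, and raw p-values are the same under both settings; only the interval width and the adjusted p-values change.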

Using PROC CATMOD

The analysis can also be done with PROC CATMOD as follows. First, a data set is created containing the observed counts for each level of the multinomial variable, Y.

      data a;
        input y count;
        datalines;
      1 10
      2 18
      3 72
      ;

In the CATMOD statements below, the RESPONSE statement defines two response functions, which are two of the pairwise differences, p₁-p₂ and p₁-p₃. Only two can be defined because the set of response functions must be independent, and the third pairwise difference, p₂-p₃, would be redundant: it equals the second difference minus the first, (p₁-p₃)-(p₁-p₂). This last difference is instead obtained using the CONTRAST statement. Since the model contains only an intercept for each response function (no predictors are specified), the intercepts estimate the two pairwise differences, and the second intercept minus the first is the p₂-p₃ difference. The ESTIMATE=PARM option provides a confidence interval for the p₂-p₃ difference. The CLPARM option in the MODEL statement requests confidence intervals for the other two pairwise differences. The ODS OUTPUT statement writes the parameter and contrast results, particularly the p-values, to the data sets PE and C.

      ods output estimates=pe contrasts=c;
      proc catmod data=a;
        response 1 -1 0, 1 0 -1;
        weight count;
        model y= / clparm;
        contrast '2 vs 3' @1 intercept -1 @2 intercept 1 / estimate=parm;
        run; quit;

Analysis of Weighted Least Squares Estimates
Parameter	Function Number	Estimate	Standard Error	Chi-Square	Pr > ChiSq	95% Confidence Limits
Intercept	1	-0.0800	0.0523	2.34	0.1262	-0.1825	0.0225
	2	-0.6200	0.0660	88.25	<.0001	-0.7494	-0.4906

Analysis of Contrasts
Contrast	DF	Chi-Square	Pr > ChiSq
2 vs 3	1	47.93	<.0001

Analysis of Contrasts of Weighted Least Squares Estimates
Contrast	Type	Row	Estimate	Standard Error	Chi-Square	Pr > ChiSq	Alpha	Confidence Limits
2 vs 3	PARM	1	-0.5400	0.0780	47.93	<.0001	0.05	-0.6929	-0.3871

Notice that the chi-square statistics and p-values agree with those from the first MULTINOM analysis.
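
The contrast result can be verified by hand. In the sketch below, b_1 and b_2 are labels introduced here for the two intercepts, and the standard error uses the large-sample variance of a difference of two proportions from one multinomial sample, which is an assumption that matches the printed value.

```latex
\begin{align*}
b_1 &= p_1 - p_2, \qquad b_2 = p_1 - p_3 \\
-b_1 + b_2 &= -(p_1 - p_2) + (p_1 - p_3) = p_2 - p_3 \\
\text{Estimate: }\quad -(-0.08) + (-0.62) &= -0.54 \\
\mathrm{SE} &= \sqrt{\frac{0.18 + 0.72 - (-0.54)^2}{100}} = \sqrt{0.006084} = 0.0780 \\
\chi^2 &= \left(\frac{-0.54}{0.0780}\right)^2 \approx 47.93
\end{align*}
```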

Bonferroni adjustment (or other types of adjustment) of the p-values can be done with PROC MULTTEST. First, the parameter and contrast data sets are combined and the raw p-values are placed in a variable named RAW_P, which PROC MULTTEST requires in order to adjust the p-values.

      data pvals;
        set pe c;
        raw_p=ProbChiSq;
        run;
      proc multtest pdata=pvals bon; 
        run;

The Multtest Procedure

p-Values
Test	Raw	Bonferroni
1	0.1262	0.3785
2	<.0001	<.0001
3	<.0001	<.0001

Notice that the adjusted p-values agree with the Bonferroni adjusted p-values in the second MULTINOM analysis.
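
The Bonferroni adjustment itself is simple arithmetic: each raw p-value is multiplied by the number of tests and capped at 1. The DATA step below is a minimal sketch of that calculation, starting from the three chi-square statistics reported above. The data set name BON_CHECK and the variable names are hypothetical, and the number of tests (3) is entered by hand.

```sas
data bon_check;
  input test chisq;
  m     = 3;                          /* number of tests (entered by hand) */
  raw_p = 1 - probchi(chisq, 1);      /* p-value of a 1-df chi-square      */
  bon_p = min(1, m * raw_p);          /* Bonferroni-adjusted p-value       */
  datalines;
1 2.3392
2 88.246
3 47.929
;
run;

proc print data=bon_check noobs;
  format raw_p bon_p pvalue6.4;
run;
```

For the first test this gives 3 x 0.1262, or about 0.3785, in agreement with the PROC MULTTEST output.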

Example 2: Confidence intervals for multinomial probabilities

Using the MULTINOM module

Tests and confidence intervals for the multinomial probabilities themselves can be generated using the MULTINOM module by specifying an identity matrix as its final argument, so that each row selects a single probability. The output shown below gives individual (unadjusted) 95% confidence limits, which corresponds to "I" as the second argument. As in Example 1, specifying "S" instead requests Bonferroni-adjusted limits.

      run multinom( {10 18 72} , "I" , 0.05 , {1 0 0, 0 1 0, 0 0 1} );

The MULTINOM module

User-Defined Linear Combinations

95% Individual Large-Sample Confidence Limits

Ci	Estimate	LCL	UCL	ChiSq	Pr>ChiSq
1 0 0	0.1000	0.0412	0.1588	11.111	0.0009
0 1 0	0.1800	0.1047	0.2553	21.951	<.0001
0 0 1	0.7200	0.6320	0.8080	257.14	<.0001
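
The limits in this table agree with the standard large-sample (Wald) interval for a single proportion. The calculation for type 1 below is a hand check under that assumption. The chi-square statistic tests the hypothesis that the probability is zero.

```latex
\begin{align*}
\mathrm{SE}(\hat p_i) &= \sqrt{\frac{\hat p_i (1 - \hat p_i)}{n}} \\
\text{Type 1: }\quad \mathrm{SE} &= \sqrt{\frac{0.10 \times 0.90}{100}} = 0.03 \\
0.10 \pm 1.96 \times 0.03 &\approx (0.0412,\ 0.1588) \\
\chi^2 &= \left(\frac{0.10}{0.03}\right)^2 \approx 11.111
\end{align*}
```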

Using PROC CATMOD

The same analysis can be done using PROC CATMOD by modeling the individual probabilities. The RESPONSE statement below selects the first two of the three response level probabilities. You cannot specify all three because they are not independent (they sum to one).

      proc catmod data=a;
        response 1 0 0, 0 1 0;
        weight count;
        model y= / clparm;
        run; quit;

Analysis of Weighted Least Squares Estimates
Parameter	Function Number	Estimate	Standard Error	Chi-Square	Pr > ChiSq	95% Confidence Limits
Intercept	1	0.1000	0.0300	11.11	0.0009	0.0412	0.1588
	2	0.1800	0.0384	21.95	<.0001	0.1047	0.2553

Since only two of the probabilities are independent in a set of three, only two can be estimated at a time. The following RESPONSE statement picks out the third of the response level probabilities.

      proc catmod data=a;
        response 0 0 1;
        weight count;
        model y= / clparm;
        run; quit;

Analysis of Weighted Least Squares Estimates
Parameter	Estimate	Standard Error	Chi-Square	Pr > ChiSq	95% Confidence Limits
Intercept	0.7200	0.0449	257.14	<.0001	0.6320	0.8080

Example 3: Confidence intervals for multinomial probabilities with multiple populations

When predictors define multiple populations, you need to specify the design matrix in PROC CATMOD. Here is an example with two populations (the two levels of predictor A). The design matrix is divided into sets of rows, one set per population, so the first two rows are for A1 and the last two rows are for A2. Within a population, there is one row for each response function defined in the RESPONSE statement. Because the design matrix here is an identity matrix, each parameter estimates one probability in one population. The same method extends to the use of multiple predictors.

      data a;
        input a y count;
        datalines;
        1 1 10
        1 2 18
        1 3 72
        2 1 10
        2 2 20
        2 3 70
        ;
        
      proc catmod data=a;
        response 1 0 0, 0 1 0;
        weight count;
        population a;
        model y=(1 0 0 0,
                 0 1 0 0,
                 0 0 1 0,
                 0 0 0 1)
                (1="A1,p1",
                 2="A1,p2",
                 3="A2,p1",
                 4="A2,p2") / clparm;
        run; quit;

Analysis of Weighted Least Squares Estimates
Effect	Parameter	Estimate	Standard Error	Chi-Square	Pr > ChiSq	95% Confidence Limits
Model	1	0.1000	0.0300	11.11	0.0009	0.0412	0.1588
	2	0.1800	0.0384	21.95	<.0001	0.1047	0.2553
	3	0.1000	0.0300	11.11	0.0009	0.0412	0.1588
	4	0.2000	0.0400	25.00	<.0001	0.1216	0.2784

As in the previous example, an additional step is needed to estimate the last probability.

      proc catmod data=a;
        response 0 0 1;
        weight count;
        population a;
        model y=(1 0,
                 0 1)
                (1="A1,p3",
                 2="A2,p3") / clparm;
        run; quit;

Analysis of Weighted Least Squares Estimates
Effect	Parameter	Estimate	Standard Error	Chi-Square	Pr > ChiSq	95% Confidence Limits
Model	1	0.7200	0.0449	257.14	<.0001	0.6320	0.8080
	2	0.7000	0.0458	233.33	<.0001	0.6102	0.7898
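
The estimates and limits in both Example 3 tables can be cross-checked without PROC CATMOD by computing the Wald interval for each probability within each population. The following is a minimal sketch that assumes the data set A created above, sorted by A as entered. The data set names TOT and WALD_CHECK and the computed variable names are hypothetical.

```sas
/* Total count within each population (level of A) */
proc means data=a noprint nway;
  class a;
  var count;
  output out=tot(keep=a n) sum=n;
run;

/* Proportion, standard error, and 95% Wald limits for every level of Y */
data wald_check;
  merge a tot;
  by a;
  p   = count / n;
  se  = sqrt(p * (1 - p) / n);
  z   = probit(0.975);              /* two-sided 95% normal quantile */
  lcl = p - z * se;
  ucl = p + z * se;
run;

proc print data=wald_check noobs;
  var a y p se lcl ucl;
run;
```

For population A2 and level 3, for example, this gives 0.70 with a standard error of about 0.0458, matching the last row of the CATMOD output.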

¹ Berry, J.J. and Hurtado, G.I. (1994), "Comparing Non-Independent Proportions," Observations: The Technical Journal for SAS Software Users, 3(4), 21-27.

Operating System and Release Information

Product Family	Product	System	SAS Release Reported	SAS Release Fixed*
SAS System	SAS/STAT	z/OS
		OpenVMS VAX
		Microsoft® Windows® for 64-Bit Itanium-based Systems
		Microsoft Windows Server 2003 Datacenter 64-bit Edition
		Microsoft Windows Server 2003 Enterprise 64-bit Edition
		Microsoft Windows XP 64-bit Edition
		Microsoft® Windows® for x64
		OS/2
		Microsoft Windows 95/98
		Microsoft Windows 2000 Advanced Server
		Microsoft Windows 2000 Datacenter Server
		Microsoft Windows 2000 Server
		Microsoft Windows 2000 Professional
		Microsoft Windows NT Workstation
		Microsoft Windows Server 2003 Datacenter Edition
		Microsoft Windows Server 2003 Enterprise Edition
		Microsoft Windows Server 2003 Standard Edition
		Microsoft Windows XP Professional
		Windows Millennium Edition (Me)
		Windows Vista
		64-bit Enabled AIX
		64-bit Enabled HP-UX
		64-bit Enabled Solaris
		ABI+ for Intel Architecture
		AIX
		HP-UX
		HP-UX IPF
		IRIX
		Linux
		Linux for x64
		Linux on Itanium
		OpenVMS Alpha
		OpenVMS on HP Integrity
		Solaris
		Solaris for x64
		Tru64 UNIX

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:
Topic:	SAS Reference ==> Procedures ==> CATMOD
	SAS Reference ==> Procedures ==> IML
	SAS Reference ==> Procedures ==> MULTTEST
	Analytics ==> Categorical Data Analysis
	Analytics ==> Matrix Programming

Date Modified:	2008-07-02 14:19:43
Date Created:	2008-07-02 11:07:11
