|
Some Examples of Using the IPF function in IML
IPF users have occasionally faced the following problem. You have one or more
sets of marginal frequencies available from one source, and you want to use
these margins to adjust the frequencies in a contigency table derived from a
second source. The IPF function requires that the marginal tables be combined
into a single "complete" contigency table (called TABLE in the IPF parameter
list). The question is: how do you form this "complete" table on the basis
of marginal subtables?
This document demonstrates that the IML procedure makes it easy to build a
"complete" table based on the margins of the available tables, and that IML
can then proceed in a straghtforward manner to get the ML estimates from the
IPF function. Two examples are used to illustrate the method.
1. The first example is a simple one in which you have 2 sets of margins,
but no data from a second source. (In IPF notation, there is no INIT
matrix because there are no higher-order interactions to preserve from a
second source of data). Suppose you have 2 two-dimensional tables: X*Y
and Y*Z, and you want to infer the ML estimates of the X*Z table.
2. The second example is the 5-variable model given in Bishop, Fienberg,
and Holland (hereafter BF&H), pp. 107-111. You have a 4-way table
(SEX*SCHOOL*AGE*REGION), for which you want to preserve the given
interactions. You have 3 new marginal tables: each one a cross-
classification of a fifth variable (PARTY) with one of the original
variables (SEX, SCHOOL, or AGE). You want to infer the ML estimates of
a subtable of PARTY*REGION, preserving all of the given margins in all
of the marginal tables.
******************************************************************************
EXAMPLE 1:
The data and the corresponding analysis are on pages 3 and 4. The program has
numerous comments and is pretty much self-explanatory, but I would note the
following:
1. The computation of the "complete" table requires 7 lines, and uses
nothing more complicated than the min() function. It is done under Step
2 in the section entitled "COMBINE THE 2 JOINT DISTRIBUTIONS INTO A 3-
WAY TABLE."
2. I included a number of extra sections in the program that are not
strictly necessary. The only purpose of the end of Step 2 is to set up
anm output data set so that one can verify easily (with proc FREQ in
Step 3) that the given margins are preserved. Clearly, one could skip
that, and go directly to the IPF step without leaving IML.
EXAMPLE 2:
The code to input the four marginal tables is on page 5. The IML step to
compute the fitted values is on page 7. The results are on page 7. Note the
following:
1. The computation of the "complete" table requires 9 lines, and uses
nothing more complicated than the min() function. It is done under Step
2 in the section entitled "COMBINE THE 3 MARGINS INTO A 3-WAY TABLE."
2. The method of computation in Method 1 in BF&H, page 109.
3. The TABLE matrix contains the new 1-way margins for PARTY members. The
IPF function specifies only these 1-way marginals so that any higher-
order interactions inadvertently built into TABLE are not propogated
into the fitted estimates. Instead, the higher-order interactions are
obtained from the INIT matrix, which is nothing more than the original
4-way table.
4. The results on page 7 are different from the results in BF&H, page 111.
This discrepancy undoubtedly arises from the following fact. Strictly
speaking, the method of IPF is invalidated when zero marginals must be
propogated into the internal cells of a table that originally had
nonzero margins. BF&H allude to this problem in the second sentence on
page 101, but fail to point out that this exact problem occurs in the
current data due to the zero marginal for party members in the table of
PARTY*SCHOOL (Table 3.7-5 in BF&H, p. 109).
IPF is supposed to maintain the higher-order interactions of the initial
table (in this case, the initial 4-way table). However, one glance at
the estimates on page 111 makes it clear that this is not possible; any
odds ratio comparing the first 2 levels of SCHOOL must be undefined
because of the zeros in column 1. The best that one can do is to
preserve the odds ratios for those comparisons that do not involve zero
frequencies. The SAS IPF algorithm does that, as shown in the following
item.
5. To verify that IPF is preserving interactions to the extent possible
with zero frequencies, the CATMOD procedure was used to compute the 6
logg odds ratios for each AGE*SCHOOL table within the 4 categories of
SEX*REGION. This was done for (1) the original 4-way table, (2) the SAS
IPF estimates for party members, and (3) the BF&H estimates for party
members. Log 0 is defined to be 0.
6. The results are shown on page 8 (there are no BF&H estimates available
for rural region). The results show that the original log odds ratios
are preserved by the SAS IPF function estimates, except for the ones
that are restricted to be zero because of the zero marginal. It is not
clear how BF&H arrived at their results, which preserve none of the
original logg-odss ratios.
****************************************************************;
* Example 1 for IPF ;
****************************************************************;
****************************************************************;
* Step 1: Input the data ;
****************************************************************;
/*---Table of A*B---*/
data ab;
input a b wt @@;
cards;
1 1 8 1 2 2 2 1 4 2 2 4
;
run;
proc freq;
weight wt;
table a*b / list;
run;
/*---Table of C*B---*/
data cb;
input c b wt @@;
cards;
1 1 6 1 2 0 2 1 6 2 2 6
;
run;
proc freq;
weight wt;
table c*b / list;
run;
****************************************************************;
* Step 2: Use IML to create the "complete" table ;
****************************************************************;
proc iml;
/*----compute joint distribution for a by b----*/
use ab;
read all into abyb;
ab=shape(abyb[,3], 2, 2);
/*----compute joint distribution for c by b---*/
use cb;
read all into cbyb;
cb=shape(cbyb[,3], 2, 2);
/*----combine the 2 joint distributions into a 3-way table----*/
i=1;
tab=shape(0,1,8);
do a=1 to 2;
do b=1 to 2;
do c=1 to 2;
tab[i]=min(ab[a,b], cb[c,b]);
ab[a,b]=ab[a,b]-tab[i];
cb[c,b]=cb[c,b]-tab[i];
i=i+1;
end;
end;
end;
/*----set up the variable profiles and the variable names----*/
r1={1 1 1 1 2 2 2 2};
r2={1 1 2 2 1 1 2 2};
r3={1 2 1 2 1 2 1 2};
cname={ a b c wt };
matrix=r1//r2//r3//tab;
matrix=matrix`;
/*---output to a sas data set---*/
create out from matrix[colname=cname];
append from matrix;
quit;
****************************************************************;
* Step 3: verify that "complete" table has correct 2-way margins;
****************************************************************;
proc print data=out;
run;
proc freq data=out;
weight wt;
table a*b c*b / list;
run;
***************************************************************;
* Step 4: use IML to do IPF for model (ab, cb) ;
***************************************************************;
proc iml;
use out;
read all into matrix;
/* var3 var2 var1 */
/* ----- ----- ----- */
/* c b a */
table=matrix[,4];
dim={2 2 2};
config={1 3, 2 2};
call ipf(fit, status, dim, table, config);
matrix[,4]=fit;
cname={ a b c wt };
create fit from matrix[colname=cname];
append from matrix;
quit;
***************************************************************;
* Step 5: verify that the "fitted" table has correct 2-way ;
* margins. Project the fitted table onto the A*C plane. ;
***************************************************************;
proc print data=fit;
run;
proc freq data=fit;
weight wt;
table a*b c*b a*c / list;
run;
******************************************************************;
* Example 2 for IPF ;
******************************************************************;
******************************************************************;
* Step 1: Input the data ;
******************************************************************;
data t374;
do sex=1 to 2;
do age=1 to 3;
do region=1 to 2;
do school=1 to 4;
input wt @;
output;
end;
end;
end;
end;
cards;
631571 4381454 7557356 915307 1719046 5266179 5718852 153669
1176623 4174797 5384310 1055949 2386652 4297528 3050642 316853
2376214 1807879 1187537 501741 5378665 1862132 376084 68963
664308 2843925 9180042 1274867 2536078 4246961 5886728 232464
3825831 4481894 6873450 1070910 7181386 4653138 3089527 241507
7432953 1888644 1312793 275948 13089113 1071037 207498 24283
;
run;
/*---a table containing 1 or more old vars, & 1 or more new vars---*/
data t375;
do party=1 to 2;
do school=1 to 4;
input wt @;
output;
end;
end;
cards;
0 2290478 4580957 1367696
48398440 38685090 45243862 4764765
;
run;
data t376;
do party=1 to 2;
do sex=1 to 2;
input wt @;
output;
end;
end;
cards;
6632500 1606631
55113503 81978654
;
run;
data t377;
do party=1 to 2;
do age=1 to 3;
input wt @;
output;
end;
end;
cards;
1623109 4663348 1952674
51585698 48597649 36908810
;
run;
/*---reorder the initial data in a data set called init---*/
proc freq data=t374;
table school*sex*age*region / noprint out=init;
weight wt;
run;
******************************************************************;
* Step 2: run IML to do iterative proportional fitting ;
******************************************************************;
proc iml;
/*---compute marginals for school (party members only)---*/
use t375;
read all into t375;
schoolp=shape(t375[,3], 2, 4);
schpy=schoolp[1,];
/*---compute marginals for sex (party members only)---*/
use t376;
read all into t376;
sexp=shape(t376[,3], 2, 2);
sexpy=sexp[1,];
/*---compute marginals for age (party members only)---*/
use t377;
read all into t377;
agep=shape(t377[,3], 2, 3);
agepy=agep[1,];
/*---combine marginals into a 3-way table---*/
i=1;
tab=shape(0,1,24);
do school=1 to 4;
do sex=1 to 2;
do age=1 to 3;
a=min(schpy[school], sexpy[sex]);
a=min(a,agepy[age]);
tab[i]=a;
i=i+1;
schpy[school]=schpy[school]-a;
sexpy[sex]=sexpy[sex]-a;
agepy[age]=agepy[age]-a;
end;
end;
end;
/*---introduce region, creating a 4-way table---*/
table=tab`@{.5, .5};
/*---read in 1959 initial 4-way table, & adjust for new margins---*/
use init;
read all into mat;
init=mat[,5];
dim={ 2 3 2 4 };
/* var4 var3 var2 var1 */
/* ------ ---- ---- ------ */
cname={ school sex age region count };
config={ 2 3 4 };
call ipf(fit, status, dim, table, config, init);
/*---output the profiles and the fitted frequencies---*/
mat[,5]=fit;
mat=mat[,1:5];
create party from mat[colname=cname];
append from mat;
quit;
******************************************************************;
* Step 3: print the results for region=1 (urban) only ;
******************************************************************;
options ls=71;
data ex2step3;
set party;
if region=1;
output;
drop region;
run;
proc print;
run;
OBS SCHOOL SEX AGE COUNT
1 1 1 1 0.00
2 1 1 2 0.00
3 1 1 3 0.00
4 1 2 1 0.00
5 1 2 2 0.00
6 1 2 3 0.00
7 2 1 1 130144.93
8 2 1 2 469654.23
9 2 1 3 314525.54
10 2 2 1 19866.07
11 2 2 2 118573.93
12 2 2 3 77271.87
13 3 1 1 502528.35
14 3 1 2 1355984.17
15 3 1 3 462503.89
16 3 2 1 143555.59
17 3 2 2 407084.06
18 3 2 3 120240.08
19 4 1 1 108406.69
20 4 1 2 473659.36
21 4 1 3 348053.85
22 4 2 1 35509.04
23 4 2 2 112969.42
24 4 2 3 45017.24
Example 2, Continued
Estimated Log Odds Ratios for AGE * SCHOOL
By SEX and REGION
Orignal SAS IPF BF&H
Sex Region # 4-way Table Estimates Estimates
------ ----------- ---- ----------- --------- ---------
1 1 1 -0.67051 0.00000 0.00000
1 1 2 -2.21028 0.00000 0.00000
1 1 3 -0.29072 -0.29072 -0.44965
1 1 4 -0.96541 -0.96541 -1.39497
1 1 5 0.48197 0.48197 0.50422
1 1 6 1.24946 1.24946 1.19495
1 2 1 -0.53139 0.00000 -------
1 2 2 -2.18025 0.00000 -------
1 2 3 -0.42515 -0.42515 -------
1 2 4 -1.68213 -1.68213 -------
1 2 5 1.35205 1.35205 -------
1 2 6 1.92048 1.92048 -------
2 1 1 -1.29592 0.00000 0.00000
2 1 2 -2.82426 0.00000 0.00000
2 1 3 -0.74423 -0.74423 -0.69594
2 1 4 -1.53555 -1.53555 -1.47673
2 1 5 0.11503 0.11503 0.21765
2 1 6 0.41449 0.41449 0.57172
2 2 1 -0.94954 0.00000 -------
2 2 2 -3.01874 0.00000 -------
2 2 3 -0.73602 -0.73602 -------
2 2 4 -1.96776 -1.96776 -------
2 2 5 0.68285 0.68285 -------
2 2 6 1.08638 1.08638 -------
|