Miscellaneous Examples for PROC LOGISTIC

``` /****************************************************************/
/*          S A S   S A M P L E   L I B R A R Y                 */
/*                                                              */
/*    NAME: LOGIMISC                                            */
/*   TITLE: Miscellaneous Examples for PROC LOGISTIC            */
/* PRODUCT: STAT                                                */
/*  SYSTEM: ALL                                                 */
/*    KEYS: logistic regression analysis,                       */
/*   PROCS: LOGISTIC                                            */
/*    DATA:                                                     */
/*                                                              */
/* SUPPORT: YCS                   UPDATE:                       */
/*     REF:                                                     */
/*          chapter                                             */
/*    MISC:                                                     */
/*                                                              */
/****************************************************************/

/*****************************************************************
Example 1. Using Proc Logistic for Poisson Regression
*****************************************************************/

/*
The data, taken from Koch et. al (1986), consist of counts of
new melanoma cases reported during 1969-1971 among white males
and the corresponding estimated populations at risk for six
age groups and two geographic regions.  Possion regression
was proposed to study the vital statistics factors on new
melanoma cases.  Since the event frequencies are relatively
small as compared to the corresponding population sizes, the
logistic regression estimates are essentially the same as
the Poisson regression estimates.
*/

title1 'Example 1. Poisson Regression';
data;
drop case_n case_s pop_n pop_s;
length age_grp \$ 5;
input age_grp \$ case_n case_s pop_n pop_s;
age_grp2 = ( age_grp = '35-44' );
age_grp3 = ( age_grp = '45-54' );
age_grp4 = ( age_grp = '55-64' );
age_grp5 = ( age_grp = '65-74' );
age_grp6 = ( age_grp = '>=75' );
case = case_n; pop = pop_n; region = 0; output;
case = case_s; pop = pop_s; region = 1; output;
label region = '0 = NORTHERN  1 = SOUTHERN';
datalines;
<35    61 64 2880262 1074246
35-44  76 75  564535  220407
45-54  98 68  592983  198119
55-64 104 63  450740  134084
65-74  63 45  270908   70708
>=75   80 27  161850   34233
;

proc logistic;
model case/pop=age_grp2-age_grp6 region;
run;

/*****************************************************************
Example 2. Overdispersion in Quantal Assay Data
*****************************************************************/

/*
In bioassay, one is often interested in the median effect dose,
ED50, which is the concentration of a chemical that is expected
to produce a response in 50% of the subjects exposed to it.
By fitting simple models such as logit, probit, or complementary
log-log, ED50 is a simple function of the regression coeffi-
cients. However, the basic underlying assumption that the
quantal assay data follow a binomial distribution may not
always be appropriate. This example presents an analysis that
incorporates extra-binomial variation to the data.

The following SAS statements create the data set ASSAY, which
contains a set of dose response data. Variable LOGCONC
represents the log concentration of a chemical, N represents
the number of subjects dosed, and R represents the number of
subjects responded.

PROC LOGISTIC is first used to fit a logit model to the quantal
assay data. The option SCALE=NONE is specified to display the
goodness-of-fit statistics. Linear predictor values and
Pearson residuals are output to data set OUT1, which is
subsequently used in PROC GPLOT to display the adequacy of
the fit.
*/

title1 'Example 2. Overdispersion in Quantal Assay Data';
data assay;
input logconc r n;
logit=log(r/(n-r));
datalines;
2.68  10  31
2.76  17  30
2.82  12  31
2.90   7  27
3.02  23  26
3.04  22  30
3.13  29  31
3.20  29  30
3.21  23  30
;

proc logistic data=assay outest=dn covout;
model r/n= logconc / scale=none;
output out=out1 xbeta=xb reschi=reschi;
run;

proc sgplot data=out1 noautolegend;
title 'Plot of Data with Fitted Line';
scatter y=logit x=logconc;
series  y=xb    x=logconc;
label logit   = 'Logit of Proportion Responded'
logconc = 'Log Concentration';
run;

proc sgplot data=out1;
title 'Residual Plot';
scatter y=reschi x=logconc;
label logit   = 'Pearson Residual'
logconc = 'Log Concentration';
run;

/*
Both goodness-of-fits statistics are statistically significant,
indicating a lack of fit of the model. However, there is not
a clear pattern of Pearson residuals and the fit of the
model appears to be adequate. This suggests that extra-binomial
variation is responsible for the poor fit.

The Williams method of modeling overdispersion is then used.
This is requested by specifying the SCALE=WILLIAMS option.
*/

title2 'Scale=Williams';
proc logistic data=assay outest=dw covout;
model r/n= logconc / scale=williams;
run;

/*
With log of concentration as the only explanatory variable,
ED50 is estimated by exp(-intercept/slope). A 95% confidence
interval of ED50 is obtained by exponentiating the 95%
confidence interval for (-intercept/slope). The latter is
conveniently calculated using Fieller's theorem.

In a single DATA step, you can compute the confidence limits
for ED50 from the OUTEST= data set that contains both
parameter estimates and covariance matrix. The following SAS
statements compute the 95% confidence interval for ED50 for
each of the SCALE= option available in PROC LOGISTIC. An
OUTEST= data set that contains both estimates and dispersion
is first output for each SCALE= option.  The four OUTEST= data
sets are then concatenated in data set TWO and confidence
limits are subsequently calculated for each data set.
*/

title2 'Scale=Pearson';
proc logistic data=assay outest=dp covout noprint;
model r/n= logconc / scale=p;
run;

title2 'Scale=Deviance';
proc logistic data=assay outest=dd covout noprint;
model r/n= logconc / scale=d;
run;

data two(keep=scale ed50 lower upper);
retain b0 b1 v00 v01 v11;
set dn(in=dn) dw(in=dw) dp(in=dp) dd(in=dd);
if _name_='r' then do;
b0= INTERCEPT;
b1= logconc;
end;
if _name_='Intercept' then do;
v00=INTERCEPT;
v01=logconc;
end;
if _name_='logconc ' then do;
if dn=1 then scale='NONE    ';
else if dw=1 then scale='WILLIAMS';
else if dp=1 then scale='PEARSON ';
else if dd=1 then scale='DEVIANCE';
v11=logconc;
theta=-b0/b1;
t975=tinv(.975,7);
c=t975*t975*v11/(b1*b1);
term1=theta+(c/(1-c))*(theta + v01/v11);
term2=t975/(b1*(1-c));
term3=(v00+2*theta*v01+theta*theta*
v11-c*(v00-v01*v01/v11));
term4=term2*sqrt(term3);
l1=term1-term4;
u1=term1+term4;
ed50= exp(theta);
lower=exp(l1);
upper=exp(u1);
output;
end;
run;

title2 'ED50 and 95% Confidence Interval';
proc print data=two;
id scale;
run;

/*****************************************************************
Example 3. 1:M Matching
*****************************************************************/

/*
Consider the special case that (i) there is only one case in
each matched set and the number of controls are the same for all
sets, and (ii) there is only a single binary covariable with
relative risk exp(beta). The Samsu consumption data in Breslow
(1982) contains 80 matched sets. Each matched set has one case
and four controls. The distribution according to Samsu exposure
is given in the following table:

Case          Total Number (Case + Controls)
0   1   2   3   4  5

Exposed           .   5  19  10   6  0
Not Exposed      10  15   8   7   0  .
Total            10  20  27  17   6  .

Let n_0m be the number of sets in which exactly m controls are
exposed and the case is not; and let n_1m be the number of sets
in which the case and m controls are exposed. The likelihood of
the conditional logistic regression model is proportional to the
product of the likelihood of the binomial distribution
B(N_m, theta_m), where

N_m = n_0m + n_1m
and
theta_m = m * exp(beta) / [m * exp(beta) + 5 - m]

Since logit(theta_m) = log( m/(5-m) + beta ), beta can be
estimated as a parameter in a logistic regression model with
no intercept and an offset value of m/(5-m).

This analysis is performed with the PROC LOGISTIC invocation
below.
*/

title1 'Example 3. 1:M Matching';
data samsu;
input m r n;
samsu = 1;
off= log(m/(5-m));
datalines;
1  5 20
2 19 27
3 10 17
4  6  6
;

proc logistic data=samsu;
model r/n = samsu / offset=off noint;
run;

/*
Note that you can perform the same analysis using the STRATA statement to
perform a conditional logistic regression.

First create the SAMSU data set as a single record for each subject.
*/

data one;
input exp1-exp5 freq;
datalines;
1 0 0 0 0 5
1 1 0 0 0 19
1 1 1 0 0 10
1 1 1 1 0 6
1 1 1 1 1 0
0 0 0 0 0 10
0 1 0 0 0 15
0 1 1 0 0 8
0 1 1 1 0 7
0 1 1 1 1 0
;

data two;
set one;
retain id 0;
do i=1 to freq;
id=id+1;
output;
end;
run;

data three;
set two;
case=1; samsu=exp1; output;
case=0; samsu=exp2; output;
samsu=exp3; output;
samsu=exp4; output;
samsu=exp5; output;
keep case samsu id;
run;

/*
Then run a conditional logistic regression.
*/

proc logistic data=three;
model case(event='1')=samsu;
strata id;
run;

```