The HPLOGISTIC Procedure

Example 5.4 Conditional Logistic Regression for Matched Pairs Data

In matched pairs (case-control) studies, conditional logistic regression is used to investigate the relationship between an outcome of being an event (case) or a non-event (control) and a set of prognostic factors.

The following data are a subset of the data from the Los Angeles Study of Endometrial Cancer reported in Breslow and Day (1980). There are 63 matched pairs, each consisting of a case of endometrial cancer (Outcome=1) and a control (Outcome=0). The case and its corresponding control have the same ID. Two prognostic factors are included: Gall (an indicator variable for gall bladder disease) and Hyper (an indicator variable for hypertension). The goal of the case-control analysis is to determine the relative risk for gall bladder disease, controlling for the effect of hypertension.

data Data1;
  do ID=1 to 63;
    do Outcome = 1 to 0 by -1;
      input Gall Hyper @@;
      output;
    end;
  end;
  datalines; 
0 0  0 0    0 0  0 0    0 1  0 1    0 0  1 0    1 0  0 1    
0 1  0 0    1 0  0 0    1 1  0 1    0 0  0 0    0 0  0 0    
1 0  0 0    0 0  0 1    1 0  0 1    1 0  1 0    1 0  0 1    
0 1  0 0    0 0  1 1    0 0  1 1    0 0  0 1    0 1  0 0   
0 0  1 1    0 1  0 1    0 1  0 0    0 0  0 0    0 0  0 0    
0 0  0 1    1 0  0 1    0 0  0 1    1 0  0 0    0 1  0 0    
0 1  0 0    0 1  0 0    0 1  0 0    0 0  0 0    1 1  1 1    
0 0  0 1    0 1  0 0    0 1  0 1    0 1  0 1    0 1  0 0   
0 0  0 0    0 1  1 0    0 0  0 1    0 0  0 0    1 0  0 0    
0 0  0 0    1 1  0 0    0 1  0 0    0 0  0 0    0 1  0 1    
0 0  0 0    0 1  0 1    0 1  0 0    0 1  0 0    1 0  0 0    
0 0  0 0    1 1  1 0    0 0  0 0    0 0  0 0    1 1  0 0   
1 0  1 0    0 1  0 0    1 0  0 0    
;

When each matched set consists of one event and one non-event, the conditional likelihood is given by

\[  \prod _ i \left(1+\exp \left(-(\bm {x}_{i1}-\bm {x}_{i0})'\bm {\psi }\right)\right)^{-1}  \]

where $\bm {x}_{i1}$ and $\bm {x}_{i0}$ are vectors that represent the prognostic factors for the event and non-event, respectively, of the $i$th matched set. This likelihood is identical to the likelihood of fitting a logistic regression model to a set of data with constant response, where the model contains no intercept term and has explanatory variables given by $\bm {d}_ i=\bm {x}_{i1} - \bm {x}_{i0}$ (Breslow, 1982).
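
The reduction to an intercept-free logistic model can be seen in one step. The following sketch assumes a stratum-specific intercept $\alpha _ i$ for the $i$th matched set (a symbol introduced here for illustration only) and conditions on the pair containing exactly one event:

\[  \frac{\exp (\alpha _ i+\bm {x}_{i1}'\bm {\psi })}{\exp (\alpha _ i+\bm {x}_{i1}'\bm {\psi })+\exp (\alpha _ i+\bm {x}_{i0}'\bm {\psi })} = \frac{1}{1+\exp \left(-(\bm {x}_{i1}-\bm {x}_{i0})'\bm {\psi }\right)} = \frac{1}{1+\exp (-\bm {d}_ i'\bm {\psi })}  \]

The nuisance intercept $\alpha _ i$ cancels, and each factor has the form of a logistic probability with linear predictor $\bm {d}_ i'\bm {\psi }$ and no intercept. This is why fitting a constant response to the differences without an intercept reproduces the conditional likelihood.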

To apply this method, the following DATA step transforms each matched pair into a single observation, where the variables Gall and Hyper contain the differences between the corresponding values for the case and the control (case – control). The variable Outcome, which is used as the response variable in the logistic regression model, is given a constant value of 0 (which is the Outcome value for the control, although any constant, numeric or character, suffices).

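data Data2;
   set Data1;
   drop id1 gall1 hyper1;
   /* Hold the case's values until the control record is read */
   retain id1 gall1 hyper1 0;
   if (ID = id1) then do;
      /* Second record of the pair (control): form case - control and output */
      Gall=gall1-Gall; Hyper=hyper1-Hyper;
      output;
   end;
   else do;
      /* First record of the pair (case): save its values */
      id1=ID; gall1=Gall; hyper1=Hyper;
   end;
run;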

Note that there are 63 observations in the data set Data2, one for each matched pair. Because the number of observations (n) is halved, statistics that depend on n, such as $R^2$, will be incorrect. The variable Outcome has a constant value of 0.
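
A quick way to confirm the transformation is to tabulate the difference variables. The following statements are an optional check that is not part of the original example. Each difference should take only the values -1, 0, and 1, and a hand count of the data lines above gives Gall=1 for 13 pairs and Gall=-1 for 5 pairs.

proc freq data=Data2;
   /* One-way frequency tables of the case - control differences */
   tables Gall Hyper;
run;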

In the following statements, PROC HPLOGISTIC is invoked with the NOINT option to obtain the conditional logistic model estimates. Because the option CL is specified, PROC HPLOGISTIC computes a 95% confidence interval for the parameter.

proc hplogistic data=Data2;
   model outcome=Gall / noint cl;
run;
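
The preceding model contains Gall only. To adjust for hypertension, as the stated goal of the analysis suggests, the difference variable Hyper can be added to the MODEL statement. The following statements are a sketch of that variation; its output is not shown in this example.

proc hplogistic data=Data2;
   /* Same conditional model, now with both difference variables */
   model outcome=Gall Hyper / noint cl;
run;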

Results from the conditional logistic analysis are shown in Output 5.4.1 through Output 5.4.3.

Output 5.4.1 shows that you are fitting a binary logistic regression where the response variable Outcome has only one level.

Output 5.4.1: Conditional Logistic Regression (Gall as Risk Factor)

The HPLOGISTIC Procedure

Performance Information
Execution Mode Single-Machine
Number of Threads 4

Model Information
Data Source WORK.DATA2
Response Variable Outcome
Distribution Binary
Link Function Logit
Optimization Technique Newton-Raphson with Ridging

Number of Observations Read 63
Number of Observations Used 63

Response Profile
Ordered Value   Outcome   Total Frequency
1               0         63

You are modeling the probability that Outcome='0'.



Output 5.4.2 shows that the model is marginally significant (p=0.0550).

Output 5.4.2: Conditional Logistic Regression (Gall as Risk Factor)

Iteration History
Iteration   Evaluations   Objective Function   Change       Max Gradient
0           4             0.6662698453         .            0.015669
1           2             0.6639330101         0.00233684   0.001351
2           2             0.6639171997         0.00001581   6.88E-6
3           2             0.6639171993         0.00000000   1.83E-10

Convergence criterion (GCONV=1E-8) satisfied.

Dimensions
Columns in X 1
Number of Effects 1
Max Effect Columns 1
Rank of Cross-product Matrix 1
Parameters in Optimization 1

Fit Statistics
-2 Log Likelihood 83.6536
AIC (smaller is better) 85.6536
AICC (smaller is better) 85.7191
BIC (smaller is better) 87.7967

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 3.6830 1 0.0550
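
The likelihood ratio statistic can be reproduced by hand. Under the null hypothesis every pair contributes a factor of 1/2 to the conditional likelihood, so with 63 pairs (writing $L_0$ for the null likelihood and $\hat{L}$ for the maximized likelihood, symbols introduced here):

\[  -2\log L_0 = 2\cdot 63\cdot \log 2 \approx 87.3365, \qquad -2\log L_0 - (-2\log \hat{L}) \approx 87.3365 - 83.6536 = 3.6829  \]

This agrees with the reported chi-square of 3.6830 up to rounding of the displayed -2 log likelihood.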


Note that there is no intercept term in the Parameter Estimates table in Output 5.4.3. The intercepts have been conditioned out of the analysis.

Output 5.4.3: Conditional Logistic Regression (Gall as Risk Factor)

Parameter Estimates
Parameter   Estimate   Standard Error   DF      t Value   Pr > |t|   Alpha   Lower      Upper
Gall        0.9555     0.5262           Infty   1.82      0.0694     0.05    -0.07589   1.9869
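
With a single binary covariate, the conditional maximum likelihood estimate has a well-known closed form in the discordant pairs, which gives an independent check on the table. Here $n_{10}$ denotes the number of pairs in which only the case has gall bladder disease and $n_{01}$ the number in which only the control does (notation introduced here). A hand count of the data lines gives $n_{10}=13$ and $n_{01}=5$, so that

\[  \hat{\psi } = \log \frac{n_{10}}{n_{01}} = \log \frac{13}{5} \approx 0.9555, \qquad \mathrm{SE}(\hat{\psi }) = \sqrt{\frac{1}{n_{10}}+\frac{1}{n_{01}}} = \sqrt{\frac{1}{13}+\frac{1}{5}} \approx 0.5262  \]

Both values match the Estimate and Standard Error columns.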


The odds ratio estimate for Gall is $\exp (0.9555)=2.60$, which is marginally significant ($p$=0.0694) and which is an estimate of the relative risk for gall bladder disease. A subject who has gall bladder disease has 2.6 times the odds of having endometrial cancer as a subject who does not have gall bladder disease. A 95% confidence interval for this relative risk, produced by exponentiating the confidence interval for the parameter, is (0.927, 7.293).
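
The exponentiation can be carried out in a short DATA step. The following statements are a minimal sketch that copies the estimate and confidence limits by hand from Output 5.4.3; the variable names OddsRatio, ORLower, and ORUpper are illustrative. The values written to the log should agree with 2.60 and (0.927, 7.293).

data _null_;
   /* Values copied from the Parameter Estimates table in Output 5.4.3 */
   Estimate = 0.9555;
   Lower    = -0.07589;
   Upper    = 1.9869;
   /* Move from the log-odds scale to the odds ratio scale */
   OddsRatio = exp(Estimate);
   ORLower   = exp(Lower);
   ORUpper   = exp(Upper);
   put OddsRatio= 8.3 ORLower= 8.3 ORUpper= 8.3;
run;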