SUPPORT / SAMPLES & SAS NOTES
 

Support

Usage Note 63471: Estimating attributable risks when covariates are present using logistic and other models

DetailsAboutRate It

If an exposure variable, such as smoking, is thought to increase the occurrence of some outcome, such as lung cancer, it is often of interest to estimate the proportion reduction in the probability of the outcome that would be achieved if the exposure were removed. This is the intent of the population attributable risk, PAR. A related measure, the attributable risk among the exposed, ARE, assesses the proportion reduction among those exposed.

This note shows how these statistics can be computed using the data in a 2×2 table of exposure by outcome. However, when there are additional factors that affect the probability of outcome, they should be adjusted for when estimating the attributable risk statistics. That adjustment can be done using stratification, which is available in PROC STDRATE. Another approach is to use a suitable model such as a logistic model on the outcome that includes the confounding variables as covariates in the model.

The example titled "Computing Attributable Fraction Estimates" in the PROC STDRATE documentation illustrates the stratification method to adjust for covariates. The example below uses the modeling method to estimate the attributable risk statistics.

In cross-sectional studies, the attributable risk among the exposed is defined as ARE = (Re - Ru)/Re, where Re is the risk of the outcome in the exposed group and Ru is the risk in the unexposed group. The population attributable risk is defined as PAR = (R - Ru)/R, where R is the overall risk of the outcome.NOTE Note that the ARE can also be written in terms of the relative risk, RR = Re/Ru, so that ARE = (RR-1)/RR. PAR can be written as PAR = ARE·pe, where pe is the proportion of observed events that were in the exposed group.

In retrospective, case-control studies and when the risk of the outcome is small, the attributable risk can be estimated by ARE = (OR-1)/OR, where OR is the odds ratio for exposure. The population attributable risk is then computed as above by PAR = ARE·pe.

Example with covariate using a logistic model

The following example uses the chemical exposure data in the Example titled "Computing Attributable Fraction Estimates" in the PROC STDRATE documentation. The modified version of the Factory data set created by the DATA step below creates separate observations for exposed and nonexposed cases at each age along with their counts. Expose is an exposure indicator variable, where Expose=E represents the group exposed to a chemical agent and Expose=NE represents non exposure. Exposure to the agent is linked to increased probability of the outcome event. Age is a confounding factor which should be adjusted for when assessing the effect of exposure. In the STDRATE documentation example, this adjustment is done by stratifying on the levels of Age. In order to adjust for Age by modeling, the numeric variable AgeMid, with values at the midpoints of the age ranges, is created for use as a continuous covariate.

A logistic model on the outcome is fitted by the following PROC LOGISTIC statements which includes both Expose and AgeMid as predictors. The LSMEANS statement provides estimates of the outcome risk in each exposure group adjusted for age. The E option shows the coefficients of the linear combination of model parameters used to estimate the log odds for each group. The ILINK option applies the inverse logit function to the log odds estimates to produce the risk estimates Re and Ru. The STORE statement saves the fitted model for later use in the NLEst macro.

      data Factory;
         input AgeMid Age $ Event_E Count_E Event_NE Count_NE;
         r=Event_E;  n=Count_E;  Expose='E';  output;
         r=Event_NE; n=Count_NE; Expose='NE'; output;
         datalines;
      25 20-29   31  352  143  2626
      35 30-39   57  486  392  4124
      45 40-49   62  538  459  4662
      55 50-59   50  455  337  3622
      65 60-69   38  322  199  2155
      75 70+      9   68   35   414
      ;
      
      proc logistic data=Factory;
         class Expose / param=glm;
         model r/n = AgeMid Expose;
         lsmeans Expose / e ilink;
         store out=log;
         run;

Below are shown the four parameters of the logistic model and the estimated odds ratios. The adjusted Expose estimate is 0.2444 and its adjusted odds ratio estimate is 1.277. The adjusted risk estimates are provided in the Mean column of the Expose Least Squares Means table – Re = 0.1140 and Ru = 0.0915.

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
Parameter   DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept   1 -2.6406 0.0888 883.7678 <.0001
AgeMid   1 0.00691 0.00185 13.9804 0.0002
Expose E 1 0.2444 0.0725 11.3536 0.0008
Expose N 0 0 . . .

Odds Ratio Estimates
Effect Point Estimate 95% Wald
Confidence Limits
AgeMid 1.007 1.003 1.011
Expose E vs N 1.277 1.108 1.472

Coefficients for Expose Least
Squares Means
Parameter Expose Row1 Row2
Intercept   1 1
AgeMid   50 50
Expose E E 1  
Expose N N   1

Expose Least Squares Means
Expose Estimate Standard Error z Value Pr > |z| Mean Standard Error
of Mean
E -2.0507 0.06782 -30.24 <.0001 0.1140 0.006849
N -2.2951 0.02757 -83.26 <.0001 0.09153 0.002292

In the DATA FD step below, the first two rows express the adjusted risk for each group in terms of the linear combination coefficients from the LSMEANS statement (shown in the Coefficients for Expose Least Squares Means table above) which multiply the model parameters (named B_p1 - B_p4). Note that the expressions involve the mean of the AgeMid covariate, 50. The inverse logit function is then applied using the LOGISTIC function. The odds ratio for Expose is recomputed in the third row and is simply the third parameter estimate (Expose), exponentiated. The fourth and fifth rows define the attributable risk and population attributable risk as defined above. PAR is computed as ARE·pe. The observed value of pe is provided by the PROC FREQ step and is found to be 0.1363. Since the outcome event is relatively rare (around 10% as seen above), the attributable risks are also approximated using the odds ratio for exposure in rows 6 and 7 of the data set. The NLEst macro is then called to re-estimate the two group risks, the exposure odds ratio, and to estimate the attributable risk and population attributable risk along with tests and confidence intervals.

The NLEst macro is a general macro for estimating and testing linear or nonlinear combinations of model parameters. Because the inverse logit function or exponentiation is used in the expressions, nonlinear functions must be estimated. The macro is used by providing the saved model from the STORE statement and a data set containing the expressions to be estimated along with labels. The name of the saved model is specified in INSTORE=, and the name of the data set of expressions and labels is specified in FDATA=.

      proc freq data=Factory; 
         table Expose;
         weight r;
         run;

      data fd; 
         length label f $32767; 
         infile datalines delimiter=',';
         input label f; 
         datalines;
      Risk(E)  , logistic(B_p1+50*B_p2+B_p3)
      Risk(NE) , logistic(B_p1+50*B_p2)
      OR (Expose) , exp(B_p3)
      ARE ,( logistic(B_p1+50*B_p2+B_p3) - logistic(B_p1+50*B_p2) ) / ( logistic(B_p1+50*B_p2+B_p3) )
      PAR , 0.1363*( logistic(B_p1+50*B_p2+B_p3) - logistic(B_p1+50*B_p2) ) / ( logistic(B_p1+50*B_p2+B_p3) )
      ARE OR , (exp(B_p3)-1)/exp(B_p3)
      PAR OR , 0.1363*(exp(B_p3)-1)/exp(B_p3)
      ;
      
      %NLEst( instore=log, fdata=fd )

The group risks, Re and Ru, and the exposure odds ratio agree with those in the PROC LOGISTIC results above. The estimated attributable risk, 0.197, and the estimated population attributable risk, 0.027, closely match those obtained by stratification from PROC STDRATE. Additionally, the approximate estimates based on the exposure odds ratio are similar.

These results indicate that nearly 3% of the outcome event is attributable to the exposure to the chemical agent. In other words, the proportion of the outcome event would be reduced by about 3% in the population if exposure to the chemical agent were removed. Among those exposed to the agent, the outcome event would be reduced by about 20%.

The FREQ Procedure

Expose Frequency Percent Cumulative
Frequency
Cumulative
Percent
E 247 13.63 247 13.63
N 1565 86.37 1812 100.00

Nonlinear Function Estimate

Label Estimate Standard Error Wald Chi-Square Pr > ChiSq Alpha Lower Upper
Risk(E) 0.1140 0.006849 276.95 <.0001 0.05 0.1006 0.1274
Risk(NE) 0.09153 0.002292 1594.54 <.0001 0.05 0.08703 0.09602
OR (Expose) 1.2769 0.09262 190.05 <.0001 0.05 1.0953 1.4584
ARE 0.1970 0.05179 14.47 0.0001 0.05 0.09549 0.2985
PAR 0.02685 0.007059 14.47 0.0001 0.05 0.01302 0.04069
ARE OR 0.2168 0.05681 14.57 0.0001 0.05 0.1055 0.3282
PAR OR 0.02956 0.007743 14.57 0.0001 0.05 0.01438 0.04473

Using Zou's Poisson model to estimate the relative risk

Another modeling approach is to use a Poisson GEE model (Zou, 2004) which allows for estimation of the relative risk. This is further illustrated in this note. The Factory data are expanded to have a binary response with separate observations for events (Y=1) and nonevents (Y=0) with frequency counts (Count). An ID variable is added which simply numbers the final observations 1, 2, ... . PROC GENMOD with a REPEATED statement is then used to fit a Poisson GEE model to the binary response with Expose and AgeMid as predictors as before. The LSMEANS statement is again added with the E and ILINK options to provide the risk estimates for each group and the coefficients that define them. The DIFF and EXP options are also added to provide the relative risk estimate. The STORE statement saves the model for later use.

      data Factory2;
         set Factory;
         y=1; Count=r; ID+1; output;
         y=0; Count=n-r; ID+1; output;
         run;
      proc genmod data=Factory2;
         class ID Expose;
         model y = AgeMid Expose / dist=poisson;
         freq Count;
         repeated subject=ID;
         lsmeans Expose / e ilink diff exp;
         store gen;
         run;

Note that the estimated relative risk, 1.246 in the Exponentiated column of the Differences of Expose Least Squares Means table, is similar to the previously estimated odds ratio, 1.277, suggesting that using the odds ratio approximating formula is reasonable.

The GENMOD Procedure

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
Parameter   Estimate Standard
Error
95% Confidence Limits Z Pr > |Z|
Intercept   -2.7048 0.0767 -2.8552 -2.5545 -35.26 <.0001
AgeMid   0.0063 0.0016 0.0032 0.0094 3.96 <.0001
Expose E 0.2196 0.0647 0.0929 0.3464 3.40 0.0007
Expose N 0.0000 0.0000 0.0000 0.0000 . .

Coefficients for Expose Least
Squares Means
Parameter Expose Row1 Row2
Intercept   1 1
AgeMid   44.955 44.955
Expose E E 1  
Expose N N   1

Expose Least Squares Means
Expose Estimate Standard Error z Value Pr > |z| Mean Standard Error
of Mean
Exponentiated
E -2.2036 0.06007 -36.68 <.0001 0.1104 0.006633 0.1104
N -2.4232 0.02418 -100.22 <.0001 0.08864 0.002143 0.08864

Differences of Expose Least Squares Means
Expose _Expose Estimate Standard Error z Value Pr > |z| Exponentiated
E N 0.2196 0.06467 3.40 0.0007 1.2456

The LSMEANS coefficients table above shows, for each Expose level, the vector of coefficients that multiplies the vector of model parameters. Since the response function being modeled is log(mean) and the mean is just the probability of the outcome event, the DIFF option comparing the two groups estimates the difference in log(mean) values, or equivalently, the log relative risk. Exponentiating that difference (EXP option) produces an estimate of the relative risk. Equivalently, the difference of the two coefficient vectors can be used to multiply the parameter vector to obtain the log mean difference which can then be exponentiated. The difference of the two group vectors is simply (0 0 1 -1), but since the fourth parameter estimate is zero, the difference of log means is just 1*B_p3 = B_p3, the Expose parameter estimate. The relative risk estimate is then exp(B_p3) and this is used in the first data line in the following DATA step to recompute the relative risk. The next two data lines use the formula involving the relative risk to compute the ARE which is then used to compute the PAR.

      data fd; 
         length label f $32767; 
         infile datalines delimiter=',';
         input label f; 
         datalines;
         RR , exp(B_p3)
         ARE RR , ( (exp(B_p3)-1)/exp(B_p3) )
         PAR RR , ( 0.1363*(exp(B_p3)-1)/exp(B_p3) )
         ;
      %NLEst( instore=gen, fdata=fd )

The resulting estimates for the relative risk, the ARE, and PAR and very close to those directly estimated from the separate group risks as above.

Nonlinear Function Estimate

Label Estimate Standard Error Wald Chi-Square Pr > ChiSq Alpha Lower Upper
RR 1.2456 0.08055 239.139 <.0001 0.05 1.0877 1.4035
ARE RR 0.1972 0.05192 14.424 0.0001 0.05 0.09542 0.2989
PAR RR 0.02687 0.007076 14.424 0.0001 0.05 0.01301 0.04074

__________

References:

Fleiss, J. L., Levin, B., and Paik, M. C. (2003), Statistical Methods for Rates and Proportions, 3d ed. New York: John Wiley & Sons, Inc.

Zou, G. (2004), "A Modified Poisson Regression Approach to Prospective Studies with Binary Data," Am. J. Epidemiol., 159:702-706.


NOTE: In some cases, exposure reduces the probability of the outcome event rather than increases it. For example, exposure to the flu vaccine reduces the probability of getting the flu. In this situation, the attributable benefit in the exposed and population attributable benefit can be computed using similar formulas: ABE = (Ru - Re)/Ru and PAB = (R - Re)/R.










Operating System and Release Information

Product FamilyProductSystemSAS Release
ReportedFixed*
SAS SystemSAS/STATz/OS 64-bit
z/OS
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 8 Enterprise 32-bit
Microsoft Windows 8 Enterprise x64
Microsoft Windows 8 Pro 32-bit
Microsoft Windows 8 Pro x64
Microsoft Windows 8.1 Enterprise 32-bit
Microsoft Windows 8.1 Enterprise x64
Microsoft Windows 8.1 Pro 32-bit
Microsoft Windows 8.1 Pro x64
Microsoft Windows 10
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows Server 2003 for x64
Microsoft Windows Server 2008
Microsoft Windows Server 2008 R2
Microsoft Windows Server 2008 for x64
Microsoft Windows Server 2012 Datacenter
Microsoft Windows Server 2012 R2 Datacenter
Microsoft Windows Server 2012 R2 Std
Microsoft Windows Server 2012 Std
Microsoft Windows Server 2016
Microsoft Windows XP Professional
Windows 7 Enterprise 32 bit
Windows 7 Enterprise x64
Windows 7 Home Premium 32 bit
Windows 7 Home Premium x64
Windows 7 Professional 32 bit
Windows 7 Professional x64
Windows 7 Ultimate 32 bit
Windows 7 Ultimate x64
Windows Millennium Edition (Me)
Windows Vista
Windows Vista for x64
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.