The SURVEYREG Procedure

Example 98.7 Domain Analysis

You can use PROC SURVEYREG to perform domain analysis for a subgroup of interest. To illustrate, this example uses a data set from the National Health and Nutrition Examination Survey I (NHANES I) Epidemiologic Followup Study (NHEFS), described in Example 97.2 in Chapter 97: The SURVEYPHREG Procedure.

The NHEFS is a national longitudinal survey that is conducted by the National Center for Health Statistics, the National Institute on Aging, and some other agencies of the Public Health Service in the United States. Some important objectives of this survey are to determine the relationships between clinical, nutritional, and behavioral factors; to determine the relationship between mortality and hospital utilization; and to monitor changes in risk factors for the initial cohort that represents the NHANES I population. A cohort of size 14,407, which includes all persons 25 to 74 years old who completed a medical examination at NHANES I in 1971–1975, was selected for the NHEFS. Personal interviews were conducted for every selected unit during the first wave of data collection from the year 1982 to 1984. Follow-up studies were conducted in 1986, 1987, and 1992. In the year 1986, only nondeceased persons 55 to 74 years old (as reported in the base year survey) were interviewed. The 1987 and 1992 NHEFS contain the entire nondeceased NHEFS cohort. Vital and tracing status data, interview data, health care facility stay data, and mortality data for all four waves are available for public use. See for more information about the survey and the data sets.

For illustration purposes, 1,018 observations from the 1987 NHEFS public use interview data are used to create the data set cancer. The observations are obtained from 10 strata that contain 596 PSUs. The sum of observation weights for these selected units is over 19 million. Observation weights range from 359 to 129,359 with a mean of 18,747.69 and a median of 11,414.
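Once the data set cancer has been created (the DATA step appears later in this example), the summary figures quoted above can be checked with a short sketch like the following. It assumes that PSU identifiers are nested within strata, so a PSU is counted as a distinct Strata and PSU combination; the column names NStrata and NPSU are chosen here only for illustration.

```sas
/* Summary of the sampling weights: count, total, range, mean, median */
proc means data=cancer n sum min max mean median;
   var ObservationWt;
run;

/* Count strata and PSUs; assumes PSU codes repeat across strata, */
/* so each PSU is identified by its (Strata, PSU) pair            */
proc sql;
   select count(distinct Strata)                 as NStrata,
          count(distinct catx('_', Strata, PSU)) as NPSU
   from cancer;
quit;
```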

The following variables are used in this example:

  • ObsNo, unit identification

  • Strata, stratum identification

  • PSU, identification for primary sampling units

  • ObservationWt, sampling weight associated with each unit

  • Age, the event-time variable, defined as follows:

    • age of the subject when the first cancer was reported for subjects with reported cancer

    • age of the subject at death for deceased subjects without reported cancer

    • age of the subject as reported in 1987 follow-up (this value is used for nondeceased subjects who never reported cancer)

    • age of the subject for the entry year 1971–1975 survey if the subject has cancer (or is deceased) but the date of incident is not reported

  • Cancer, cancer indicator (1 = cancer reported, 0 = cancer not reported)

  • BodyWeight, body weight of the subject as reported in the 1987 follow-up, or an imputed body weight based on the subject’s age in the entry year 1971–1975 survey

The following SAS statements create the data set cancer. Note that BodyWeight is imputed for a small share of the observations (8%), based on Age, by using a deterministic regression imputation model (Särndal and Lundström (2005, chapter 12)). The imputed values are treated as observed values in this example. In other words, this example treats the data set cancer as the observed data set.

data cancer;
   input ObsNo Strata PSU ObservationWt Age Cancer BodyWeight;
   datalines;
   1  3  002  3805   53  1  175
   2  3  002  6107   77  0  175
   3  3  039  2968   50  0  160
   4  3  084  30438  52  0  145
   5  3  007  5081   80  0  127
   6  3  009  3891   62  0  180
   7  3  009  3580   50  0  157
   8  3  022  2968   56  0  142
   9  3  050  23748  60  0  140
  10  3  060  48264  69  0  168

   ... more lines ...

1016  4  002   2689  40  0  120
1017  4  092  45888  52  0  166
1018  4  035   4347  58  0  156
;

Suppose you want to study how aging affects body weight in the subgroup of cancer patients for the base year survey population. Because cancer status is unrelated to the design of the sample, cancer patients are not a design stratum but a subpopulation whose sample size is random. This kind of analysis is called domain analysis (subgroup analysis).

The following statements request a linear regression of BodyWeight on Age among cancer patients. The STRATA, CLUSTER, and WEIGHT statements identify the variance strata, PSUs, and analysis weights, respectively. The DOMAIN statement defines the subgroups of people who have been diagnosed with cancer and people who do not have cancer. The ODS SELECT statement requests that PROC SURVEYREG display only the analysis in the subgroup Cancer = 1 in the output.

title1 'Study of Body Weight and Age among Cancer Patients';
ods graphics on;
proc surveyreg data=cancer plot=fit;
   strata strata;
   cluster psu;
   weight ObservationWt;
   model bodyweight = age;
   domain cancer;
   ods select where=(_labelpath_ ? 'Cancer=1');
run;
ods graphics off;
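The ODS SELECT statement only filters what is displayed; the DOMAIN statement is what defines the subgroups. As a minimal sketch, dropping ODS SELECT from the same step should display the analysis for the full sample and for each level of Cancer, which can be useful for comparing the two domains. The title text below is an illustrative choice, not part of the original example.

```sas
title1 'Body Weight and Age: Full Sample and Cancer Domains';
proc surveyreg data=cancer;
   strata strata;          /* variance strata */
   cluster psu;            /* primary sampling units */
   weight ObservationWt;   /* analysis weights */
   model bodyweight = age;
   domain cancer;          /* separate analyses for Cancer=0 and Cancer=1 */
run;
```

Subsetting the input data to cancer patients before running the procedure is generally not equivalent to using the DOMAIN statement, because variance estimation for a domain relies on the design information from the full sample.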

Output 98.7.1 gives a summary of the data and the parameter estimates of the linear regression in domain Cancer = 1. The estimated coefficient for Age is -0.40 with a p-value of 0.0971, so the analysis indicates that aging does not significantly affect body weight among cancer patients at the 5% level.

Output 98.7.1: Domain Analysis Among Cancer Patients

Study of Body Weight and Age among Cancer Patients

The SURVEYREG Procedure
Domain Regression Analysis for Variable BodyWeight

Domain Summary
Number of Observations                    1017
Number of Observations in Domain          119
Number of Observations Not in Domain      898
Sum of Weights in Domain                  2211545.0
Weighted Mean of BodyWeight               164.87655
Weighted Sum of BodyWeight                364631909

Estimated Regression Coefficients
Parameter    Estimate      Standard Error    t Value    Pr > |t|
Intercept    189.614789    14.9467889        12.69      <.0001
Age          -0.398556     0.2398447         -1.66      0.0971

Note: The denominator degrees of freedom for the t tests is 586.
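The reported degrees of freedom match the number of PSUs minus the number of strata (596 - 10 = 586). As a rough check on the Age row of the table, the following sketch recomputes the t value and the two-sided p-value from the displayed estimate and standard error. The variable names are illustrative, and the results should agree with the output only up to rounding.

```sas
data _null_;
   estimate = -0.398556;    /* Age coefficient from Output 98.7.1 */
   se       = 0.2398447;    /* its standard error */
   df       = 596 - 10;     /* PSUs minus strata */
   t        = estimate / se;
   p        = 2 * (1 - probt(abs(t), df));   /* two-sided p-value */
   put t=8.2 p=8.4 df=;     /* writes the results to the SAS log */
run;
```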

When ODS Graphics is enabled and the model contains a single continuous regressor, PROC SURVEYREG displays a plot of the fitted model, which is shown in Output 98.7.2.

Output 98.7.2: Regression Fitting