Example 93.3 Domain Analysis

This example uses a data set from the NHANES I Epidemiologic Followup Study (NHEFS); see Example 93.2 for more information about the NHEFS.

For illustration purposes, 1,891 observations from the 1992 NHEFS vital and tracing status data set are used to estimate the regression coefficients of a proportional hazards model. The observations are obtained from 22 strata; each stratum contains either two or three primary sampling units (PSUs). The sum of observation weights for these selected units is almost 103 million. Observation weights range from 1,498 to 470,154, with a mean of 54,457.11 and a median of 45,246. Although this example uses the observation weights directly, Binder (1992) suggests that a scaled version of the observation weights would be useful to improve the performance of the optimization routine.

The following variables are created in the data set mortality:

• ID, unit identification

• VARSTRATA, stratum identification

• VARPSU, identification for primary sampling units

• SWEIGHT, sampling weight associated with each unit

• AGE, the subject’s reported age at the 1992 interview if the subject was alive at that time; otherwise, the subject’s age at death

• VITALSTATUS, vital status of subject in 1992 (1 = alive, 3 = dead, 4 = unknown, 5 = traced alive with direct subject contact, 6 = traced alive without direct subject contact)

• POVARIND, indicator for poverty area where the subject’s household was located at the NHANES I (1971–1975) exam (1 = poverty area, 2 = non-poverty area)

• GENDER, gender of the subject (1 = male, 2 = female)

data mortality;
input ID VARSTRATA VARPSU SWEIGHT AGE VITALSTATUS POVARIND GENDER;
datalines;
1  03  1  13312    66  1   1   1
2  03  1   7941    71  3   1   2
3  03  1  16048     .  4   1   1
4  03  3   9298    58  3   1   1
5  03  2  15336    56  3   1   2
6  03  1  14744    63  1   1   1
7  03  2  83729    70  1   2   2
8  03  3 106492    57  1   2   1
9  03  3  78083    81  3   2   2
10  03  3  55957    79  3   2   1

... more lines ...

1890  13  1  88939  59  1  2   1
1891  13  1  59218  75  1  2   2
;
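Before a model is fit, the weight summaries quoted earlier can be checked, and a scaled copy of the weights can be created along the lines of Binder's (1992) suggestion. The following statements are only a sketch: the text does not say which scaling to use, so dividing by the mean weight is an assumption, and the names mean_w, mortality_scaled, and SWEIGHT_SCALED are hypothetical.

```sas
/* Summarize the sampling weights: sum, range, mean, and median */
proc means data=mortality sum min max mean median;
   var sweight;
run;

/* Store the mean weight in a macro variable (hypothetical name mean_w) */
proc sql noprint;
   select mean(sweight) format=best32. into :mean_w trimmed
   from mortality;
quit;

/* Create a scaled weight; scaling by the mean is an assumption */
data mortality_scaled;
   set mortality;
   sweight_scaled = sweight / &mean_w;
run;
```

The scaled variable could then be named in the WEIGHT statement in place of SWEIGHT. The example that follows continues to use the unscaled weights.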

Suppose you want to estimate the hazard function for mortality time after adjusting for the poverty area indicator in the base year survey population. The following SAS statements request a proportional hazards regression of age (AGE) on the poverty indicator (POVARIND):

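proc surveyphreg data = mortality nomcar;      /* NOMCAR: variance accounts for random respondent set */
   class povarind;                             /* categorical covariate */
   strata varstrata;                           /* variance strata */
   cluster varpsu;                             /* variance PSUs */
   weight sweight;                             /* sampling weights */
   model age*vitalstatus(1 4 5 6) = povarind;  /* values 1, 4, 5, 6 are censored */
   domain gender;                              /* separate analyses by gender */
run;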

Subjects with VITALSTATUS 1, 4, 5, or 6 are considered alive and are treated as censored. The CLASS statement specifies that POVARIND is a categorical variable, the WEIGHT statement identifies the sampling weights, the STRATA statement identifies the variance strata, and the CLUSTER statement identifies the variance PSUs. The DOMAIN statement requests three separate analyses: one for the overall data set, one for the male subpopulation, and one for the female subpopulation.

There are 223 observation units with missing values for AGE. All of these units have vital status 1, 4, 5, or 6, so these subjects are considered to be alive in the current survey year, 1992. The age of every observation unit in the base year survey is known from the 1971–1975 NHANES I. One reasonable approach is to determine the current age of these 223 units from their age in the NHANES I data set. However, for illustration purposes, this example does not include the observation units with missing age when estimating the regression coefficients. Instead, an analysis of just the set of respondents is requested by specifying the NOMCAR option in the PROC SURVEYPHREG statement. This option uses a variance estimator that accounts for the random size of the set of respondents.
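The statement that every unit with a missing age has vital status 1, 4, 5, or 6 can be checked directly against the data. The following statements are a minimal sketch of such a check, assuming the full mortality data set has been created as shown earlier:

```sas
/* Tabulate vital status for the units whose age is missing */
proc freq data=mortality;
   where missing(age);
   tables vitalstatus;
run;
```

If the description in this example holds, the table contains 223 observations in total and no row for VITALSTATUS = 3.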

Output 93.3.1 shows summary statistics for the overall analysis. A total of 1,891 observations are read from the input DATA= data set, but only 1,668 observations are used in the analysis. The remaining 223 observations have missing values for the variable AGE. The respondent data set represents almost 89.5 million units in the population. There are 22 strata and 55 clusters. Although only 57% of the observation units in the sample are alive (censored), an estimated 69% of the observation units in the population are alive. This difference is reasonable because the selection probabilities for observation units are not the same. If you do not use the sampling weights, then your sample-based estimators might be biased for the corresponding finite population quantities. The Variance Estimation table indicates that the NOMCAR option is used for variance estimation.

Output 93.3.1: Summary Statistics for the Entire Population

The SURVEYPHREG Procedure

Number of Observations Read          1891
Number of Observations Used          1668
Sum of Weights Read            1.0298e+08
Sum of Weights Used           8.94396e+07

Design Summary
Number of Strata 22
Number of Clusters 55

Summary of the Number of Event and Censored Values
Total  Event  Censored  Percent Censored
 1668    717       951             57.01

Summary of the Weighted Number of Event and Censored Values
   Total     Event  Censored  Percent Censored
89439590  27650348  61789242             69.08

Variance Estimation
Method Taylor Series
Missing Values NOMCAR
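The gap between the unweighted and weighted censoring percentages can be reproduced from the counts in Output 93.3.1. The following DATA step is an illustrative sketch only; the variable names pct_unweighted and pct_weighted are hypothetical.

```sas
data _null_;
   /* Counts copied from Output 93.3.1 */
   pct_unweighted = 100 * 951 / 1668;              /* censored / total in the sample */
   pct_weighted   = 100 * 61789242 / 89439590;     /* weighted censored / weighted total */
   put pct_unweighted= 6.2 pct_weighted= 6.2;
run;
```

The two values written to the log should agree with the Percent Censored entries in the unweighted and weighted summary tables.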

Output 93.3.2 displays the estimated regression coefficients and their standard errors. The poverty indicator POVARIND has two levels, so only one parameter is estimable. By default, PROC SURVEYPHREG estimates the first level (POVARIND 1) and assigns a zero value to the second level. The estimated regression coefficient is 0.385 with a standard error of 0.078. The estimated hazard for the poverty areas is 1.47 times the estimated hazard for the non-poverty areas. The degrees of freedom are equal to the number of PSUs (55) minus the number of strata (22), or 33.

Output 93.3.2: Inference for the Entire Population

Analysis of Maximum Likelihood Estimates
Parameter   DF  Estimate  Standard Error  t Value  Pr > |t|  Hazard Ratio
POVARIND 1  33  0.384961        0.077586     4.96    <.0001         1.470
POVARIND 2  33  0                      .        .         .         1.000
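The hazard ratio, t value, and degrees of freedom in Output 93.3.2 are related by simple arithmetic. The following DATA step is a sketch of that arithmetic. It assumes that the p-value is a two-sided tail probability from a t distribution with 33 degrees of freedom, which is consistent with the DF and Pr > |t| columns but is not stated explicitly in this example.

```sas
data _null_;
   /* Values copied from Output 93.3.2 and the Design Summary table */
   estimate = 0.384961;
   stderr   = 0.077586;
   df       = 55 - 22;                        /* PSUs minus strata */
   hr       = exp(estimate);                  /* hazard ratio */
   t        = estimate / stderr;              /* t statistic */
   p        = 2 * (1 - probt(abs(t), df));    /* two-sided p-value (assumed form) */
   put df= hr= 6.3 t= 6.2 p= pvalue6.4;
run;
```

The values written to the log should agree with the POVARIND 1 row of the output.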

Output 93.3.3 shows that 813 observation units in the sample are male, and they account for over 42.6 million males in the base year survey population. Approximately half of these observation units in the sample are censored, and an estimated 64.5% of the observation units are censored for the male subpopulation.

Output 93.3.3: Summary Statistics for the Male Subpopulation

The SURVEYPHREG Procedure

Domain Analysis for domain GENDER=1

Number of Observations Read        1891
Number of Observations Used         813
Sum of Weights Read            48887067
Sum of Weights Used            42629905

Summary of the Number of Event and Censored Values
Total  Event  Censored  Percent Censored
  813    404       409             50.31

Summary of the Weighted Number of Event and Censored Values
   Total     Event  Censored  Percent Censored
42629905  15126321  27503584             64.52

Output 93.3.4 shows that the estimated regression coefficient for POVARIND 1 is 0.425 with a standard error of 0.157. The estimated hazard for the males in the poverty areas is 1.53 times the estimated hazard for the males in the non-poverty areas. The degrees of freedom for the t test of significance for the male subpopulation are equal to the total number of PSUs (55) minus the total number of strata (22), or 33.

Output 93.3.4: Inference for the Male Subpopulation

Analysis of Maximum Likelihood Estimates
Parameter   DF  Estimate  Standard Error  t Value  Pr > |t|  Hazard Ratio
POVARIND 1  33  0.424922        0.156583     2.71    0.0105         1.529
POVARIND 2  33  0                      .        .         .         1.000

Output 93.3.5 displays summary statistics for the female subpopulation. There are 855 observation units for females in the sample, and they represent over 46.8 million females in the base year survey population. Although 63.4% of the females in the sample are alive, an estimated 73.2% of the females in the subpopulation are alive.

Output 93.3.5: Summary Statistics for the Female Subpopulation

The SURVEYPHREG Procedure

Domain Analysis for domain GENDER=2

Number of Observations Read        1891
Number of Observations Used         855
Sum of Weights Read            54091604
Sum of Weights Used            46809685

Summary of the Number of Event and Censored Values
Total  Event  Censored  Percent Censored
  855    313       542             63.39

Summary of the Weighted Number of Event and Censored Values
   Total     Event  Censored  Percent Censored
46809685  12524027  34285658             73.24

Output 93.3.6 shows that the estimated proportional hazards regression coefficient for POVARIND for the female subpopulation (0.435) is slightly higher than the corresponding estimate for the male subpopulation (0.425). The estimated hazard for the females in the poverty areas is 1.54 times the estimated hazard for the females in the non-poverty areas. The degrees of freedom for the t test of significance for the female subpopulation are equal to the total number of PSUs (55) minus the total number of strata (22), or 33.

Output 93.3.6: Inference for the Female Subpopulation

Analysis of Maximum Likelihood Estimates
Parameter   DF  Estimate  Standard Error  t Value  Pr > |t|  Hazard Ratio
POVARIND 1  33  0.434579        0.115766     3.75    0.0007         1.544
POVARIND 2  33  0                      .        .         .         1.000
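To view the three sets of estimates side by side, the reported coefficients and standard errors can be entered by hand and the hazard ratios and t values recomputed. The following statements are an illustrative sketch; the data set name domain_check and its variable names are hypothetical, and the numbers are copied from Output 93.3.2, Output 93.3.4, and Output 93.3.6.

```sas
data domain_check;
   length domain $ 8;
   input domain $ estimate stderr;
   hr = exp(estimate);        /* hazard ratio for POVARIND 1 versus 2 */
   t  = estimate / stderr;    /* t statistic */
   datalines;
Overall 0.384961 0.077586
Male    0.424922 0.156583
Female  0.434579 0.115766
;

proc print data=domain_check noobs;
   format hr 6.3 t 6.2;
run;
```

The printed hazard ratios and t values should agree with the corresponding output tables. Each domain uses the same 33 degrees of freedom because the degrees of freedom are based on the total numbers of PSUs and strata.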