The SURVEYPHREG Procedure

Example 97.3 Domain Analysis

This example uses a data set from the NHANES I Epidemiologic Followup Study (NHEFS); see Example 97.2 for more information about the NHEFS.

For illustration purposes, 1,891 observations from the 1992 NHEFS vital and tracing status data set are used to estimate the regression coefficients of a proportional hazards model. The observations are obtained from 22 strata; each stratum contains either two or three primary sampling units (PSUs). The sum of observation weights for these selected units is almost 103 million. Observation weights range from 1,498 to 470,154, with a mean of 54,457.11 and a median of 45,246. Although this example uses the observation weights directly, Binder (1992) suggests that a scaled version of the observation weights would be useful to improve the performance of the optimization routine.

The following variables are created in the data set mortality:

ID, unit identification
VARSTRATA, stratum identification
VARPSU, identification for primary sampling units
SWEIGHT, sampling weight associated with each unit
AGE, the subject’s reported age at the 1992 interview if the subject was alive at that time; otherwise, the subject’s age at death
VITALSTATUS, vital status of subject in 1992 (1 = alive, 3 = dead, 4 = unknown, 5 = traced alive with direct subject contact, 6 = traced alive without direct subject contact)
POVARIND, indicator for poverty area where the subject’s household was located at the NHANES I (1971–1975) exam (1 = poverty area, 2 = non-poverty area)
GENDER, gender of the subject (1 = male, 2 = female)

data mortality;
   input ID VARSTRATA VARPSU SWEIGHT AGE VITALSTATUS POVARIND GENDER;
   datalines;
      1  03  1  13312    66  1   1   1
      2  03  1   7941    71  3   1   2
      3  03  1  16048     .  4   1   1
      4  03  3   9298    58  3   1   1
      5  03  2  15336    56  3   1   2
      6  03  1  14744    63  1   1   1
      7  03  2  83729    70  1   2   2
      8  03  3 106492    57  1   2   1
      9  03  3  78083    81  3   2   2
     10  03  3  55957    79  3   2   1

   ... more lines ...   

   1890  13  1  88939  59  1  2   1
   1891  13  1  59218  75  1  2   2
;
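
The scaled weights that Binder (1992) suggests are not constructed in this example, and the source does not say which scaling is meant. As a minimal sketch, assuming that a simple rescaling to a mean of 1 is acceptable, the following statements create a scaled weight. The names mortality_scaled and SWEIGHT_SCALED are hypothetical and are not used elsewhere in this example.

```sas
/* Sketch only: rescale the sampling weights so that they average 1. */
/* MORTALITY_SCALED and SWEIGHT_SCALED are hypothetical names.       */
proc sql;
   create table mortality_scaled as
   select *,
          sweight / mean(sweight) as sweight_scaled  /* mean over all rows */
   from mortality;
quit;
```

To try the scaled weight, you would specify DATA=mortality_scaled and replace SWEIGHT with SWEIGHT_SCALED in the WEIGHT statement. Because every weight is divided by the same constant, the estimated regression coefficients are not expected to change; only the numerical scale that the optimizer works with changes.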

Suppose you want to estimate the hazard function for mortality time after adjusting for the poverty area indicator in the base year survey population. The following SAS statements request a proportional hazards regression of age (AGE) on poverty indicator (POVARIND):

proc surveyphreg data = mortality nomcar;
   class povarind;
   strata varstrata;
   cluster varpsu;
   weight sweight;
   model age*vitalstatus(1 4 5 6) = povarind;
   domain gender;
run;

Subjects with VITALSTATUS 1, 4, 5, or 6 are considered alive, so these values are listed in parentheses in the MODEL statement as censoring values. The CLASS statement specifies that POVARIND is a categorical variable, the WEIGHT statement identifies the sampling weights, the STRATA statement identifies variance strata, and the CLUSTER statement identifies variance PSUs. The DOMAIN statement requests three separate analyses: one for the overall data set, one for the male subpopulation, and one for the female subpopulation.

There are 223 observation units with missing values for AGE. All of these units have vital status 1, 4, 5, or 6, so the subjects are considered to be alive in the survey year 1992. Age for every observation unit in the base year survey is known from the 1971–1975 NHANES I. One reasonable approach is to determine the age of these 223 units from their age in the NHANES I data set. However, for illustration purposes, this example does not include the observation units with missing age when estimating the regression coefficients. Instead, an analysis of just the set of respondents is requested by specifying the NOMCAR option in the PROC SURVEYPHREG statement. This option uses a variance estimator that accounts for the random size of the set of respondents.
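
Before fitting the model, you might want to confirm how missing AGE values are distributed across the vital status codes. The following statements are a sketch of one way to do that. The data set mortality_check and the variables AGE_MISSING and EVENT are hypothetical helpers that are not part of the original example.

```sas
/* Sketch only: cross-tabulate missing AGE against VITALSTATUS.      */
/* MORTALITY_CHECK, AGE_MISSING, and EVENT are hypothetical names.   */
data mortality_check;
   set mortality;
   age_missing = missing(age);     /* 1 if AGE is missing, 0 otherwise      */
   event       = (vitalstatus = 3); /* 1 = dead (event), 0 = treated as censored */
run;

proc freq data=mortality_check;
   tables age_missing*vitalstatus / missing norow nocol nopercent;
   tables event;
run;
```

If the full data set matches the description above, the row for AGE_MISSING=1 should contain 223 observations, none of them in the VITALSTATUS=3 column.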

Output 97.3.1 shows summary statistics for the overall analysis. A total of 1,891 observations are read from the input DATA= data set, but only 1,668 observations are used in the analysis. The remaining 223 observations have missing values for the variable AGE. The respondent data set represents almost 89.5 million units in the population. There are 22 strata and 55 clusters. Although only 57% of the observation units in the sample are alive (censored), an estimated 69% of the observation units in the population are alive. This difference is reasonable because the selection probabilities are not the same for all observation units. If you do not use the sampling weights, then your sample-based estimators might be biased for the corresponding finite population quantities. The “Variance Estimation” table indicates that the NOMCAR option is used for variance estimation.

Output 97.3.1: Summary Statistics for the Entire Population

The SURVEYPHREG Procedure

Number of Observations Read	1891
Number of Observations Used	1668
Sum of Weights Read	1.0298E8
Sum of Weights Used	89439590

Design Summary
Number of Strata	22
Number of Clusters	55

Summary of the Number of Event and Censored Values
Total	Event	Censored	Percent Censored
1668	717	951	57.01

Summary of the Weighted Number of Event and Censored Values
Total	Event	Censored	Percent Censored
89439590	27650348	61789242	69.08

Variance Estimation
Method	Taylor Series
Missing Values	NOMCAR
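
The unweighted and weighted percent censored in Output 97.3.1 can be cross-checked outside PROC SURVEYPHREG. The following statements are a sketch that uses PROC FREQ with and without a WEIGHT statement. The data set respondents and the variable CENSORED are hypothetical names introduced only for this check.

```sas
/* Sketch only: compare unweighted and weighted percent censored.    */
/* RESPONDENTS and CENSORED are hypothetical names.                  */
data respondents;
   set mortality;
   where not missing(age);                    /* keep the respondents only */
   censored = (vitalstatus in (1, 4, 5, 6));  /* 1 = alive, 0 = dead       */
run;

proc freq data=respondents;   /* unweighted: share of sample units */
   tables censored;
run;

proc freq data=respondents;   /* weighted: share of the represented population */
   tables censored;
   weight sweight;
run;
```

If the full data set is used, the percentages for CENSORED=1 should be close to the 57.01 and 69.08 reported in the two summary tables.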

Output 97.3.2 displays the estimated regression coefficients and their standard errors. The poverty indicator has two levels, and only one level is estimable. By default, PROC SURVEYPHREG estimates the first level (POVARIND 1) and assigns a zero value to the second level. The estimated regression coefficient is 0.385 with a standard error of 0.078. The estimated hazard for the poverty areas is 1.47 times the estimated hazard for the non-poverty areas. The degrees of freedom (33) are equal to the number of PSUs (55) minus the number of strata (22).

Output 97.3.2: Inference for the Entire Population

Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	t Value	Pr > |t|	Hazard Ratio
POVARIND 1	33	0.384961	0.077586	4.96	<.0001	1.470
POVARIND 2	33	0	.	.	.	1.000
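
The derived columns in Output 97.3.2 follow from the estimate and its standard error. As a worked check that uses only the values reported in the table (the symbol \hat{\beta} for the POVARIND 1 coefficient is notation introduced here):

```latex
\begin{align*}
\text{Hazard ratio} &= \exp(\hat{\beta}) = \exp(0.384961) \approx 1.470 \\
t &= \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})} = \frac{0.384961}{0.077586} \approx 4.96 \\
\text{DF} &= \text{number of PSUs} - \text{number of strata} = 55 - 22 = 33
\end{align*}
```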

Output 97.3.3 shows that 813 observation units in the sample are male, and they account for over 42.6 million males in the base year survey population. Approximately half of these observation units in the sample are censored, and an estimated 64.5% of the observation units are censored for the male subpopulation.

Output 97.3.3: Summary Statistics for the Male Subpopulation

The SURVEYPHREG Procedure

Domain Analysis for domain GENDER=1

Number of Observations Read	1891
Number of Observations Used	813
Sum of Weights Read	48887067
Sum of Weights Used	42629905

Summary of the Number of Event and Censored Values
Total	Event	Censored	Percent Censored
813	404	409	50.31

Summary of the Weighted Number of Event and Censored Values
Total	Event	Censored	Percent Censored
42629905	15126321	27503584	64.52

Output 97.3.4 shows that the estimated regression coefficient for POVARIND 1 is 0.425 with a standard error of 0.157. The estimated hazard for males in the poverty areas is 1.53 times the estimated hazard for males in the non-poverty areas. The degrees of freedom for the t test of significance for the male subpopulation are equal to the total number of PSUs (55) minus the total number of strata (22).

Output 97.3.4: Inference for the Male Subpopulation

Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	t Value	Pr > |t|	Hazard Ratio
POVARIND 1	33	0.424922	0.156583	2.71	0.0105	1.529
POVARIND 2	33	0	.	.	.	1.000

Output 97.3.5 displays summary statistics for the female subpopulation. There are 855 observation units for females in the sample, and they represent over 46.8 million females in the base year survey population. Although 63.4% of the females in the sample are alive, an estimated 73.2% of the females in the subpopulation are alive.

Output 97.3.5: Summary Statistics for the Female Subpopulation

The SURVEYPHREG Procedure

Domain Analysis for domain GENDER=2

Number of Observations Read	1891
Number of Observations Used	855
Sum of Weights Read	54091604
Sum of Weights Used	46809685

Summary of the Number of Event and Censored Values
Total	Event	Censored	Percent Censored
855	313	542	63.39

Summary of the Weighted Number of Event and Censored Values
Total	Event	Censored	Percent Censored
46809685	12524027	34285658	73.24

Output 97.3.6 shows that the estimated proportional hazards regression coefficient for POVARIND 1 for the female subpopulation (0.435) is slightly higher than the corresponding estimate for the male subpopulation (0.425). The estimated hazard for females in the poverty areas is 1.54 times the estimated hazard for females in the non-poverty areas. The degrees of freedom for the t test of significance for the female subpopulation are equal to the total number of PSUs (55) minus the total number of strata (22).

Output 97.3.6: Inference for the Female Subpopulation

Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	t Value	Pr > |t|	Hazard Ratio
POVARIND 1	33	0.434579	0.115766	3.75	0.0007	1.544
POVARIND 2	33	0	.	.	.	1.000
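
To compare the three analyses side by side, you can recompute the hazard ratio, t value, and two-sided p-value from the reported estimates and standard errors. The following DATA step is a sketch of that arithmetic. The data set domain_check and its variable names are hypothetical, and the input values are copied from Outputs 97.3.2, 97.3.4, and 97.3.6.

```sas
/* Sketch only: recompute derived columns from the reported estimates. */
/* DOMAIN_CHECK and its variable names are hypothetical.               */
data domain_check;
   input analysis :$8. estimate stderr;
   df           = 55 - 22;                  /* PSUs minus strata         */
   hazard_ratio = exp(estimate);            /* poverty vs. non-poverty   */
   t_value      = estimate / stderr;
   p_value      = 2 * (1 - probt(abs(t_value), df));  /* two-sided       */
   datalines;
Overall 0.384961 0.077586
Male    0.424922 0.156583
Female  0.434579 0.115766
;

proc print data=domain_check noobs;
run;
```

The printed values should agree, up to rounding, with the Hazard Ratio, t Value, and Pr > |t| columns in the three inference tables. The same 33 degrees of freedom are used for every row because the domain analyses use the total numbers of PSUs and strata.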