This example uses a data set from the NHANES I Epidemiologic Followup Study (NHEFS); see Example 100.2 for more information about the NHEFS.
For illustration purposes, 1,891 observations from the 1992 NHEFS vital and tracing status data set are used to estimate the regression coefficients of a proportional hazards model. The observations are obtained from 22 strata; each stratum contains either two or three primary sampling units (PSUs). The sum of observation weights for these selected units is almost 103 million. Observation weights range from 1,498 to 470,154 with a mean of 54,457.11 and a median of 45,246. Although this example uses the observation weights directly, Binder (1992) suggests that a scaled version of the observation weights would be useful to improve the performance of the optimization routine.
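As a minimal sketch of what a scaled weight could look like, the following statements divide each observation weight by the mean weight. Binder's exact scaling is not specified here, so dividing by the mean is an assumption; the names mean_wt, mortality_scaled, and sweight_scaled are hypothetical, and the statements apply to the mortality data set that this example creates below.

```sas
/* Assumption: scale by the mean weight; other scalings are possible */
proc sql noprint;
   select mean(sweight) into :mean_wt from mortality;
quit;

data mortality_scaled;
   set mortality;
   /* scaled weights have a mean of 1 */
   sweight_scaled = sweight / &mean_wt;
run;
```

To try it, you would analyze mortality_scaled and name sweight_scaled in the WEIGHT statement. Multiplying all weights by a positive constant should leave the weighted estimates of the regression coefficients unchanged, but the weighted counts in the summary tables would no longer estimate population totals.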
The following variables are created in the data set mortality:
ID, unit identification
VARSTRATA, stratum identification
VARPSU, identification for primary sampling units
SWEIGHT, sampling weight associated with each unit
AGE, the subject’s reported age at the 1992 interview if the subject was alive at that time; otherwise, the subject’s age at death
VITALSTATUS, vital status of subject in 1992 (1 = alive, 3 = dead, 4 = unknown, 5 = traced alive with direct subject contact, 6 = traced alive without direct subject contact)
POVARIND, indicator for poverty area where subject’s household was located at NHANES I (1971–1975) exam (1 = poverty area, 2 = non-poverty area)
GENDER, gender of subject (1 = male, 2 = female)
data mortality;
   input ID VARSTRATA VARPSU SWEIGHT AGE VITALSTATUS POVARIND GENDER;
   datalines;
1 03 1 13312 66 1 1 1
2 03 1 7941 71 3 1 2
3 03 1 16048 . 4 1 1
4 03 3 9298 58 3 1 1
5 03 2 15336 56 3 1 2
6 03 1 14744 63 1 1 1
7 03 2 83729 70 1 2 2
8 03 3 106492 57 1 2 1
9 03 3 78083 81 3 2 2
10 03 3 55957 79 3 2 1

   ... more lines ...

1890 13 1 88939 59 1 2 1
1891 13 1 59218 75 1 2 2
;
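Before fitting the model, it can be useful to confirm that the data set matches the weight summary quoted earlier (sum of almost 103 million, range 1,498 to 470,154, mean 54,457.11, median 45,246). The following sketch is not part of the original example; it simply requests those statistics for SWEIGHT.

```sas
/* Summarize the sampling weights for all 1,891 observations */
proc means data=mortality n sum min max mean median;
   var sweight;
run;
```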
Suppose you want to estimate the hazard function for mortality time after adjusting for the poverty area indicator in the base year survey population. The following SAS statements request a proportional hazards regression of age (AGE) on poverty indicator (POVARIND):
proc surveyphreg data = mortality nomcar;
   class povarind;
   strata varstrata;
   cluster varpsu;
   weight sweight;
   model age*vitalstatus(1 4 5 6) = povarind;
   domain gender;
run;
Subjects with a VITALSTATUS value of 1, 4, 5, or 6 are considered alive, so their observations are treated as censored. The CLASS statement specifies that POVARIND is a categorical variable, the WEIGHT statement identifies the sampling weights, the STRATA statement identifies variance strata, and the CLUSTER statement identifies variance PSUs. The DOMAIN statement requests three separate analyses: one for the overall data set, one for the male subpopulation, and one for the female subpopulation.
There are 223 observation units with missing values for AGE. All of these units have a vital status of 1, 4, 5, or 6, so these subjects are considered to be alive in the current survey year, 1992. The age of every observation unit in the base year survey is known from the 1971–1975 NHANES I. One reasonable approach is to determine the 1992 age of these 223 units from their age in the NHANES I data set. However, for illustration purposes, this example does not include the observation units with missing age when estimating the regression coefficients. Instead, an analysis of just the set of respondents is requested by specifying the NOMCAR option in the PROC SURVEYPHREG statement. This option uses a variance estimator that accounts for the random size of the set of respondents.
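The statement that every unit with a missing age has a vital status of 1, 4, 5, or 6 can be checked directly. The following sketch, which is not part of the original example, tabulates VITALSTATUS for only the observations whose AGE is missing.

```sas
/* Vital status of the observations with missing age */
proc freq data=mortality;
   where age = .;
   tables vitalstatus;
run;
```

If the description above holds for your copy of the data, the table should contain 223 observations in total and no row for VITALSTATUS 3.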
Output 100.3.1 shows summary statistics for the overall analysis. A total of 1,891 observations are read from the input DATA= data set, but only 1,668 observations are used in the analysis. The remaining 223 observations have missing values for the variable AGE. The respondent data set represents almost 89.5 million units in the population. There are 22 strata and 55 clusters. Although only 57% of the observation units in the sample are alive, an estimated 69% of the observation units in the population are alive. This difference is reasonable because the selection probabilities for the observation units are not the same. If you do not use the sampling weights, then your sample-based estimators might be biased for the corresponding finite population quantities. The "Variance Estimation" table indicates that the NOMCAR option is used for variance estimation.
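The gap between the sample percentage and the estimated population percentage comes from the weights. As a rough sketch (not part of the original example), the following statements compare unweighted and weighted percentages of subjects who are alive among the respondents. The data set name respondents and the indicator alive are hypothetical names introduced only for this illustration.

```sas
/* Keep respondents (nonmissing age) and flag subjects who are alive */
data respondents;
   set mortality;
   where age ne .;
   alive = (vitalstatus in (1, 4, 5, 6));
run;

/* Unweighted: percentage of sample units who are alive */
proc freq data=respondents;
   tables alive;
run;

/* Weighted: estimated percentage of population units who are alive */
proc freq data=respondents;
   tables alive;
   weight sweight;
run;
```

The unweighted table corresponds to the sample percentage and the weighted table to the estimated population percentage. PROC FREQ is used here only to compare point estimates; it does not account for the strata and clusters.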
Output 100.3.2 displays the estimated regression coefficients and their standard errors. The poverty area indicator, POVARIND, has two levels, and only one level is estimable. By default, PROC SURVEYPHREG estimates the coefficient for the first level (POVARIND 1) and assigns a value of zero to the second level. The estimated regression coefficient is 0.385 with a standard error of 0.078. The estimated hazard for the poverty areas is therefore 1.47 times the estimated hazard for the non-poverty areas. The degrees of freedom are equal to the number of PSUs (55) minus the number of strata (22).
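The hazard ratio and the degrees of freedom follow from simple arithmetic on the reported values: the hazard ratio is the exponential of the coefficient, and the degrees of freedom are the number of PSUs minus the number of strata. The following DATA step is only a sketch of that arithmetic; the variable names hr and df are hypothetical.

```sas
data _null_;
   hr = exp(0.385);   /* hazard ratio, poverty versus non-poverty areas */
   df = 55 - 22;      /* number of PSUs minus number of strata */
   put hr= 5.2 df=;   /* write both values to the SAS log */
run;
```

The hazard ratio rounds to 1.47, which matches the value quoted above, and the degrees of freedom equal 33.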
Output 100.3.3 shows that 813 observation units in the sample are male, and they account for over 42.6 million males in the base year survey population. Approximately half of these observation units in the sample are censored, and an estimated 64.5% of the observation units are censored for the male subpopulation.
Output 100.3.4 shows that the estimated regression coefficient for POVARIND 1 is 0.425 with a standard error of 0.157. The estimated hazard for the males in the poverty areas is 1.53 times the estimated hazard for the males in the non-poverty areas. The degrees of freedom for the t test of significance for the male subpopulation are equal to the total number of PSUs (55) minus the total number of strata (22).
Output 100.3.5 displays some summary statistics for the female subpopulation. There are 855 observation units for females in the sample, and they represent over 46.8 million females in the base year survey population. Although 63.4% of the females in the sample are alive, an estimated 73.2% of the females in the subpopulation are alive.
Output 100.3.6 shows that the estimated proportional hazards regression coefficient for POVARIND for the female subpopulation (0.435) is higher than the estimated coefficient for POVARIND for the male subpopulation (0.425). The estimated hazard for the females in the poverty areas is 1.54 times the estimated hazard for the females in the non-poverty areas. The degrees of freedom for the t test of significance for the female subpopulation are equal to the total number of PSUs (55) minus the total number of strata (22).