The SURVEYPHREG Procedure

Notation and Estimation

Let $U=\{ 1,2,\ldots ,N\} $ be the set of indices and let $\mathcal{F}_ N$ be the set of values for a finite population of size N. The survival time of each member of the finite population is assumed to follow its own hazard function, $\lambda _{i}(t)$, expressed as

\[  \lambda _{i}(t)=\lambda (t;{\bZ }_{i}(t)) = {\lambda _0}(t) \  \exp ({\bZ }’_{i}(t)\bbeta )  \]

where $\lambda _{0}(t)$ is an arbitrary and unspecified baseline hazard function, ${\bZ }_ i(t)$ is the vector of explanatory variables for the ith unit at time t, and $\bbeta $ is the vector of unknown regression parameters that are associated with the explanatory variables. The vector $\bbeta $ is assumed to be the same for all individuals.

The partial likelihood function introduced by Cox (1972, 1975) eliminates the unknown baseline hazard $\lambda _{0}(t)$ and accounts for censored survival times. If the entire population is observed, then this partial likelihood can be used to estimate $\bbeta $. Let $\bbeta _ N$ be the desired estimator. Assuming a working model with uncorrelated responses, $\bbeta _ N$ is obtained by maximizing the partial log likelihood,

\[  l(\bbeta ) = \sum _{i \in U} \log \biggl \{  \text {L}(\bbeta ; {\bZ }_ i(t), t_ i) \biggr \}   \]

with respect to $\bbeta $, where L$(\bbeta ;{\bZ }_ i(t),t_ i)$ is Cox’s partial likelihood function.

Assume that probability sample A is selected from the finite population U and $\pi _ i$ is the selection probability for unit i. Further assume that covariates $\bZ _ i(t)$ and survival time $t_ i$ are available for every unit in the sample A. An estimator of the finite population log likelihood is

\[  l_\pi (\bbeta ) = \sum _{i \in A} \pi _ i^{-1} \log \biggl \{  \text {L}(\bbeta ; {\bZ }_ i(t), t_ i) \biggr \}   \]

See Partial Likelihood Function for the Cox Model for more details.

A sample-based estimator $\hat{\bbeta }$ for the finite population quantity $\bbeta _ N$ can be obtained by maximizing the partial pseudo-log-likelihood $l_\pi (\bbeta ;{\bZ }_ i(t),t_ i)$ with respect to $\bbeta $. The design-based variance for $\hat\bbeta $ is obtained by assuming the set of finite population values $\mathcal{F}_ N$ as fixed. For more information about maximum pseudo-likelihood estimators and other inferential approaches for survey data, see Kish and Frankel (1974); Godambe and Thompson (1986); Pfeffermann (1993), Korn and Graubard (1999, chapter 3), Chambers and Skinner (2003, chapter 2), and Fuller (2009, section 6.5). Maximum pseudo-likelihood estimators and their properties for Cox’s proportional hazards model for survey data are discussed in Binder (1990, 1992); Lin and Wei (1989); Lin (2000); Boudreau and Lawless (2006).

Without loss of generality, the rest of this section uses indices for stratified clustered designs. For a stratified clustered sample design, observations are represented by a matrix

\[  (\mb {w, t,} \bDelta \mb {, Z}) = (w_{hij}, t_{hij}, \Delta _{hij}, \mb {z}_{hij})  \]

where

  • $\mb {w}$ denotes the vector of sampling weights

  • $\mb {t}$ denotes the event time variable

  • $\bDelta $ denotes the event indicator

  • $\mb {Z}$ denotes the $n\times p$ matrix of auxiliary information

  • $h=1, 2, \ldots , H$ is the stratum index

  • $i=1, 2, \ldots , n_ h$ is the cluster index within stratum h

  • $j=1, 2, \ldots , m_{hi}$ is the unit index within cluster i of stratum h

  • p is the total number of parameters

  • $n=\sum _{h=1}^ H \sum _{i=1}^{n_ h} {m_{hi}}$   is the total number of observations in the sample

  • $y_{hij}(t) = I(t_{hij} \ge t)$, where $I(\cdot )$ is an indicator function

  • $n_{hij}(t) = I(t_{hij} \le t)$, where $I(\cdot )$ is an indicator function

Let $\sum _ B = \sum _{(h,i,j) \in B}$ denote the summation over the set of indices such that the observation unit j in PSU i and stratum h belongs to the index set B. Typically, B is the set of all population indices that are in the sample, the risk set, or the set of all units with a failure.

The first-stage sampling rate (fraction of PSUs selected for the sample) is denoted by $f_ h$. The first-stage sampling rate is used in Taylor series variance estimation. You can specify the stratum sampling rates with the RATE= option. Or if you specify population totals with the TOTAL= option, PROC SURVEYFREQ computes $f_ h$ as the ratio of stratum sample size to the stratum total, in terms of PSUs. See the section Population Totals and Sampling Rates for details. If you do not specify the RATE= option or the TOTAL= option, then the procedure assumes that the stratum sampling rates $f_ h$ are negligible and does not use a finite population correction when computing variances.