The SURVEYSELECT Procedure

PPS Sampling without Replacement

If you specify the METHOD=PPS option, PROC SURVEYSELECT selects units with probability proportional to size and without replacement. The selection probability for unit i in stratum h equals $n_ h Z_{hi}$, where $n_ h$ is the sample size for stratum h, and $Z_{hi}$ is the relative size of unit i in stratum h. The relative size equals $M_{hi} / M_{h \cdot }$, which is the ratio of the size measure for unit i in stratum h ($M_{hi}$) to the total of all size measures for stratum h ($M_{h \cdot }$).
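As a small illustration of this formula, the following Python sketch computes $n_ h Z_{hi}$ for one stratum. The function name `selection_probabilities` and the size measures are hypothetical and are not part of PROC SURVEYSELECT; the sketch only restates the arithmetic described above.

```python
def selection_probabilities(sizes, n):
    """Return n * Z_i for each unit, where Z_i = M_i / (total of all M_i)."""
    total = sum(sizes)
    if total <= 0:
        raise ValueError("size measures must have a positive total")
    return [n * m / total for m in sizes]


# Hypothetical stratum: five units, stratum sample size n = 2.
sizes = [10, 20, 30, 40, 100]
probs = selection_probabilities(sizes, n=2)
for m, p in zip(sizes, probs):
    print(f"size measure {m:>3}: selection probability {p:.2f}")
```

In this made-up stratum the selection probabilities sum to the sample size, and the largest unit sits exactly at the upper limit of 1.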

Because selection probabilities cannot exceed 1, the relative size for each unit must not exceed $1/n_ h$ for METHOD=PPS. This requirement can be expressed as $Z_{hi} \leq 1/n_ h$, or equivalently, $M_{hi} \leq M_{h \cdot } / n_ h$. If your size measures do not meet this requirement, you can adjust the size measures by using the MAXSIZE= or MINSIZE= option, or you can request certainty selection for the larger units by using the CERTSIZE= or CERTSIZE=P= option. Alternatively, you can use a selection method that does not have this relative size restriction, such as PPS with minimum replacement (METHOD=PPS_SEQ).
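Before running a selection, it can be useful to check a frame against this requirement. The following Python sketch lists the units in one stratum whose size measure exceeds $M_{h \cdot } / n_ h$. The helper name `oversized_units` and the data are assumptions made for illustration; the sketch does not reproduce how the procedure itself reports violations.

```python
def oversized_units(sizes, n):
    """Return the 0-based indices of units with M_i > (total of all M_i) / n."""
    limit = sum(sizes) / n
    return [i for i, m in enumerate(sizes) if m > limit]


# Hypothetical stratum with one dominant unit and sample size n = 2.
sizes = [10, 20, 30, 40, 150]
print(oversized_units(sizes, n=2))  # [4], because 150 > 250 / 2
```

Any unit that such a check flags is a candidate for the size adjustment or certainty selection options described above.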

PROC SURVEYSELECT uses the Hanurav-Vijayan algorithm for PPS selection without replacement. Hanurav (1967) introduced this algorithm for the selection of two units per stratum, and Vijayan (1968) generalized it for the selection of more than two units. The algorithm enables computation of joint selection probabilities and provides joint selection probability values that usually ensure nonnegativity and stability of the Sen-Yates-Grundy variance estimator. For details, see Fox (1989), Golmant (1990), and Watts (1991).

Notation in the remainder of this section drops the stratum subscript h for simplicity, but selection is still done independently within strata if you specify a stratified design. For a stratified design, n now denotes the sample size for the current stratum, N denotes the stratum population size, and $M_ i$ denotes the size measure for unit i in the stratum. If the design is not stratified, this notation applies to the entire sampling frame.

According to the Hanurav-Vijayan algorithm, PROC SURVEYSELECT first orders units within the stratum in ascending order by size measure, so that $M_1 \leq M_2 \leq \ldots \leq M_ N$. Then the procedure selects the PPS sample of n observations as follows:

  1. The procedure randomly chooses one of the integers $ 1, 2, \ldots , n$ with probabilities $\theta _1, \theta _2, \ldots , \theta _ n$, respectively, where

    \[  \theta _ i = n (Z_{N-n+i+1} - Z_{N-n+i}) (T + i Z_{N-n+1}) / T  \]

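    where $Z_ j = M_ j / M$ is the relative size of unit j, $M = \sum _{j=1}^{N} M_ j$ is the total of the size measures in the stratum, and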

    \[  T = \sum _{j=1}^{N-n} Z_ j  \]

    By definition, $Z_{N+1} = 1/n$ to ensure that $\sum _{i=1}^{n} \theta _ i = 1$.

  2. If i is the integer selected in step 1, the procedure includes the last $(n-i)$ units of the stratum in the sample, where the units are ordered by size measure as described previously. The procedure then selects the remaining i units according to steps 3 through 6.

  3. The procedure defines new normed size measures for the remaining $(N-n+i)$ stratum units that were not selected in steps 1 and 2:

    \begin{equation*}  Z_ j^{\ast } = \begin{cases}  Z_ j / (T + i Z_{N-n+1}) &  \mathrm{for} \hspace{.1in} j = 1, \ldots , N-n+1 \\ Z_{N-n+1} / (T + i Z_{N-n+1}) &  \mathrm{for} \hspace{.1in} j = N-n+2, \ldots , N-n+i \\ \end{cases}\end{equation*}

  4. The procedure selects the next unit from the first $(N-n+1)$ stratum units with probability proportional to $a_ j(1)$, where

    \[  \begin{array}{lll} a_1(1) & =&  i Z_1^{\ast } \\[0.10in] a_ j(1) & =&  i Z_ j^{\ast } \prod _{k=1}^{j-1} \bigl (1 - (i-1)~ P_ k \bigr ) \quad \mathrm{for} \hspace{.1in} j=2,\ldots ,N-n+1 \end{array}  \]

    and

    \[  P_ k = M_ k / ( M_{k+1} + M_{k+2} + \cdots + M_{N-n+i} )  \]

  5. If stratum unit $j_1$ is the unit selected in step 4, then the procedure selects the next unit from units $(j_1+1)$ through $(N-n+2)$ with probability proportional to $a_ j(2,j_1)$, where

    \[  a_{j_1+1}(2,j_1) = (i-1) Z_{j_1+1}^{\ast }  \]

    \[  a_ j(2,j_1) = (i-1) Z_ j^{\ast } \prod _{k=j_1+1}^{j-1} \bigl ( 1 - (i-2) P_ k \bigr ) \quad \mathrm{for} \hspace{.1in} j = j_1+2,\ldots ,N-n+2  \]

  6. The procedure repeats step 5 until all n sample units are selected.
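The first three steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure's implementation: the function name `hanurav_vijayan_first_stage`, its return values, and the example size measures are all hypothetical, and steps 4 through 6 are deliberately left out. Positions in the output are 0-based and refer to the units after they are sorted in ascending order by size measure.

```python
import random


def hanurav_vijayan_first_stage(sizes, n, rng=None):
    """Sketch of steps 1-3: compute theta, draw i, and form the normed sizes Z*."""
    rng = rng or random.Random()
    m = sorted(sizes)                        # M_1 <= M_2 <= ... <= M_N
    N = len(m)
    if not 0 < n < N:
        raise ValueError("this sketch assumes 0 < n < N")
    total = sum(m)
    z = [x / total for x in m] + [1.0 / n]   # z[N] stands for Z_{N+1} = 1/n
    if z[N - 1] > 1.0 / n + 1e-12:
        raise ValueError("a relative size exceeds 1/n")

    T = sum(z[: N - n])                      # T = Z_1 + ... + Z_{N-n}
    z_cut = z[N - n]                         # Z_{N-n+1}

    # Step 1: theta_i for i = 1..n (1-based Z_j is stored in z[j - 1]).
    theta = [
        n * (z[N - n + i] - z[N - n + i - 1]) * (T + i * z_cut) / T
        for i in range(1, n + 1)
    ]
    i = rng.choices(range(1, n + 1), weights=theta)[0]

    # Step 2: the last (n - i) sorted units enter the sample directly.
    certain = list(range(N - (n - i), N))

    # Step 3: normed size measures for the remaining (N - n + i) units.
    denom = T + i * z_cut
    z_star = [z[j] / denom for j in range(N - n + 1)] + [z_cut / denom] * (i - 1)
    return theta, i, certain, z_star


# Hypothetical stratum of four units, sample size n = 2.
theta, i, certain, z_star = hanurav_vijayan_first_stage([4, 1, 3, 2], 2, random.Random(1))
print([round(t, 6) for t in theta])   # approximately [0.4, 0.6]
print(i, certain, round(sum(z_star), 6))
```

In this sketch both the $\theta _ i$ values and the normed size measures $Z_ j^{\ast }$ sum to 1, which is a convenient sanity check when you experiment with other size measures.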

If you specify the JTPROBS option, PROC SURVEYSELECT computes the joint selection probabilities for all pairs of selected units in each stratum. The joint selection probability for units i and j in the stratum equals

\[  P_{(ij)} = \sum _{r=1}^{n} \theta _ r K_{ij}^{(r)}  \]

where

\begin{equation*}  K_{ij}^{(r)} = \begin{cases}  1 &  N-n+r < i \leq N-1 \\ r Z_{N-n+1} / (T + r Z_{N-n+1}) &  N-n < i \leq N-n+r, ~ ~ ~  j > N-n+r \\ r Z_ i / (T + r Z_{N-n+1}) &  1 \leq i \leq N-n, ~ ~ ~  j > N-n+r \\ \pi _{ij}^{(r)} &  j \leq N-n+r \\ \end{cases}\end{equation*}

\[  \pi _{ij}^{(r)} = \frac{r(r-1)}{2} P_ i Z_ j \prod _{k=1}^{i-1} (1-P_ k)  \]

\[  P_ k = M_ k / ( M_{k+1} + M_{k+2} + \cdots + M_{N-n+r} )  \]
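Joint selection probabilities such as these are the inputs to the Sen-Yates-Grundy variance estimator mentioned earlier in this section. The following Python sketch shows the standard form of that estimator for an estimated total in a single stratum. The function name `syg_variance`, the dictionary layout for the joint probabilities, and the numbers are assumptions made for illustration; they do not correspond to any output data set structure of PROC SURVEYSELECT.

```python
from itertools import combinations


def syg_variance(y, pi, joint):
    """Sen-Yates-Grundy variance estimate for an estimated total.

    y     : observed values for the n sampled units
    pi    : selection probabilities of those units
    joint : dict mapping (i, j), i < j, to the joint selection probability
    """
    v = 0.0
    for i, j in combinations(range(len(y)), 2):
        pij = joint[(i, j)]
        v += (pi[i] * pi[j] - pij) / pij * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return v


# Hypothetical sample of two units with made-up probabilities.
pi = [0.2, 0.6]
joint = {(0, 1): 0.1}
print(syg_variance([10, 30], pi, joint))   # y proportional to pi: essentially zero
print(syg_variance([10, 40], pi, joint))   # positive when y / pi differs between units
```

Each term is nonnegative whenever the joint probability does not exceed the product of the two selection probabilities, which is the nonnegativity property referred to above.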