The SURVEYSELECT Procedure

Sample Output Data Set

PROC SURVEYSELECT selects a sample and creates a SAS data set that contains the sample of selected units unless you specify the NOSAMPLE option in the STRATA statement or the GROUPS= option in the PROC SURVEYSELECT statement. When you specify the NOSAMPLE option, PROC SURVEYSELECT allocates the total sample size among strata but does not select a sample; the output data set contains the allocated sample sizes. For more information, see the section Allocation Output Data Set. When you specify the GROUPS= option, PROC SURVEYSELECT randomly assigns observations to groups and does not select a sample. For more information, see the section Random Assignment Output Data Set.

You can specify the name of the sample output data set in the OUT= option in the PROC SURVEYSELECT statement. If you omit the OUT= option, the data set is named DATAn, where n is the smallest integer that makes the name unique.
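The default naming rule can be pictured with a short Python sketch. It is an illustration of the rule as stated above, not SAS code: the helper name `default_dataset_name` is hypothetical, and the sketch assumes that numbering starts at 1 and that names are compared without regard to case.

```python
def default_dataset_name(existing_names):
    """Return DATAn for the smallest n that does not clash with an existing name.

    Assumptions (not taken from the text above): n starts at 1, and
    data set names are compared case-insensitively.
    """
    taken = {name.upper() for name in existing_names}
    n = 1
    while f"DATA{n}" in taken:
        n += 1
    return f"DATA{n}"


# Hypothetical session in which DATA1 and DATA2 already exist.
next_name = default_dataset_name(["data1", "DATA2", "Customers"])
```

Naming the output explicitly with OUT= avoids having to track which DATAn data set holds which sample.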

The output data set contains the units that are selected for the sample. These units are either observations or groups of observations (clusters) that you define by specifying the SAMPLINGUNIT statement. If you do not specify the SAMPLINGUNIT statement to define units (clusters), then PROC SURVEYSELECT uses observations as sampling units by default.

By default, the output data set contains only those units that are selected for the sample. If you specify the OUTALL option, however, the output data set includes all observations from the input data set, together with a variable named Selected that indicates each observation’s selection status. Selected equals 1 for an observation that is selected for the sample and 0 for an observation that is not selected. The OUTALL option is available for equal probability selection methods.
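The difference between the default layout and the OUTALL layout can be pictured with a small Python sketch. This illustrates only the shape of the rows, not the procedure itself: the frame of ten IDs, the sample size, and the seed are hypothetical, and simple random sampling stands in for any equal probability method.

```python
import random

# Hypothetical frame of 10 observations (IDs are made up for illustration).
frame = list(range(1, 11))

rng = random.Random(2024)               # arbitrary seed so the sketch is repeatable
sample_ids = set(rng.sample(frame, 3))  # simple random sample without replacement

# Default layout: only the selected observations appear.
default_out = [{"ID": i} for i in frame if i in sample_ids]

# OUTALL-style layout: every observation appears, with a 0/1 Selected flag.
outall_out = [{"ID": i, "Selected": 1 if i in sample_ids else 0} for i in frame]
```

In the OUTALL-style layout, the rows where Selected equals 1 are the same rows that the default layout contains.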

By default, the output data set contains a single copy of each selected unit, even if the unit is selected more than once, and the variable NumberHits records the number of hits (selections) for each unit. A unit can be selected more than once if you use a with-replacement or with-minimum-replacement selection method (METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, or METHOD=PPS_SEQ). If you specify the OUTHITS option, the output data set instead includes a separate copy of the unit for each hit. For example, with the OUTHITS option, a unit that is selected three times is represented by three copies in the output data set.
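The two layouts can be compared in a short Python sketch. The sequence of hits below is hypothetical, and the sketch shows only how many rows each unit contributes; it does not reproduce the full set of variables that the procedure writes.

```python
from collections import Counter

# Hypothetical hits from a with-replacement draw; unit "B" is hit three times.
hits = ["B", "D", "B", "A", "B"]

# Default layout: one row per distinct selected unit, with a NumberHits count.
counts = Counter(hits)
default_out = [{"Unit": u, "NumberHits": k} for u, k in sorted(counts.items())]

# OUTHITS-style layout: one row per hit, so "B" contributes three rows.
outhits_out = [{"Unit": u} for u, k in sorted(counts.items()) for _ in range(k)]
```

Summing NumberHits over the default layout gives the same total as counting rows in the OUTHITS-style layout, so both carry the same information about the draw.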

The output data set also contains design information and selection statistics, depending on the selection method and output options you specify. The output data set can include the following variables:

  • Selected, which indicates whether or not the observation is selected for the sample. This variable is included if you specify the OUTALL option. Selected equals 1 for an observation that is selected for the sample, or 0 for an observation that is not selected.

  • STRATA variables, which you specify in the STRATA statement.

  • Replicate, which is the sample replicate number. This variable is included when you request replicated sampling with the REPS= option.

  • SAMPLINGUNIT (CLUSTER) variables, which you specify in the SAMPLINGUNIT statement.

  • ID variables, which you name in the ID statement.

  • CONTROL variables, which you specify in the CONTROL statement.

  • Zone, which is the selection zone. This variable is included for METHOD=PPS_SEQ.

  • SIZE variable, which you specify in the SIZE statement.

  • AdjustedSize, which is the adjusted size measure. This variable is included if you request adjusted sizes with the MINSIZE= or MAXSIZE= option when your sampling units are observations.

  • UnitSize, which is the sampling unit (or cluster) size measure. This variable is included if you specify the SAMPLINGUNIT statement.

  • Certain, which indicates certainty selection. This variable is included if you specify the CERTSIZE= or CERTSIZE=P= option. Certain equals 1 for units that are included with certainty because their size measures exceed the certainty size value or the certainty proportion; otherwise, Certain equals 0.

  • NumberHits, which is the number of hits (selections). This variable is included for selection methods that are with replacement or with minimum replacement (METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ).

The output data set includes the following variables if you request a PPS selection method or if you specify the STATS option in the PROC SURVEYSELECT statement for other methods:

  • ExpectedHits, which is the expected number of hits (selections). This variable is included for selection methods that are with replacement or with minimum replacement, where the same unit can be selected more than once (METHOD=URS, METHOD=PPS_WR, METHOD=PPS_SYS, and METHOD=PPS_SEQ).

  • SelectionProb, which is the probability of selection. This variable is included for selection methods that are without replacement.

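  • SamplingWeight, which is the sampling weight. This variable equals the inverse of ExpectedHits (for methods in which a unit can be selected more than once) or the inverse of SelectionProb (for methods that are without replacement).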
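The relationship between these three variables is a simple reciprocal, sketched below in Python. The helper name `sampling_weight` and the stratum counts are hypothetical; the sketch assumes an equal probability design, in which the selection probability (without replacement) and the expected number of hits (with replacement) both equal n/N.

```python
def sampling_weight(inclusion_measure: float) -> float:
    """Return the inverse of SelectionProb or ExpectedHits."""
    if inclusion_measure <= 0:
        raise ValueError("SelectionProb or ExpectedHits must be positive")
    return 1.0 / inclusion_measure


# Hypothetical stratum: 1000 units in the frame, 50 selections.
N, n = 1000, 50

selection_prob = n / N   # equal probability, without replacement
expected_hits = n / N    # equal probability, with replacement

weight_without_replacement = sampling_weight(selection_prob)
weight_with_replacement = sampling_weight(expected_hits)
```

Under these assumptions each selected unit represents N/n units of the frame, which is why the weights sum to approximately the frame size over the sample.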

If you specify the STATS or OUTSIZE option for METHOD=BERNOULLI, the output data set contains the following variables. If you specify a STRATA statement, the output data set includes stratum-level values of these variables; otherwise, the output data set includes overall values.

  • Total, which is the total number of sampling units.

  • SelectionProb, which is the selection probability that you specify by using the SAMPRATE= option.

  • ExpectedN, which is the expected value of the sample size.

  • SampleSize, which is the actual sample size.
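ExpectedN and SampleSize differ because Bernoulli sampling selects each unit independently with the same probability, so the realized sample size is random. The Python sketch below illustrates this; the frame size, rate, and seed are hypothetical, and the code imitates the idea of the method rather than the procedure's own random number stream.

```python
import random

total = 200            # Total: hypothetical number of sampling units
selection_prob = 0.1   # SelectionProb: hypothetical SAMPRATE= value

rng = random.Random(7)  # arbitrary seed so the sketch is repeatable

# Each unit is kept independently with probability selection_prob.
selected = [unit for unit in range(total) if rng.random() < selection_prob]

expected_n = total * selection_prob   # ExpectedN: expected sample size
sample_size = len(selected)           # SampleSize: size actually realized
```

Rerunning the sketch with a different seed changes sample_size but leaves expected_n unchanged.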

If you specify the STATS option for METHOD=BERNOULLI, the output data set also contains the following variable:

  • AdjSamplingWeight, which is the adjusted sampling weight.

For METHOD=PPS_BREWER and METHOD=PPS_MURTHY, either of which selects two units from each stratum with probability proportional to size, the output data set contains the following variable:

  • JtSelectionProb, which is the joint probability of selection for the two units selected from the stratum.

If you specify the JTPROBS option to compute joint probabilities of selection for METHOD=PPS or METHOD=PPS_SAMPFORD, then the output data set contains the following variables:

  • Unit, which is an identification variable that numbers the selected units sequentially within each stratum.

  • JtProb_1, JtProb_2, JtProb_3, …, where the variable JtProb_1 contains the joint probability of selection for the current unit and unit 1. Similarly, JtProb_2 contains the joint probability of selection for the current unit and unit 2, and so on.

If you specify the JTPROBS option for METHOD=PPS_WR, then the output data set contains the following variables:

  • Unit, which is an identification variable that numbers the selected units sequentially within each stratum.

  • JtHits_1, JtHits_2, JtHits_3, …, where the variable JtHits_1 contains the joint expected number of hits for the current unit and unit 1. Similarly, JtHits_2 contains the joint expected number of hits for the current unit and unit 2, and so on.

If you specify the OUTSIZE option, the output data set contains the following variables. If you specify a STRATA statement, the output data set includes stratum-level values of these variables; otherwise, the output data set includes overall values.

  • MinimumSize, which is the minimum size measure specified with the MINSIZE= option. This variable is included if you specify the MINSIZE= option.

  • MaximumSize, which is the maximum size measure specified with the MAXSIZE= option. This variable is included if you specify the MAXSIZE= option.

  • CertaintySize, which is the certainty size measure specified with the CERTSIZE= option. This variable is included if you specify the CERTSIZE= option.

  • CertaintyProp, which is the certainty proportion specified with the CERTSIZE=P= option. This variable is included if you specify the CERTSIZE=P= option.

  • Total, which is the total number of sampling units in the stratum. This variable is included if there is no SIZE statement, or if you specify a SAMPLINGUNIT statement.

  • TotalSize, which is the total of size measures in the stratum. This variable is included if there is a SIZE statement, or if you specify the PPS option in the SAMPLINGUNIT statement.

  • TotalAdjSize, which is the total of adjusted size measures in the stratum. This variable is included if you request adjusted sizes with the MAXSIZE= or MINSIZE= option.

  • SamplingRate, which is the sampling rate. This variable is included if you specify the SAMPRATE= option.

  • SampleSize, which is the sample size. This variable is included if you specify the SAMPSIZE= option, or if you specify METHOD=PPS_BREWER or METHOD=PPS_MURTHY, either of which selects two units from each stratum.

  • Interval, which is the specified systematic interval. This variable is included if you specify the INTERVAL= option for METHOD=SYS or METHOD=PPS_SYS.

  • NCertain, which is the number of certainty units. This variable is included if you specify the CERTSIZE= or CERTSIZE=P= option and CERTUNITS=OUTPUT.

If you specify the OUTSEED option, the output data set contains the following variable:

  • InitialSeed, which is the initial seed for the stratum.

If you specify the ALLOC= option in the STRATA statement, the output data set contains the following variables:

  • Total, which is the total number of sampling units in the stratum.

  • Variance, which is the stratum variance. This variable is included if you specify the VAR, VAR=(values), or VAR=SAS-data-set option for the ALLOC=OPTIMAL, ALLOC=NEYMAN, or MARGIN= allocation option.

  • Cost, which is the stratum cost. This variable is included if you specify the COST, COST=(values), or COST=SAS-data-set option for ALLOC=OPTIMAL.

  • AllocProportion, which is the target allocation proportion (the proportion of the total sample size to allocate to the stratum). PROC SURVEYSELECT computes this proportion by using the specified allocation method.

  • SampleSize, which is the sample size allocated to the stratum.

  • ActualProportion, which is the actual proportion allocated to the stratum. The value of ActualProportion equals the allocated stratum sample size divided by the total sample size. This value can differ from the target AllocProportion due to rounding and other restrictions. See the section Sample Size Allocation for details.
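The relationship between AllocProportion, SampleSize, and ActualProportion can be checked with a short Python sketch. The three strata and their values are hypothetical, the allocated sample sizes are taken as given rather than derived (the sketch does not reproduce the procedure's rounding rules), and the Difference field is an illustrative name that is not part of the output data set.

```python
# Hypothetical allocation results for three strata (illustrative values only).
strata = [
    {"Stratum": "A", "AllocProportion": 0.45, "SampleSize": 5},
    {"Stratum": "B", "AllocProportion": 0.35, "SampleSize": 3},
    {"Stratum": "C", "AllocProportion": 0.20, "SampleSize": 2},
]

# Total sample size across all strata.
total_n = sum(s["SampleSize"] for s in strata)

for s in strata:
    # ActualProportion: allocated stratum sample size divided by total sample size.
    s["ActualProportion"] = s["SampleSize"] / total_n
    # Difference (hypothetical helper field): gap between actual and target.
    s["Difference"] = s["ActualProportion"] - s["AllocProportion"]
```

Because stratum sample sizes must be whole numbers, the actual proportions here cannot match the targets of 0.45 and 0.35 exactly when the total sample size is 10.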