The SURVEYSELECT Procedure

STRATA Statement

  • STRATA variables < / options> ;

You can specify a STRATA statement to request stratified sampling. The STRATA statement names one or more variables that partition the input data set into nonoverlapping groups (strata). The combinations of levels of the STRATA variables define the strata. PROC SURVEYSELECT selects samples independently from the strata according to the selection method and design parameters that you specify in the PROC SURVEYSELECT statement. For information about stratification in sample design, see Lohr (2010), Kalton (1983), Kish (1965), Kish (1987), and Cochran (1977).

The STRATA variables are one or more variables in the DATA= input data set. These variables can be either character or numeric, but PROC SURVEYSELECT treats them as categorical variables. The formatted values of the STRATA variables determine the STRATA variable levels. Thus, you can use formats to group values into levels. For more information, see the FORMAT procedure in the Base SAS Procedures Guide and the FORMAT statement and SAS formats in SAS Formats and Informats: Reference.
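As a minimal sketch of grouping values into levels with a format, the following steps stratify on three age groups. The data set name Customers, the numeric variable Age, the format name agegrp, and the sample size and seed are all hypothetical.

```sas
/* Hypothetical format that groups a numeric Age variable into three levels */
proc format;
   value agegrp low -< 30  = 'Under 30'
                30  -< 60  = '30 to 59'
                60  - high = '60 and over';
run;

/* Sort by the raw values so that each formatted level is contiguous */
proc sort data=Customers out=CustByAge;
   by Age;
run;

/* Each formatted level of Age is one stratum.                    */
/* Without ALLOC=, SAMPSIZE=10 requests 10 units from each stratum */
proc surveyselect data=CustByAge method=srs sampsize=10 seed=1234
                  out=AgeSample;
   strata Age;
   format Age agegrp.;
run;
```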

The STRATA variables function much like BY variables, and PROC SURVEYSELECT expects the input data set to be sorted by the STRATA variables. The BY statement options DESCENDING and NOTSORTED are available in the STRATA statement. For more information about these BY statement options, see SAS Language Reference: Concepts.

If you specify a CONTROL statement or METHOD=PPS in the PROC SURVEYSELECT statement, the input data set must be sorted by the STRATA variables in ascending order. In this case, you cannot specify the NOTSORTED or DESCENDING option in the STRATA statement.

If your input data set is not sorted by the STRATA variables, use one of the following alternatives:

  • Sort the data by using the SORT procedure with the STRATA variables in a BY statement.

  • Specify the NOTSORTED or DESCENDING option in the STRATA statement (if you do not specify a CONTROL statement or METHOD=PPS in the PROC SURVEYSELECT statement). The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the STRATA variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

  • Create an index on the STRATA variables by using the DATASETS procedure (in Base SAS software).

For more information about BY-group processing, see the discussion in SAS Language Reference: Concepts. For more information about the DATASETS procedure, see the discussion in the Base SAS Procedures Guide.
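The sorting and indexing alternatives above can be sketched as follows. The data set Customers and the stratification variables State and Type are assumed names for illustration, and you would normally use only one of the two alternatives.

```sas
/* Alternative 1: sort a copy of the input data by the STRATA variables */
proc sort data=Customers out=CustSorted;
   by State Type;
run;

proc surveyselect data=CustSorted method=srs sampsize=5 seed=1234
                  out=SampleFromSorted;
   strata State Type;
run;

/* Alternative 2: create a composite index on the STRATA variables */
/* (StateType is a hypothetical index name)                        */
proc datasets library=work nolist;
   modify Customers;
   index create StateType = (State Type);
quit;

proc surveyselect data=Customers method=srs sampsize=5 seed=1234
                  out=SampleFromIndexed;
   strata State Type;
run;
```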

Table 115.2 summarizes the options available in the STRATA statement. Descriptions of the options follow in alphabetical order.

Table 115.2: STRATA Statement Options for Sample Allocation

  Option            Description
  ALLOC=name        Specifies the allocation method
  ALLOC=(values)    Provides allocation proportions
  ALLOCMIN=         Specifies the minimum sample size per stratum
  ALPHA=            Specifies the confidence level for the MARGIN= option
  COST=             Provides stratum costs
  MARGIN=           Specifies the margin of error
  NOSAMPLE          Allocates but does not select the sample
  STATS             Displays additional allocation statistics
  VAR=              Provides stratum variances


You can specify the following options in the STRATA statement after a slash (/):

ALLOC=name | (values) | SAS-data-set

specifies the allocation method name or specifies the stratum allocation proportions as a list of values or a SAS-data-set. You can use the ALLOC= option with any selection method (which you specify in the PROC SURVEYSELECT statement) except METHOD=PPS_BREWER and METHOD=PPS_MURTHY, either of which selects two units from each stratum.

You can specify the sample size allocation by using one of the following forms:

ALLOC=name

specifies the method for allocating the total sample size among the strata. You can specify one of the following values for name:

NEYMAN

requests Neyman allocation, which allocates the total sample size among the strata in proportion to the stratum sizes and variances. For more information, see the section Neyman Allocation. If you specify ALLOC=NEYMAN, you must provide the stratum variances by also specifying the VAR= option.

OPTIMAL
OPT

requests optimal allocation, which allocates the total sample size among the strata in proportion to the stratum sizes, stratum variances, and stratum costs. For more information, see the section Optimal Allocation. If you specify ALLOC=OPTIMAL, you must provide the stratum variances by also specifying the VAR= option, and you must provide the stratum costs by also specifying the COST= option.

PROPORTIONAL
PROP

requests proportional allocation, which allocates the total sample size in proportion to the stratum sizes, where stratum size is the number of sampling units in the stratum. For more information, see the section Proportional Allocation.
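The following sketch requests proportional allocation of a total sample size. It assumes a hypothetical data set Customers that is already sorted by a stratification variable named Region; the sample size and seed are arbitrary.

```sas
/* Customers and Region are assumed names; the data are assumed */
/* to be sorted by Region in ascending order.                   */
/* With ALLOC=, SAMPSIZE= is the total sample size to allocate  */
/* among the strata.                                            */
proc surveyselect data=Customers method=srs sampsize=200 seed=1234
                  out=PropSample;
   strata Region / alloc=proportional;
run;
```

For example, if the three strata contained 5,000, 3,000, and 2,000 sampling units, allocation in proportion to stratum size would assign 100, 60, and 40 of the 200 units.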

ALLOC=(values)

specifies a list of stratum allocation proportion values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. Each value should correspond to a stratum group, and the number of values must equal the number of strata in the input data set.

A stratum allocation proportion specifies the proportion of the total sample size to allocate to the stratum. The sum of the allocation proportions must be 1 or 100%.

The allocation proportions must be positive numbers. You can specify the values as proportions (numbers between 0 and 1) or as percentages (numbers between 1 and 100), which PROC SURVEYSELECT converts to proportions. PROC SURVEYSELECT treats the value 1 as 100% instead of 1%.

The order of the stratum allocation proportions must match the order of the stratum groups in the DATA= input data set. When you specify a list of proportion values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.
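As an illustration of the list form, the following sketch allocates half of the total sample to the first stratum. The data set Customers, the variable Region, and the assumption of exactly three strata are hypothetical.

```sas
/* Assumes Customers is sorted by Region in ascending order and  */
/* has exactly three strata, so the list has three values that   */
/* sum to 1. The values apply to the strata in sorted order.     */
proc surveyselect data=Customers method=srs sampsize=200 seed=1234
                  out=ListAllocSample;
   strata Region / alloc=(0.5 0.3 0.2);
run;
```

The same proportions could be written in percentage form as alloc=(50 30 20).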

ALLOC=SAS-data-set

names a SAS-data-set that contains stratum allocation proportions. You should provide the stratum allocation proportions in the data set variable named _ALLOC_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

A stratum allocation proportion specifies the proportion of the total sample size to allocate to the corresponding stratum. The sum of the allocation proportions must be 1 or 100%.

The allocation proportions must be positive numbers. You can specify the values as proportions (numbers between 0 and 1) or as percentages (numbers between 1 and 100), which PROC SURVEYSELECT converts to proportions. PROC SURVEYSELECT treats the value 1 as 100% instead of 1%.

The ALLOC= data set, which is a secondary input data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the ALLOC= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent between the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary data set in each invocation of PROC SURVEYSELECT.
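A secondary data set of allocation proportions might be built and used as in the following sketch. The data set names AllocProps and Customers, the variable Region, and the stratum values are invented for illustration; only the variable name _ALLOC_ comes from the description above.

```sas
/* One observation per stratum, in the same order as in DATA= */
data AllocProps;
   input Region $ _ALLOC_;
   datalines;
East  0.50
North 0.30
West  0.20
;

/* Assumes Customers is sorted by Region and contains only */
/* the strata East, North, and West                        */
proc surveyselect data=Customers method=srs sampsize=200 seed=1234
                  out=DataAllocSample;
   strata Region / alloc=AllocProps;
run;
```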

ALLOCMIN=n

specifies the minimum sample size to allocate to a stratum. If you specify ALLOCMIN=n, PROC SURVEYSELECT allocates at least n sampling units to each stratum.

The minimum stratum sample size n must be a positive integer. The value of n times the number of strata must not exceed the total sample size to be allocated. For without-replacement selection methods, the value of n must not exceed the number of sampling units in any stratum.

By default, PROC SURVEYSELECT allocates at least one sampling unit to each stratum.
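For instance, the following sketch combines proportional allocation with a floor of 10 units per stratum. The data set Customers and the variable Region are assumed names, and the example presumes that every stratum contains at least 10 sampling units.

```sas
/* With three strata, 10 x 3 = 30 does not exceed SAMPSIZE=200, */
/* so the ALLOCMIN= requirement can be satisfied                */
proc surveyselect data=Customers method=srs sampsize=200 seed=1234
                  out=MinAllocSample;
   strata Region / alloc=proportional allocmin=10;
run;
```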

ALPHA=$\alpha$

specifies the confidence level that PROC SURVEYSELECT uses in the MARGIN= computations. For more information, see the section Specifying the Margin of Error.

The value of $\alpha$ must be between 0 and 1; specifying ALPHA=$\alpha$ produces a $100(1 - \alpha)$% confidence interval. By default, ALPHA=0.05, which produces a 95% confidence interval.

COST < =(values) | SAS-data-set >

specifies the stratum-level costs that PROC SURVEYSELECT uses to compute optimal allocation when you specify ALLOC=OPTIMAL. For more information, see the section Optimal Allocation. The stratum costs must be positive numbers. A stratum cost represents the per-unit cost, which is the survey cost of a single unit in the stratum.

You can provide stratum costs by specifying one of the following forms:

COST

indicates that stratum costs are provided in a secondary input data set that you name in another option (for example, the VAR=SAS-data-set option). You should provide the stratum costs in the data set variable named _COST_. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

COST=(values)

specifies a list of stratum cost values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. Each value should correspond to a stratum group, and the number of values must equal the number of strata in the input data set.

The order of the stratum cost values must match the order of the stratum groups in the DATA= input data set. When you specify a list of values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.
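A minimal sketch of optimal allocation with cost and variance lists follows. The data set Customers, the variable Region, and all numeric values are hypothetical; the example assumes three strata in ascending sorted order.

```sas
/* ALLOC=OPTIMAL requires both VAR= and COST=.                   */
/* The ith value in each list applies to the ith stratum in the  */
/* sorted input data.                                            */
proc surveyselect data=Customers method=srs sampsize=200 seed=1234
                  out=OptimalSample;
   strata Region / alloc=optimal
                   var=(25 100 64)
                   cost=(4 9 1);
run;
```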

COST=SAS-data-set

names a SAS-data-set that contains the stratum costs. You should provide the stratum costs in the data set variable named _COST_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the COST= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

MARGIN=value

specifies the margin of error for the estimate of the overall mean from the stratified sample. When you specify this option, PROC SURVEYSELECT determines the stratum sample sizes that achieve the margin value by using the allocation method or proportions that you specify in the ALLOC= option. For more information, see the section Specifying the Margin of Error.

The value must be a positive number. When you specify this option, you must also provide the stratum variances in the VAR= option.

You can use the ALPHA= option to specify the confidence level for the MARGIN= computations. By default, ALPHA=0.05, which produces a 95% confidence interval.

You can specify the MARGIN= option with any allocation method (proportional, optimal, or Neyman) or with allocation proportions that you provide (ALLOC=(values) or ALLOC=SAS-data-set).

Allocation to achieve a specified margin is an alternative approach to the allocation of a specified total sample size. Therefore, when you specify the MARGIN= option, you cannot also specify a total sample size in the SAMPSIZE= option in the PROC SURVEYSELECT statement.
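The following sketch requests an allocation that targets a margin of error instead of a fixed total sample size. The data set Customers, the variable Region, the variances, and the margin value are assumptions made for illustration.

```sas
/* No SAMPSIZE= option appears in the PROC statement because    */
/* MARGIN= determines the stratum sample sizes.                 */
/* ALPHA=0.1 corresponds to a 90% confidence interval.          */
proc surveyselect data=Customers method=srs seed=1234
                  out=MarginSample;
   strata Region / alloc=neyman
                   var=(25 100 64)
                   margin=0.5
                   alpha=0.1;
run;
```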

NOSAMPLE

requests that PROC SURVEYSELECT not select a sample after computing the allocation. When you specify this option, the OUT= output data set contains the stratum sample sizes that PROC SURVEYSELECT computes. For more information, see the section Allocation Output Data Set. (By default, PROC SURVEYSELECT selects a sample after computing the allocation.)
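To inspect an allocation before drawing any sample, a step such as the following sketch could be used. The data set names and the variable Region are hypothetical.

```sas
/* NOSAMPLE stops after the allocation, so OUT= holds the      */
/* computed stratum sample sizes instead of selected units     */
proc surveyselect data=Customers method=srs sampsize=200
                  out=AllocOnly;
   strata Region / alloc=proportional nosample;
run;

proc print data=AllocOnly;
run;
```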

STATS

displays sample allocation statistics. When you specify the MARGIN= option, the STATS option displays the expected margin of error for the allocation. For more information, see the section Specifying the Margin of Error. When you specify ALLOC=OPTIMAL or ALLOC=NEYMAN but do not specify the MARGIN= option, the STATS option displays the expected variance, which is computed from the stratum variances that you provide and the allocated stratum sample sizes. When you specify ALLOC=OPTIMAL, the STATS option also displays the total stratum-level cost, which is computed from the stratum costs that you provide and the allocated stratum sample sizes.

VAR < =(values) | SAS-data-set >

specifies the stratum variances that PROC SURVEYSELECT uses to compute optimal allocation (ALLOC=OPTIMAL), Neyman allocation (ALLOC=NEYMAN), or allocation for a specified margin (MARGIN=). The stratum variances must be positive numbers.

You can provide stratum variances by specifying one of the following forms:

VAR

indicates that stratum variances are provided in a secondary input data set that you name in another option (for example, the COST=SAS-data-set option). You should provide the stratum variances in the data set variable named _VAR_. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.

VAR=(values)

specifies a list of stratum variance values. You can separate the values with blanks or commas, and you must enclose the list of values in parentheses. Each value should correspond to a stratum group, and the number of values must equal the number of strata in the input data set.

The order of the stratum variance values must match the order of the stratum groups in the DATA= input data set. When you specify a list of values, the input data set must be sorted by the STRATA variables in ascending order; you cannot use the DESCENDING or NOTSORTED option in the STRATA statement.
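As a sketch of the list form with Neyman allocation, the following step supplies one variance per stratum and also requests allocation statistics. The data set Customers, the variable Region, and the variance values are invented for illustration.

```sas
/* Assumes three strata, sorted by Region in ascending order.   */
/* STATS displays the expected variance for this allocation.    */
proc surveyselect data=Customers method=srs sampsize=200 seed=1234
                  out=NeymanSample;
   strata Region / alloc=neyman var=(25 100 64) stats;
run;
```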

VAR=SAS-data-set

names a SAS-data-set that contains the stratum variances. You should provide the stratum variances in the data set variable named _VAR_. Each observation in the data set should correspond to a stratum group, which is determined by the values of the STRATA variables.

This data set, which is a secondary data set, must contain all stratification variables that you specify in the STRATA statement. The data set must also contain all stratum groups that appear in the DATA= input data set. The order of the stratum groups in the VAR= data set must match the order of the groups in the DATA= data set. If formats are associated with the STRATA variables, the formats must be consistent in the two data sets. For more information, see the section Secondary Input Data Set. You can name only one secondary input data set in each invocation of PROC SURVEYSELECT.
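Because only one secondary input data set can be named, stratum variances and costs can share one data set, as in the following sketch. The data set names StratumInfo and Customers, the variable Region, and all values are hypothetical; the variable names _VAR_ and _COST_ come from the descriptions above.

```sas
/* One observation per stratum, in the same order as in DATA= */
data StratumInfo;
   input Region $ _VAR_ _COST_;
   datalines;
East   25 4
North 100 9
West   64 1
;

/* VAR= names the secondary data set; the COST keyword alone  */
/* says that the costs are in the same data set               */
proc surveyselect data=Customers method=srs sampsize=200 seed=1234
                  out=OptimalFromData;
   strata Region / alloc=optimal var=StratumInfo cost stats;
run;
```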