Random Sampling Task

About the Random Sampling Task

The Random Sampling task is a high-performance procedure that performs either simple random sampling or stratified sampling. The output from this task includes an output data set that contains the sample data, a table with performance information, and a table with frequency information for the population and the sample.

Assigning Data to Roles

If you want to perform stratified sampling, you must assign a column to the Stratify by role. Otherwise, the Stratify by role is optional.
Role
Description
Stratify by
specifies the variables to use to partition the input table into mutually exclusive, nonoverlapping subsets that are known as strata. Each stratum is defined by a set of values of the strata variables, and each stratum is sampled separately. The complete sample is the union of the samples that are taken from all the strata.
Note: If you do not assign any variables to this role, then the entire input table is treated as a single stratum.
You can allocate the total sample size among the strata in proportion to the size of each stratum. For example, suppose that the variable GENDER has possible values of M and F, and the variable VOTED has possible values of Y and N. If you assign both GENDER and VOTED to the Stratify by role, then the input table is partitioned into four strata: males who voted, males who did not vote, females who voted, and females who did not vote.
The input table contains 20,000 rows, and the values are distributed as follows:
  • 7,000 males who voted
  • 4,000 males who did not vote
  • 5,000 females who voted
  • 4,000 females who did not vote
Therefore, the proportion of males who voted is 7,000/20,000 = 0.35, or 35%. The proportions in the sample should reflect the proportions of the strata in the input table. For example, if your sample table contains 100 observations, then 35 of those observations (35%) must be selected from the stratum of males who voted.
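The arithmetic in this example can be checked with a short sketch. The Python code below is an illustration only and is not how the task is implemented; the function name proportional_allocation is hypothetical, and rounding to the nearest whole row is an assumption.

```python
# Stratum sizes from the example: (GENDER, VOTED) -> rows in the input table.
strata_sizes = {
    ("M", "Y"): 7000,
    ("M", "N"): 4000,
    ("F", "Y"): 5000,
    ("F", "N"): 4000,
}


def proportional_allocation(sizes, sample_size):
    """Split sample_size across strata in proportion to each stratum's size."""
    total = sum(sizes.values())
    # Assumption: round to the nearest whole row. In general, rounded
    # allocations might not add up exactly to sample_size.
    return {key: round(sample_size * n / total) for key, n in sizes.items()}


allocation = proportional_allocation(strata_sizes, 100)
for stratum, n in allocation.items():
    print(stratum, n)
```

For a sample of 100 observations, the males-who-voted stratum receives 35 rows, which matches the 35% proportion in the input table.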

Creating the Output Data Set

By default, the output data set is saved in the Work library. You can select the numeric and character variables from the input data set to include in the output data. To produce an output table with the same number of rows as the input table, select the Include all input observations and a sampling indicator variable option. In this case, the output table has an additional partition indicator variable (_PARTIND_) that indicates whether an observation is included in the sample (1) or not (0).
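The effect of the sampling indicator can be pictured with a minimal Python sketch. Only the _PARTIND_ column name comes from the description above; the function name add_sampling_indicator, the id field, and the default seed are hypothetical, and Python's random module stands in for whatever generator the task uses.

```python
import random


def add_sampling_indicator(rows, sample_size, seed=12345):
    """Return every input row with a _PARTIND_ flag: 1 if sampled, else 0."""
    rng = random.Random(seed)
    # Pick sample_size distinct row positions without replacement.
    chosen = set(rng.sample(range(len(rows)), sample_size))
    return [
        dict(row, _PARTIND_=1 if i in chosen else 0)
        for i, row in enumerate(rows)
    ]


rows = [{"id": i} for i in range(10)]
flagged = add_sampling_indicator(rows, 3)
```

The result has the same number of rows as the input, and exactly three of them carry _PARTIND_ = 1.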

Setting Options

Option Name
Description
Methods
Sample by
specifies the sample size in the desired number of rows or in the desired percentage of input rows. For example, if you specify 3% of rows and there are 400 input rows, then the resulting sample has 12 rows.
Note: If you assign variables to the Stratify by role, then the sample size specification that you make here applies to each stratum rather than to the entire input table.
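The following Python sketch works through the percentage calculation, both for a whole table and for each stratum separately. It is an illustration under assumptions: the function name sample_size_from_percent is hypothetical, the stratum sizes are borrowed from the Stratify by example, and rounding to the nearest row is assumed because this section does not describe how fractional results are handled.

```python
def sample_size_from_percent(n_rows, percent):
    """Number of rows to sample when the size is given as a percentage."""
    # Assumption: fractional results are rounded to the nearest row.
    return round(n_rows * percent / 100)


# Whole-table case from the text: 3% of 400 input rows.
whole_table = sample_size_from_percent(400, 3)

# Stratified case: the same 3% is applied to each stratum separately.
strata_sizes = {"M/Y": 7000, "M/N": 4000, "F/Y": 5000, "F/N": 4000}
per_stratum = {
    name: sample_size_from_percent(n, 3) for name, n in strata_sizes.items()
}
```

Applying the same percentage to each stratum keeps the strata in the sample in the same proportions as in the input table.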
Random seed
specifies the initial seed for the generation of random numbers. If you set this value to zero or a negative number, then a seed that is based on the system clock is used to produce the sample.
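The seed rule can be sketched in Python as follows. This is an analogy rather than the task's actual generator: the function name make_rng is hypothetical, and time.time_ns() merely stands in for "a seed that is based on the system clock".

```python
import random
import time


def make_rng(seed):
    """Build a random number generator that follows the seed rule above."""
    if seed <= 0:
        # Zero or negative: fall back to a clock-based seed, so repeated
        # runs generally produce different samples.
        seed = time.time_ns()
    return random.Random(seed)


# A positive seed makes the sample reproducible from run to run.
first = make_rng(42).sample(range(400), 12)
second = make_rng(42).sample(range(400), 12)
```

Specify a positive seed when you need to reproduce the same sample later.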
Ignore case of character stratification values
distinguishes values of the stratification variables that share the same normalized value when you perform stratified sampling. For example, if a target has three distinct values, “A”, “B”, and “b”, and you want to treat “B” and “b” as different levels, you need to select this option. Otherwise, “B” and “b” are treated as the same level. The task normalizes a value as follows:
  1. Leading blanks are removed.
  2. The value is truncated to 32 characters.
  3. Letters are changed from lowercase to uppercase.
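The three normalization steps can be expressed as a short Python function. This is a sketch of the steps as listed, not the task's own code; the function name normalize is hypothetical, and treating "blanks" as space characters is an assumption.

```python
def normalize(value):
    """Apply the three listed steps in order to a character value."""
    value = value.lstrip(" ")  # 1. remove leading blanks
    value = value[:32]         # 2. truncate to 32 characters
    return value.upper()       # 3. change lowercase letters to uppercase


# "B" and "b" share the same normalized value, so they form one level
# whenever normalized values are compared.
levels = {normalize(v) for v in ["A", "B", "b", "  b"]}
```

After normalization, the four raw values collapse to two levels, A and B.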