Assigning Data to Roles

For the Select Random Sample task, you must select an input data source. To filter the input data source, click the Filter icon.
To run the task, you must specify a sample size for the output table. No roles are required to run the task.
Role
Description
Roles
Stratify by
specifies the variables to use to partition the input table into mutually exclusive subsets that are known as strata. Each stratum is defined by a unique combination of values of the strata variables, and each stratum is sampled separately. The complete sample is the union of the samples that are taken from all the strata.
Note: If you do not assign any variables to this role, then the entire input table is treated as a single stratum.
This example shows how the total sample size is divided among the strata in proportion to the size of each stratum. For this example, the variable GENDER has possible values of M and F, and the variable VOTED has possible values of Y and N. If you assign both GENDER and VOTED to the Stratify by role, then the input table is partitioned into four strata: males who voted, males who did not vote, females who voted, and females who did not vote.
The input table contains 20,000 rows, and the values are distributed as follows:
  • 7,000 males who voted
  • 4,000 males who did not vote
  • 5,000 females who voted
  • 4,000 females who did not vote
Therefore, the proportion of males who voted is 7,000/20,000 = 0.35, or 35%. The proportions in the sample should reflect the proportions of the strata in the input table. For example, if your sample table contains 100 observations, then 35 of those observations (35%) must be selected from the males-who-voted stratum. By the same calculation, 20 observations come from males who did not vote, 25 from females who voted, and 20 from females who did not vote.
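The proportional allocation described in this example can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code that the task generates. The function names proportional_allocation and stratified_sample are hypothetical, and the largest-remainder rounding that is used when the shares are not whole numbers is an assumption, because this section does not state how fractional shares are handled.

```python
import random
from collections import defaultdict


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size among strata in proportion to each stratum's size.

    Rounding rule (an assumption): every stratum gets the whole-number part
    of its share, and leftover units go to the largest fractional remainders.
    """
    total = sum(stratum_sizes.values())
    alloc = {}
    remainders = {}
    for key, size in stratum_sizes.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        alloc[key], remainders[key] = divmod(sample_size * size, total)
    leftover = sample_size - sum(alloc.values())
    for key in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[key] += 1
    return alloc


def stratified_sample(rows, strata_vars, sample_size, seed=None):
    """Sample each stratum separately and return the union of the samples."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        # A stratum is one combination of values of the strata variables.
        strata[tuple(row[v] for v in strata_vars)].append(row)
    sizes = {key: len(members) for key, members in strata.items()}
    alloc = proportional_allocation(sizes, sample_size)
    sample = []
    for key, members in strata.items():
        sample.extend(rng.sample(members, alloc[key]))
    return sample


# Stratum counts taken from the example above (20,000 rows in total).
counts = {("M", "Y"): 7000, ("M", "N"): 4000, ("F", "Y"): 5000, ("F", "N"): 4000}
allocation = proportional_allocation(counts, 100)
print(allocation)
```

For the counts in this example, the shares are whole numbers, so the allocation is 35, 20, 25, and 20 observations regardless of the rounding rule. Passing an empty list of strata variables to stratified_sample treats the entire table as a single stratum, which matches the behavior described in the note for the Stratify by role.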
Output Data Set
Data set name
specifies the name of the output data set.
Include all variables in the output data set
specifies the variables to include in the output table. By default, all variables are included. Alternatively, you can select only the variables that you want in the output.
Show Output Data
Show
specifies whether to include all of the data in the output data set or only a portion of the data.