Random Sample Task :: SAS(R) Studio 3.3: User's Guide

About the Random Sample Task

The Random Sample task creates an output table that contains a random sample of the rows in the input table.

You might use this task when you need a subset of the data. For example, suppose you want to audit employee travel expenses in an effort to improve the expense reporting procedure and possibly reduce expenses. Because you do not have the resources to examine all expense reports, you can use statistical sampling to objectively select expense reports for audit.

Example: Creating a Random Sample of the Sashelp.Pricedata Data Set

In this example, you want to create a subset of the data in the Sashelp.Pricedata data set.

To create this example:

In the Tasks section, expand the Data folder and double-click Random Sample. The user interface for the Random Sample task opens.
On the Data tab, select the SASHELP.PRICEDATA data set.
To run the task, click .

Here are the tabular results:

The task also creates a sample data set in the Work library. In SAS Studio, this data set opens on the WORK.RandomSample tab.

Ten Sample Rows from the Sashelp.Pricedata Data Set

Assigning Data to Roles

For the Random Sample task, you must specify an input data source. No roles are required to run the task.

Role	Description
Output columns	specifies the variables to include in the output table. By default, all variables are included in the output table. However, you can select the variables to include in the output.
Strata columns	specifies the variables to use to partition the input table into mutually exclusive, nonoverlapping subsets that are known as strata. Each stratum is defined by a set of values of the strata variables, and each stratum is sampled separately. The complete sample is the union of the samples that are taken from all the strata. Note: If you do not assign any variables to this role, then the entire input table is treated as a single stratum. You can allocate the total sample size among the strata in proportion to the size of the stratum. For example, the variable GENDER has possible values of M and F, and the variable VOTED has possible values of Y and N. If you assign both GENDER and VOTED to the Strata columns role, then the input table is partitioned into four strata: males who voted, males who did not vote, females who voted, and females who did not vote. The input table contains 20,000 rows, and the values are distributed as follows: 7,000 males who voted 4,000 males who did not vote 5,000 females who voted 4,000 females who did not vote Therefore, the proportion of males who voted is 7,000/20,000=0.35 or 35%. The proportions in the sample should reflect the proportions of the strata in the input table. For example, if your sample table contains 100 observations, then 35% of the values in the sample must be selected from the males who voted stratum to reflect the proportions in the input table.

Role

Description

Output columns

specifies the variables to include in the output table. By default, all variables are included in the output table. However, you can select the variables to include in the output.

Strata columns

specifies the variables to use to partition the input table into mutually exclusive, nonoverlapping subsets that are known as strata. Each stratum is defined by a set of values of the strata variables, and each stratum is sampled separately. The complete sample is the union of the samples that are taken from all the strata.

Note: If you do not assign any variables to this role, then the entire input table is treated as a single stratum.

You can allocate the total sample size among the strata in proportion to the size of the stratum. For example, the variable GENDER has possible values of M and F, and the variable VOTED has possible values of Y and N. If you assign both GENDER and VOTED to the Strata columns role, then the input table is partitioned into four strata: males who voted, males who did not vote, females who voted, and females who did not vote.

The input table contains 20,000 rows, and the values are distributed as follows:

7,000 males who voted
4,000 males who did not vote
5,000 females who voted
4,000 females who did not vote

Therefore, the proportion of males who voted is 7,000/20,000=0.35 or 35%. The proportions in the sample should reflect the proportions of the strata in the input table. For example, if your sample table contains 100 observations, then 35% of the values in the sample must be selected from the males who voted stratum to reflect the proportions in the input table.

Setting Options

Option Name	Description
Sample size	specifies the sample size in the desired number of rows or in the desired percentage of input rows. For example, if you specify 3% of rows and there are 400 input rows, then the resulting sample has 12 rows. Note: If you assign variables to the Strata columns role, then the sample size specification that you make here applies to each stratum rather than to the entire input table.
Sample method	specifies the method to use when sampling the data. Here are the valid values: Simple (no duplicates) specifies the simple method when sampling the input data. When a row is selected, it is removed from eligibility for subsequent selections. This makes it impossible to select the same row more than once. Unrestricted (duplicates allowed) specifies the unrestricted method when sampling the input data. When a row is selected, it remains eligible for subsequent selections. This makes it possible to select the same row more than once. You can specify how multiple selections of the same row are recorded in the output table. You can choose from the following options: Show each observations once in output (exclude duplicates) a row that is selected n times occurs in the sample once. In the output, the NumberHits variable (which is calculated automatically by the Random Sample task) lists the number of times that the observation occurred in the input table. Show all observations in output (include duplicates) a row that is selected n times occurs in the sample n times.
Location of output data set	specifies the name and location for the output data. By default, the data is saved to the Work library.
Random seed number	specifies the initial seed for the generation of random numbers. If you do not specify a random seed number, then a seed that is based on the system clock will be used to produce the sample.
Generate a sample selection summary	generates a summary table that includes the seed that was used to produce the sample. By specifying this same seed later with the same input table, you can reproduce the same sample.