The HPSAMPLE Procedure

PROC HPSAMPLE statement

  • PROC HPSAMPLE <options>;

The PROC HPSAMPLE statement invokes the procedure.

You can specify the following options:

DATA=<libref.>table

names the table (SAS data set or database table) that you want to sample from. The default is the most recently opened or created data set. If the data are already distributed, the procedure reads the data alongside the distributed database. See the section "Single-machine Mode and Distributed Mode" on page 10 for the various execution modes and the section "Alongside-the-Database Execution" on page 15 for the alongside-the-database model.

NONORM

distinguishes target values that share the same normalized value when you perform stratified sampling or oversampling. For example, if a target has three distinct values, "A", "B", and "b", and you want to treat "B" and "b" as different levels, you need to specify NONORM. By default, "B" and "b" are treated as the same level. PROC HPSAMPLE normalizes a value as follows:

  1. Leading blanks are removed.

  2. The value is truncated to 32 characters.

  3. Letters are changed from lowercase to uppercase.

Note: In the oversampling case, there is no normalization for levels by default. If you do not specify this option, you need to specify a normalized event value in the EVENT= option.
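The three normalization steps above can be approximated in a DATA step. This is only a sketch of the rule as described, not the procedure's internal code; the variable names raw and norm are hypothetical.

```sas
/* Sketch: mimic the documented normalization for three target values. */
data _null_;
   length raw $40 norm $32;      /* assigning to a $32 variable truncates to 32 characters */
   do raw = 'A', 'B', '  b';
      norm = upcase(left(raw));  /* LEFT moves leading blanks to the end; UPCASE uppercases */
      put norm=;
   end;
run;
```

Under this rule, "B" and "  b" normalize to the same value, which is why they form a single level unless you specify NONORM.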

OUT=<libref.>SAS-data-set

names the SAS data set that you want to output the sample to. If you run alongside the database, you need to specify a data set that has the same database libref as the input data set and make sure it does not already exist in the database. This option is required.

Note: This SAS data set contains the sample, including the variables that are specified in the VAR and CLASS statements. If you also specify the PARTITION option, the output includes one more column, _PartInd_. In the oversampling case, an additional column, _Freq_, is provided. It is calculated as the ratio of the rare level’s proportion in the population to its proportion in the sample.
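As a rough illustration of the _Freq_ ratio described above, the following DATA step works through the arithmetic for one made-up case. The proportions and the variable names pop_prop, samp_prop, and freq are assumptions chosen for the example; they are not values that the procedure reports.

```sas
/* Sketch: the _Freq_ ratio for hypothetical proportions. */
data _null_;
   pop_prop  = 0.05;   /* assumed: the rare level makes up 5% of the input data */
   samp_prop = 0.30;   /* assumed: the rare level makes up 30% of the sample    */
   freq = pop_prop / samp_prop;   /* population proportion divided by sample proportion */
   put freq= 8.4;
run;
```

A ratio below 1 reflects that the rare level is overrepresented in the sample relative to the population.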

PARTITION

produces an output data set that has the same number of rows as the input data set but has an additional partition indicator (_PARTIND_), which indicates whether an observation is selected for the sample (1) or not (0). If you also specify the SAMPPCT2= option, _PARTIND_ indicates whether an observation is selected for the first sample (1), the second sample (2), or neither (0).

PARTINDNAME=partition-indicator-name

renames the partition indicator (_PARTIND_) to the specified partition-indicator-name.
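A minimal sketch of the PARTITION and PARTINDNAME= options used together follows. It assumes that the SAMPSIO.HMEQ sample data set is available; the output data set name and the indicator name in_sample are hypothetical.

```sas
/* Sketch: flag roughly 30% of the rows instead of extracting them. */
proc hpsample data=sampsio.hmeq out=work.hmeq_flag
              samppct=30 seed=1234
              partition partindname=in_sample;
   var loan mortdue value;
   class bad job reason;
run;
```

With these settings, the output should keep every input row and carry the indicator in a column named in_sample rather than _PARTIND_.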

SEED=random-seed

specifies the seed for the random number generator. If you do not specify random-seed, or if you specify a nonpositive number, the seed is set to the default value, 12345. Specifying the same SEED= value enables you to reproduce the same sample output.

You can specify the following options only for simple random sampling and stratified sampling:

SAMPOBS=number

specifies the minimum number of observations you want to sample from the input data. The value of number must be a positive integer. If number exceeds the total number of observations in the input data, the output sample has the same number of observations as the input data set.

SAMPPCT=sample-percentage

specifies the sample percentage to be used by PROC HPSAMPLE. The value of sample-percentage should be a positive number less than 100. For example, SAMPPCT=50.5 specifies that you want to sample 50.5% of the data.

Note: You must specify either the SAMPOBS or the SAMPPCT option if you want to perform simple random sampling or stratified sampling. If you specify both options, only the SAMPPCT option is honored.
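The following sketch shows one way these options might be used for simple random sampling and for stratified sampling. It assumes that the SAMPSIO.HMEQ sample data set is available and that strata are named in a TARGET statement, which is part of the wider procedure syntax but is not described in this section. The output data set names are hypothetical.

```sas
/* Sketch: simple random sample of about 10% of the rows. */
proc hpsample data=sampsio.hmeq out=work.hmeq_srs
              samppct=10 seed=1234;
   var loan mortdue value;
   class bad job reason;
run;

/* Sketch: stratified sample of at least 200 rows, stratified by JOB. */
/* The TARGET statement is an assumption; see the lead-in above.      */
proc hpsample data=sampsio.hmeq out=work.hmeq_strat
              sampobs=200 seed=1234;
   var loan mortdue value;
   class bad job reason;
   target job;
run;
```

Each call specifies only one of SAMPPCT= and SAMPOBS=, which avoids relying on the rule that SAMPPCT= takes precedence.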

SAMPPCT2=sample-percentage

partitions the input data into three parts when specified along with the SAMPPCT= and PARTITION options. The percentage of the sample whose _PARTIND_=1 is specified in the SAMPPCT= option, the percentage of the sample whose _PARTIND_=2 is specified in the SAMPPCT2= option, and the percentage of the sample whose _PARTIND_=0 is 100 minus the sum of the values of the SAMPPCT= and SAMPPCT2= options. The sum of the sample-percentages specified in this option and in the SAMPPCT= option must be a positive number less than 100.
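As a sketch of a three-way partition, the following call requests 60% for the first part and 30% for the second, which leaves 100 - (60 + 30) = 10% with an indicator value of 0. It assumes that the SAMPSIO.HMEQ sample data set is available; the output data set name is hypothetical.

```sas
/* Sketch: split the input into 60% / 30% / 10% parts. */
proc hpsample data=sampsio.hmeq out=work.hmeq_part
              samppct=60 samppct2=30 seed=1234 partition;
   var loan mortdue value;
   class bad job reason;
run;

/* Check the realized share of each partition value (1, 2, and 0). */
proc freq data=work.hmeq_part;
   tables _PartInd_;
run;
```

A split of this kind is commonly used to separate training, validation, and test data, although the procedure itself attaches no such meaning to the indicator values.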

You can specify the following options only for oversampling:

EVENT="rare-event-level"

specifies the rare event level. If you specify this option, PROC HPSAMPLE uses an oversampling technique to adjust the class distribution of a data set, and the following two options are required.

SAMPPCTEVT=sample-event-percentage

specifies the sample percentage from the event level. The value of sample-event-percentage should be a positive number less than 100. For example, SAMPPCTEVT=50.5 specifies that you want to sample 50.5 percent of the rare event level.

EVENTPROP=event-proportion

specifies the proportion of rare events in the sample. The value of event-proportion should be a positive number less than 1. For example, EVENTPROP=0.3 specifies that rare events make up 30% of the sample, so the ratio of rare events to non-rare events is 3:7.
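The three oversampling options might be combined as in the following sketch. It assumes that the SAMPSIO.HMEQ sample data set is available, that BAD=1 is the rare level, and that the target is named in a TARGET statement, which is not described in this section. The output data set name is hypothetical.

```sas
/* Sketch: keep about 80% of the rare events and choose enough non-rare rows */
/* that rare events make up 30% of the sample.                              */
proc hpsample data=sampsio.hmeq out=work.hmeq_over
              event="1" samppctevt=80 eventprop=0.3 seed=1234;
   var loan mortdue value;
   class bad job reason;
   target bad;   /* assumed statement; see the lead-in above */
run;
```

Because EVENT= is specified, both SAMPPCTEVT= and EVENTPROP= are supplied, and the output is expected to include the _Freq_ column described under the OUT= option.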