The HPSAMPLE Procedure

PROC HPSAMPLE statement

PROC HPSAMPLE <options> ;

The PROC HPSAMPLE statement invokes the procedure.

You can specify the following options:

DATA=<libref.>table

names the table (SAS data set or database table) that you want to sample from. The default is the most recently opened or created data set. If the data are already distributed, the procedure reads the data alongside the distributed database. See the section Single-machine Mode and Distributed Mode on page 10 for the various execution modes and the section Alongside-the-Database Execution on page 15 for the alongside-the-database model.

NONORM

distinguishes target values that share the same normalized value when you perform stratified sampling or oversampling. For example, if a target has three distinct values, "A", "B", and "b", and you want to treat "B" and "b" as different levels, specify NONORM. By default, "B" and "b" are treated as the same level. PROC HPSAMPLE normalizes a value as follows:

  1. Leading blanks are removed.

  2. The value is truncated to 32 characters.

  3. Letters are changed from lowercase to uppercase.

Note: In the oversampling case, there is no normalization for levels by default. If you do not specify the NONORM option, you need to specify a normalized event value in the EVENT= option.
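The three normalization steps listed for NONORM can be sketched in Python as follows. This is an illustration of the documented rule only; it is not SAS code and not the procedure's implementation, and the helper name normalize_level is hypothetical.

```python
def normalize_level(value: str) -> str:
    """Illustrative sketch of the documented normalization steps (hypothetical helper)."""
    value = value.lstrip(" ")   # 1. leading blanks are removed
    value = value[:32]          # 2. the value is truncated to 32 characters
    return value.upper()        # 3. lowercase letters are changed to uppercase


levels = ["A", "B", "b"]

# Default behavior: "B" and "b" share one normalized value, so they form one level.
normalized_levels = sorted({normalize_level(v) for v in levels})

# With NONORM: the raw values are kept, so "B" and "b" remain distinct levels.
raw_levels = sorted(set(levels))
```

Under this sketch, the default yields two levels and NONORM yields three, which matches the "A", "B", and "b" example in the text.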

OUT=<libref.>SAS-data-set

names the SAS data set to which the sample is written. If you run alongside the database, you need to specify a data set that has the same database libref as the input data set, and you must make sure that it does not already exist in the database. This option is required.

Note: This SAS data set contains the sample, which includes the variables that are specified in the VAR and CLASS statements. If you also specify the PARTITION option, the output includes one more column, _PartInd_. In the oversampling case, an additional column, _Freq_, is provided. It is calculated as the ratio of the rare level’s proportion in the population to its proportion in the sample.
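The _Freq_ ratio described in the note can be worked through with a small Python sketch. The function name and the counts are hypothetical and chosen only to make the arithmetic concrete; the source defines the ratio for the rare level, so only that case is shown.

```python
def rare_level_freq(pop_events: int, pop_total: int,
                    sample_events: int, sample_total: int) -> float:
    """Ratio of the rare level's population proportion to its sample proportion."""
    pop_prop = pop_events / pop_total            # proportion in the population
    sample_prop = sample_events / sample_total   # proportion in the sample
    return pop_prop / sample_prop


# Hypothetical counts: 50 of 1,000 input rows are rare events (5 percent),
# and 25 of 125 sampled rows are rare events (20 percent).
freq = rare_level_freq(pop_events=50, pop_total=1000,
                       sample_events=25, sample_total=125)
```

Here the rare level is four times as common in the sample as in the population, so the ratio is approximately 0.25.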

PARTITION

produces an output data set that has the same number of rows as the input data set but has an additional partition indicator (_PartInd_), which indicates whether an observation is selected for the sample (1) or not (0).
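The shape of the PARTITION output can be illustrated with a short Python sketch: every input row is kept, and an indicator marks the rows that were drawn. The function name is hypothetical, and Python's random module stands in for the procedure's own random number generator, so the selected rows will not match what PROC HPSAMPLE selects.

```python
import random


def partition_indicator(n_rows: int, n_sample: int, seed: int) -> list:
    """Return one 0/1 flag per input row: 1 if the row is in the sample, else 0."""
    rng = random.Random(seed)                           # stand-in generator (assumption)
    chosen = set(rng.sample(range(n_rows), n_sample))   # row indices drawn without replacement
    return [1 if i in chosen else 0 for i in range(n_rows)]


# Hypothetical input of 10 rows from which 4 are sampled.
part_ind = partition_indicator(n_rows=10, n_sample=4, seed=2024)
```

The resulting list has as many entries as the input has rows, and its sum equals the sample size.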

SEED=random-seed

specifies the seed for the random number generator. If you do not specify random-seed or you specify a negative number, the seed defaults to 12345. The SEED= option enables you to reproduce the same sample output.
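The defaulting rule for SEED= can be sketched as follows. This is a Python illustration under stated assumptions: the helper names are hypothetical, Python's generator is not the one PROC HPSAMPLE uses, and the source does not say how a seed of 0 is handled, so the sketch passes it through unchanged.

```python
import random

DEFAULT_SEED = 12345  # default documented for SEED=


def effective_seed(seed=None):
    """Fall back to the default when the seed is missing or negative."""
    if seed is None or seed < 0:
        return DEFAULT_SEED
    return seed


def draw_rows(seed=None, n_rows=100, n_sample=5):
    """Draw row indices with a generator seeded by the effective seed."""
    rng = random.Random(effective_seed(seed))
    return sorted(rng.sample(range(n_rows), n_sample))


# The same seed reproduces the same draw; a missing and a negative seed
# both resolve to the default and therefore agree with each other.
same_seed_repeats = draw_rows(42) == draw_rows(42)
default_matches_negative = draw_rows() == draw_rows(-7)
```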

You can specify the following options only for simple random sampling and stratified sampling:

SAMPOBS=number

specifies the minimum number of observations you want to sample from the input data. The value of number must be a positive integer. If number exceeds the total number of observations in the input data, the output sample has the same number of observations as the input data set.

SAMPPCT=sample-percentage

specifies the sample percentage to be used by PROC HPSAMPLE. The value of sample-percentage should be a positive number less than 100. For example, SAMPPCT=50.5 specifies that you want to sample 50.5 percent of the data.

Note: You must specify either the SAMPOBS or the SAMPPCT option if you want to perform simple random sampling or stratified sampling. If you specify both options, only the SAMPPCT option is honored.
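The interaction between SAMPOBS= and SAMPPCT= can be summarized in a small Python sketch. The function name is hypothetical, and rounding the percentage to a whole number of rows is an assumption; the source does not state how PROC HPSAMPLE rounds, and it describes SAMPOBS= as a minimum, so the actual sample can differ from the figure computed here.

```python
def requested_sample_size(n_rows, sampobs=None, samppct=None):
    """Resolve the requested sample size from SAMPOBS= and SAMPPCT= (illustrative)."""
    if samppct is not None:
        # If both options are given, only SAMPPCT= is honored.
        if not 0 < samppct < 100:
            raise ValueError("SAMPPCT must be a positive number less than 100")
        return round(n_rows * samppct / 100)  # rounding rule is an assumption
    if sampobs is not None:
        if sampobs < 1 or int(sampobs) != sampobs:
            raise ValueError("SAMPOBS must be a positive integer")
        # A request larger than the input yields the whole input.
        return min(int(sampobs), n_rows)
    raise ValueError("specify either SAMPOBS or SAMPPCT")


size_from_obs = requested_sample_size(1000, sampobs=20)
size_capped = requested_sample_size(10, sampobs=50)
size_both = requested_sample_size(1000, sampobs=20, samppct=50.5)
```

In the last call both options are supplied, so the percentage determines the size and the SAMPOBS= value is ignored.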

You can specify the following options only for oversampling:

EVENT=rare-event-level

specifies the rare event level. If you specify this option, PROC HPSAMPLE uses an oversampling technique to adjust the class distribution of a data set, and the SAMPPCTEVT= and EVENTPROP= options are required.

SAMPPCTEVT=sample-event-percentage

specifies the sample percentage from the event level. The value of sample-event-percentage should be a positive number less than 100. For example, SAMPPCTEVT=50.5 specifies that you want to sample 50.5 percent of the rare event level.

EVENTPROP=event-proportion

specifies the proportion of rare events in the sample. The value of event-proportion should be a positive number less than 1. For example, EVENTPROP=0.3 specifies that you want the ratio of rare events to non-rare events to be 3:7.
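The arithmetic implied by SAMPPCTEVT= and EVENTPROP= can be worked through with a Python sketch. The function name and the input counts are hypothetical. The sketch assumes that the non-event count is whatever makes the event share of the sample equal EVENTPROP=, and that counts are rounded to whole rows; the source does not describe how the procedure derives or rounds these counts, or what happens when too few non-event rows are available.

```python
def oversample_counts(n_events, samppctevt, eventprop):
    """Return (sampled events, sampled non-events) implied by the two options."""
    if not 0 < samppctevt < 100:
        raise ValueError("SAMPPCTEVT must be a positive number less than 100")
    if not 0 < eventprop < 1:
        raise ValueError("EVENTPROP must be a positive number less than 1")
    # SAMPPCTEVT= is the percentage of rare-event rows to keep.
    sampled_events = round(n_events * samppctevt / 100)
    # EVENTPROP= fixes events / (events + non-events) in the sample.
    sampled_nonevents = round(sampled_events * (1 - eventprop) / eventprop)
    return sampled_events, sampled_nonevents


# Hypothetical input with 600 rare-event rows, SAMPPCTEVT=50 and EVENTPROP=0.3.
events, nonevents = oversample_counts(n_events=600, samppctevt=50, eventprop=0.3)
```

With these inputs the sketch keeps 300 event rows and pairs them with 700 non-event rows, which is the 3:7 ratio from the EVENTPROP=0.3 example.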