The PROC HPSAMPLE statement invokes the procedure.
You can specify the following options:
-
DATA=<libref.>table
-
names the table (SAS data set or database table) that you want to sample from. The default is the most recently opened or
created data set. If the data are already distributed, the procedure reads the data alongside the distributed database. See
the section "Single-machine Mode and Distributed Mode" on page 10 for the various execution modes and the section "Alongside-the-Database
Execution" on page 15 for the alongside-the-database model.
-
NONORM
-
distinguishes target values that share the same normalized value when you perform stratified sampling or oversampling. For
example, if a target has three distinct values, "A", "B", and "b", and you want to treat "B" and "b" as different levels,
you need to use NONORM. By default, "B" and "b" are treated as the same level. PROC HPSAMPLE normalizes a value as follows:
-
Leading blanks are removed.
-
The value is truncated to 32 characters.
-
Letters are changed from lowercase to uppercase.
Note: In the oversampling case, there is no normalization for levels by default. If you do not specify this option, you need
to specify a normalized event value in the EVENT= option.
-
OUT=<libref.>SAS-data-set
-
names the SAS data set that you want to output the sample to. If you run alongside the database, you need to specify a data
set that has the same database libref as the input data set and make sure it does not already exist in the database. This option is required.
Note: This SAS data set contains the sample, including the variables that are specified in the VAR and CLASS statements.
If you also specify the PARTITION option, the output includes one more column, _PartInd_. In the oversampling case, an additional
column, _Freq_, is provided. It is calculated as the ratio of the rare level's proportion in the population to its proportion
in the sample.
-
PARTITION
-
produces an output data set that has the same number of rows as the input data set but has an additional partition indicator
(_PARTIND_), which indicates whether an observation is selected for the sample (1) or not (0). If you also specify the SAMPPCT2=
option, _PARTIND_ indicates whether an observation is selected for the first sample (1), for the second sample (2), or for neither (0).
-
PARTINDNAME=partition-indicator-name
-
renames the partition indicator (_PARTIND_) to the specified partition-indicator-name.
-
SEED=random-seed
-
specifies the seed for the random number generator. If you do not specify random-seed or you specify a nonpositive number, the seed defaults to 12345. Specifying the SEED= option enables you to reproduce
the same sample output.
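The options above can be combined as in the following minimal sketch of simple random sampling. It assumes that the SAS sample library Sampsio and its Hmeq table are available at your site; the variable names, the output table name Work.Smp, and the indicator name InSample are illustrative choices, not requirements. The SAMPPCT= option that the sketch uses is described later in this section.

```sas
/* Sketch only: assumes Sampsio.Hmeq exists and contains these variables. */
proc hpsample data=Sampsio.Hmeq out=Work.Smp
              samppct=10 seed=1234
              partition partindname=InSample;
   var loan mortdue value;      /* numeric variables to carry into OUT=        */
   class bad reason job;        /* classification variables to carry into OUT= */
run;

/* Count the rows flagged 1 (in the sample) and 0 (not in the sample). */
proc freq data=Work.Smp;
   tables InSample;
run;
```

Because PARTITION is specified, Work.Smp should have as many rows as the input table, and rerunning the step with the same SEED= value should reproduce the same flags.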
You can specify the following options only for simple random sampling and stratified sampling:
-
SAMPOBS=number
-
specifies the minimum number of observations you want to sample from the input data. The value of number must be a positive integer. If number exceeds the total number of observations in the input data, the output sample has the same number of observations as the
input data set.
-
SAMPPCT=sample-percentage
-
specifies the sample percentage to be used by PROC HPSAMPLE. The value of sample-percentage should be a positive number less than 100. For example, SAMPPCT=50.5 specifies that you want to sample 50.5% of the data.
Note: You must specify either the SAMPOBS or the SAMPPCT option if you want to perform simple random sampling or stratified sampling.
If you specify both options, only the SAMPPCT option is honored.
-
SAMPPCT2=sample-percentage
-
partitions the input data into three parts when specified along with the SAMPPCT= and PARTITION options. The SAMPPCT= option
specifies the percentage of observations whose _PARTIND_=1, the SAMPPCT2= option specifies the percentage of observations
whose _PARTIND_=2, and the percentage of observations whose _PARTIND_=0 is 100 minus the sum of the values of the SAMPPCT=
and SAMPPCT2= options. The sum of the sample-percentages specified in this option and in the SAMPPCT= option must be a positive number less than 100.
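As an illustration of the three-way partition, the following sketch performs stratified sampling and splits the input into parts of roughly 60%, 20%, and 20%. It again assumes that Sampsio.Hmeq is available; the variable list and the output table name Work.Parts are hypothetical. The TARGET statement names the stratification variable, which must also appear in the CLASS statement.

```sas
/* Sketch only: assumes Sampsio.Hmeq exists and contains these variables. */
proc hpsample data=Sampsio.Hmeq out=Work.Parts
              samppct=60 samppct2=20 seed=1234
              partition;
   var loan mortdue value;
   class bad job;
   target bad;                  /* stratify on the levels of BAD */
run;

/* _PartInd_ = 1 (SAMPPCT= part), 2 (SAMPPCT2= part), or 0 (the remaining rows). */
proc freq data=Work.Parts;
   tables _PartInd_ * bad;
run;
```

Here SAMPPCT= and SAMPPCT2= sum to 80, so the remaining 100 - 80 = 20 percent of the observations receive _PartInd_=0. Add the NONORM option if target levels that differ only in case or leading blanks should form separate strata.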
You can specify the following options only for oversampling:
-
EVENT="rare-event-level"
-
specifies the rare event level. If you specify this option, PROC HPSAMPLE uses an oversampling technique to adjust the class
distribution of a data set, and the following two options are required.
-
SAMPPCTEVT=sample-event-percentage
-
specifies the sample percentage from the event level. The value of sample-event-percentage should be a positive number less than 100. For example, SAMPPCTEVT=50.5 specifies that you want to sample 50.5 percent of
the rare event level.
-
EVENTPROP=event-proportion
-
specifies the proportion of rare events in the sample. The value of event-proportion should be a positive number less than 1. For example, EVENTPROP=0.3 specifies that you want the ratio of rare events
to non-rare events to be 3:7.
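The three oversampling options can be used together as in the following sketch. It assumes that Sampsio.Hmeq is available and that the level "1" of the variable BAD is the rare event; the variable list and the output table name Work.Over are illustrative.

```sas
/* Sketch only: assumes Sampsio.Hmeq exists and that BAD="1" is the rare level. */
proc hpsample data=Sampsio.Hmeq out=Work.Over
              event="1" samppctevt=90 eventprop=0.3
              seed=1234;
   var loan mortdue value;
   class bad job;
   target bad;                  /* the variable whose rare level is oversampled */
run;

/* Inspect the class distribution of the oversampled output. */
proc freq data=Work.Over;
   tables bad;
run;
```

In this sketch, SAMPPCTEVT=90 requests 90 percent of the rare-event observations, and EVENTPROP=0.3 requests that they make up 30 percent of the sample, which corresponds to the ratio 0.3 : (1 - 0.3) = 3:7. Whether the request can be satisfied depends on how many non-rare observations the input contains.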