PARTITION
<partition-options> ;
The PARTITION statement specifies how observations in the input data set are logically partitioned into disjoint subsets for
model training and validation. Either you can designate a variable in the input data set and a set of formatted values of
that variable to determine the role of each observation, or you can specify proportions to use for random assignment of observations
to each role.
You can specify one (but not both) of the following:
-
FRACTION(VALIDATE=fraction <SEED=number>)
-
requests that specified proportions of the observations in the input data set be randomly assigned to training and validation
roles. You specify the proportion for validation by using the VALIDATE= suboption; the remaining observations are assigned to
training. The SEED= suboption sets the seed for the random number generator; the default is SEED=3054. Because fraction is a
per-observation probability, setting fraction too low can result in an empty or nearly empty validation set.
Because the FRACTION option specifies a per-observation probability, the number of observations that are selected for the
validation set is not fixed and can differ from fraction times the number of observations. Different partitions can also be
observed when the number of nodes or threads changes or when PROC HPSPLIT runs in alongside-the-database mode.
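To see how small a validation set can become, the following DATA step is a rough sketch that computes the expected validation-set size and the chance of an empty validation set. The values of n and p are hypothetical, and the calculation assumes that observations are assigned independently of one another, which this section does not state explicitly.

```sas
/* Sketch only: n and p are hypothetical values, not defaults. */
data _null_;
   n = 50;                      /* number of observations in the input data set */
   p = 0.01;                    /* value given to VALIDATE= */
   expected = n * p;            /* expected number of validation observations */
   p_empty  = (1 - p) ** n;     /* probability that no observation is selected */
   put expected= p_empty=;      /* writes both values to the SAS log */
run;
```

With these values the expected validation-set size is 0.5 observations, and the validation set is empty in roughly 6 out of 10 runs.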
The following PARTITION statement shows how to use a probability of choosing a particular observation for the validation set:
partition fraction(validate=0.1 seed=1234);
In this example, any particular observation has a 10% probability of being selected for the validation set. All nonselected
observations are in the training set. The seed that is used for the random number generator is specified by the SEED= suboption.
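For context, a PARTITION statement of this kind might sit inside a complete PROC HPSPLIT step as sketched below. The data set work.mydata and the variables y, x1, x2, and c1 are hypothetical. The TARGET and INPUT statements are an assumption about your release of the procedure; some releases use CLASS and MODEL statements instead. The SEED= suboption is written inside the FRACTION parentheses.

```sas
/* Sketch only: work.mydata, y, x1, x2, and c1 are hypothetical names. */
proc hpsplit data=work.mydata;
   target y;                          /* response variable (assumed statement) */
   input x1 x2 / level=int;           /* interval predictors (assumed statement) */
   input c1 / level=nom;              /* nominal predictor (assumed statement) */
   partition fraction(validate=0.1 seed=1234);  /* about 10% validation */
run;
```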
-
ROLEVAR=variable(TRAIN='value', VALID='value')
-
names the variable in the input data set whose values are used to assign roles to each observation. The formatted values of this variable, which are used to assign observations roles, are specified in the TRAIN= and VALID= suboptions.
In the following example, the ROLEVAR= option specifies _PARTIND_ as the variable in the input data set that is used to assign
each observation to the training or validation set.
partition rolevar=_partind_(TRAIN='1', VALID='0');
The TRAIN= and VALID= suboptions provide the values that indicate whether an observation is in the training or validation set,
respectively. Observations in which the variable is missing or has a value that corresponds to neither suboption are ignored.
Formatting and normalization are performed before comparison, so you should specify numeric variable values as formatted
values, as in the preceding example.
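One way to obtain a role variable is to create it in a DATA step before you call the procedure, so that the partition is stored with the data. The following is a minimal sketch under stated assumptions: work.mydata, work.mydata_part, y, x1, and x2 are hypothetical names, the 10% split is arbitrary, and the TARGET and INPUT statements are an assumption about your release of PROC HPSPLIT.

```sas
/* Sketch only: data set and variable names are hypothetical. */
data work.mydata_part;
   set work.mydata;
   call streaminit(1234);                  /* seed for the RAND function */
   _partind_ = (rand('uniform') >= 0.1);   /* 1 = training, 0 = validation */
run;

proc hpsplit data=work.mydata_part;
   target y;                               /* assumed statement */
   input x1 x2 / level=int;                /* assumed statement */
   partition rolevar=_partind_(TRAIN='1', VALID='0');
run;
```

Because _PARTIND_ is numeric, the values '1' and '0' in the PARTITION statement are its formatted values.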
Copyright © SAS Institute Inc. All Rights Reserved.