The PARTITION statement specifies how observations in the input data set are logically partitioned into disjoint subsets for
model training and testing. You can either designate a variable in the input data set, together with a set of formatted values of that
variable, to determine the role of each observation, or specify the proportions to use for randomly assigning observations
to each role.
You can specify one (but not both) of the following partition-options:
-
ROLEVAR | ROLE=variable (<TEST='value'> <TRAIN='value'>)
-
names the variable in the input data set whose values are used to assign a role to each observation. The TEST= and TRAIN= suboptions
specify the formatted values of this variable that identify testing and training observations, respectively. If you specify
only the TEST= suboption, then all observations whose role is not determined by the TEST= suboption are assigned to training.
If you specify only the TRAIN= suboption, then all observations whose role is not determined by the TRAIN= suboption are assigned
to testing.
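The following is a minimal sketch of the ROLEVAR= form, not an example taken from this documentation. The data set work.pls_roles, the variables x1, x2, and y, and the role variable role with values 'TEST' and 'TRAIN' are all hypothetical names that are introduced only for illustration. Because role is a character variable that has no format attached, its formatted values are the same as its stored values.

```sas
/* Hypothetical data: 200 observations, every fourth one marked for testing */
data work.pls_roles;
   call streaminit(1);
   length role $5;
   do i = 1 to 200;
      x1 = rand('normal');
      x2 = rand('normal');
      y  = 2*x1 - x2 + rand('normal');
      if mod(i, 4) = 0 then role = 'TEST';
      else role = 'TRAIN';
      output;
   end;
   drop i;
run;

/* Roles are taken from the formatted values of the variable role */
proc hppls data=work.pls_roles;
   model y = x1 x2;
   partition rolevar=role(test='TEST' train='TRAIN');
run;
```

If only test='TEST' were specified in this sketch, the remaining 150 observations would still be assigned to training, as described above.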
-
FRACTION( <TEST=fraction> <SEED=n> )
-
requests that specified proportions of the observations in the input data set be randomly assigned to the training and testing roles.
You specify the proportion for testing by using the TEST= suboption; the specified fraction must be less than 1, and the remaining
fraction of the observations is assigned to the training role. If you do not specify the TEST= suboption, the default fraction
is 0.5. The SEED= suboption specifies an integer that is used to start the pseudorandom number generator for the random partitioning
of data into training and testing sets. If you do not specify a seed, or if you specify a value less than or equal to 0, the seed
is generated by reading the time of day from the computer’s clock.
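As a minimal sketch of the FRACTION form, the following statements request that about 30% of the observations be used for testing and supply a positive seed so that the clock is not used to generate one. The data set work.pls_sample, the variables x1, x2, and y, and the seed value 12345 are hypothetical and are chosen only for illustration.

```sas
/* Hypothetical data: 200 observations, two predictors, one response */
data work.pls_sample;
   call streaminit(1);
   do i = 1 to 200;
      x1 = rand('normal');
      x2 = rand('normal');
      y  = 2*x1 - x2 + rand('normal');
      output;
   end;
   drop i;
run;

/* Each observation is assigned the testing role with probability 0.3 */
proc hppls data=work.pls_sample;
   model y = x1 x2;
   partition fraction(test=0.3 seed=12345);
run;
```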
Because fraction is a per-observation probability (that is, each observation has probability fraction of being assigned the testing
role), the number of observations that are assigned to each role is itself random and need not match the specified proportion
exactly. Different partitions can also be observed when the number of nodes or threads changes or when PROC HPPLS runs in
alongside-the-database mode.
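The effect of a per-observation probability can be illustrated with a short DATA step. This sketch does not reproduce the random number stream that PROC HPPLS uses; it only mimics the idea by assigning each of 100 hypothetical observations to testing with probability 0.3, under the assumption that the assignments are independent. Under that assumption the expected number of testing observations is 100 × 0.3 = 30, but the count that PROC FREQ reports is generally not exactly 30, and it changes when the value passed to CALL STREAMINIT changes. The data set name work.assign_demo and the seed 2024 are illustrative only.

```sas
/* Illustration only: independent per-observation assignment */
data work.assign_demo;
   call streaminit(2024);
   length role $5;
   do obs = 1 to 100;
      /* rand('uniform') falls below 0.3 with probability 0.3 */
      if rand('uniform') < 0.3 then role = 'TEST';
      else role = 'TRAIN';
      output;
   end;
run;

/* Count how many observations landed in each role */
proc freq data=work.assign_demo;
   tables role;
run;
```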