The PARTSIZE= option
forces the SPD Engine to partition data files at the specified size.
The actual size of each partition is computed to accommodate the maximum
number of rows that fit in the specified size n.
Multiple partitions
are necessary to read the data in SPD Engine data sets in parallel.
The value is specified when an SPD Engine data set is created and
is fixed thereafter. This specification applies only to the data
component files. By splitting (partitioning) the data portion of a
data set into fixed-sized files, the engine can introduce a high degree
of scalability for some operations. The engine can spawn threads in
parallel (for example, up to one thread per partition for WHERE evaluations).
Separate data partitions also enable the engine to process the data
without the overhead of file access contention between the threads.
Because each partition is one file, the trade-off for a small partition
size is that more files (for example, more UNIX i-nodes)
are required to store the observations.
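For example, the following minimal sketch shows one way that PARTSIZE= might
be specified when a data set is created. The library name, paths, and the
128-megabyte value are illustrative assumptions, not recommendations:

   /* Assign an SPD Engine library (metadata path is hypothetical). */
   libname spdelib spde '/spde/meta';

   /* Create a data set with a fixed 128 MB partition size.         */
   /* The partition size cannot be changed after creation.          */
   data spdelib.sales (partsize=128m);
      set work.sales_staging;
   run;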
The partition size determines
a unit of work for many of the parallel operations that require full
data set scans. But, having more partitions does not always mean faster
processing. The trade-offs involve balancing the increased number
of physical files (partitions) that are required to store the data
set against the amount of work that can be done in parallel by having
more partitions. More partitions means more open files to process
the data set, but a smaller number of rows in each partition.
A general rule is to
have 10 or fewer partitions per data path, and 3 to 4 partitions per
CPU. (Some operating systems have a limit on the number of open files
that you can use.)
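As a rough, hypothetical illustration of that rule: on a host with 4 CPUs
and 4 data paths, 3 to 4 partitions per CPU suggests roughly 12 to 16
partitions in total. For a data set whose data component is about 64 GB,
that works out to a PARTSIZE= of roughly 4 to 5 GB per partition, which
also stays within 10 partitions per data path.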
To determine an adequate
partition size for a new SPD Engine data set, you should be aware
of the following:
- the types of applications that run against the data
- how many CPUs are available to the applications
- which disks are available for storing the partitions
- the relationships of these disks to the CPUs
For example, if each
CPU controls only one disk, then an appropriate partition size would
be one in which each disk contains approximately the same amount of
data. If each CPU controls two disks, then an appropriate partition
size would be one in which the load is balanced across both disks so that
each CPU does approximately the same amount of work.
Ultimately, the scalability
limits of PARTSIZE= depend on how you structure the DATAPATH= option.
See information about the DATAPATH= LIBNAME option in
SAS Scalable Performance Data Engine: Reference. Specifically, the limits depend on how you configure
and spread the DATAPATH= file systems across striped volumes. You
should spread each individual volume's striping configuration
across multiple disk controllers or SCSI channels in the disk storage
array. The goal for the configuration is, at the hardware level, to
maximize parallelism during data retrieval.
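The following sketch shows one possible DATAPATH= arrangement under these
assumptions: four file systems, each on its own striped volume behind a
separate controller, with a library-level PARTSIZE= default (paths and
values are hypothetical):

   /* DATAPATH= spread over four file systems; each file system is
      assumed to sit on its own striped volume and controller.      */
   libname spdelib spde '/spde/meta'
      datapath=('/fs1/spde' '/fs2/spde' '/fs3/spde' '/fs4/spde')
      partsize=512m;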
The PARTSIZE= specification
is limited by the MINPARTSIZE= system option. MINPARTSIZE= ensures
that an over-zealous SAS user does not create arbitrarily small partitions,
thereby generating a large number of files. The default MINPARTSIZE=
for SPD Engine data sets is 16 megabytes and probably should not be
lowered much below this value.
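MINPARTSIZE= is a system option, so it is typically set when SAS starts
(for example, in the configuration file); the 256-megabyte value below is
purely illustrative, and whether your release accepts the M suffix is an
assumption. The current setting can be checked with PROC OPTIONS:

   /* In the configuration file or on the SAS command line (illustrative): */
   -minpartsize 256m

   /* Display the current value from within a SAS session: */
   proc options option=minpartsize;
   run;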
Note: The PARTSIZE= value for a
data set cannot be changed after a data set is created. To change
the PARTSIZE=, you must delete the table with the DROP TABLE statement
and create it again with the CREATE TABLE statement.
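For example, a hedged sketch of that drop-and-recreate sequence in PROC SQL
(the table name and the 256-megabyte value are hypothetical; copy the data
first so that it is not lost when the table is dropped):

   proc sql;
      /* Preserve the existing rows before dropping the table. */
      create table work.sales_backup as
         select * from spdelib.sales;

      drop table spdelib.sales;

      /* Re-create the data set with a new partition size. */
      create table spdelib.sales (partsize=256m) as
         select * from work.sales_backup;
   quit;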