SPD Engine data must be stored in multiple partitions for it to be subsequently processed in
parallel. Specifying PARTSIZE= forces the software to partition SPD Engine data files
at the specified size. The actual size of the partition is
computed to accommodate the maximum number of observations that fit in the specified
size of
n megabytes,
gigabytes, or terabytes.
By splitting (partitioning) the data portion of an
SPD Engine data set into fixed-sized files, the software can introduce a high degree of
scalability for some operations. The SPD Engine can
spawn threads in parallel (for example, up to one
thread per partition for WHERE evaluations). Separate data partitions also enable the SPD
Engine to process
the data without the overhead of file access contention between the threads. Because
each partition is one file, the trade-off for a small partition size is that an increased
number of files (for example, UNIX i-nodes) is required to store the observations.
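As a sketch of how partitioning is requested (the library name and paths here are hypothetical), a LIBNAME statement might combine PARTSIZE= with DATAPATH=:

```
/* Hypothetical paths; PARTSIZE= forces data partitions of about 256 MB */
libname sales spde '/spde/meta'
   datapath=('/spde/data1' '/spde/data2')
   partsize=256m;
```

With this specification, the data portion of each data set in the library is split into roughly 256 MB files spread across the two data paths.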
The scalability that you can achieve with PARTSIZE= depends on how you configure
and spread the file systems specified in the DATAPATH= option across striped
volumes. (You should spread each individual volume's striping configuration across
multiple disk controllers or SCSI channels in the disk storage array.)
The goal for the configuration is to maximize parallelism during data
retrieval. For information about disk striping, see “
I/O Setup and Validation” under “SPD Engine” in Scalability and Performance at
http://support.sas.com/rnd/scalability.
The PARTSIZE= specification is limited by the SPD Engine system option MINPARTSIZE=,
which is usually set and maintained by the system administrator.
MINPARTSIZE= ensures that an inexperienced user does not arbitrarily create small
partitions and thereby generate a large number of files.
The partition size determines a unit of work for many of the parallel operations that
require full
data set scans. But more partitions do not always mean faster processing. The trade-offs
involve balancing the increased number of physical files (partitions) required to
store the data set against the amount of work that can be done in parallel by having
more partitions. More partitions mean more open files to process the data set, but
fewer observations in each partition. A general rule is to have 10 or
fewer partitions per data path and 3 to 4 partitions per CPU.
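These rules of thumb can be turned into a quick sizing calculation. The figures below are hypothetical; substitute your own data set size, CPU count, and number of data paths:

```
/* Hypothetical sizing: ~96 GB of data, 8 CPUs, 4 data paths */
data _null_;
   datasize_mb = 98304;                       /* total data size in MB    */
   cpus        = 8;
   npaths      = 4;
   nparts      = min(4 * cpus, 10 * npaths);  /* 3-4 per CPU, <=10/path   */
   partsize    = ceil(datasize_mb / nparts);  /* MB per partition         */
   put 'Suggested partitions: ' nparts 'of about ' partsize 'MB each';
run;
```

Here the CPU rule caps the count at 32 partitions, giving partitions of roughly 3 GB each.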
To determine an adequate partition size for a new SPD Engine data set, you should
be aware of the following:
- the types of applications that run against the data
- how many CPUs are available to the applications
- which disks are available for storing the partitions
- the relationships of these disks to the CPUs
For example, if each CPU controls only one disk, then an appropriate partition size
would be one in which each disk contains approximately the same amount of data.
If each CPU controls two disks, then an appropriate partition size would be one in
which the load is balanced so that each CPU does approximately the same amount of work.
Note: The PARTSIZE= value for a
data set cannot be changed after the data set is created. To change
PARTSIZE=, you must re-create the data set, specifying a different
PARTSIZE= value in the LIBNAME statement or on the new (output) data
set.
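For example (the library and data set names are hypothetical), a copy with a different partition size can be written by placing the PARTSIZE= data set option on the output data set:

```
/* Re-create the data set with 512 MB partitions */
data sales.trans_new (partsize=512m);
   set sales.trans;
run;
```

The original data set is read as-is; only the new output data set uses the new partition size.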