The PARTSIZE= option forces the software to partition data files at the specified size. The actual partition size is computed to accommodate the maximum number of rows that fit in the specified n megabytes.
Multiple partitions are necessary to read data in parallel. The value is specified when a table is created and is fixed for the life of the table. The specification applies only to the data component files. Splitting (partitioning)
the data portion of a table into fixed-sized files can introduce a
high degree of scalability for some operations. The software can spawn
threads in parallel (for example, up to one thread per partition for
WHERE evaluations). Separate data partitions also enable the engine
to process the data without the overhead of file access contention
between the threads. Because each partition is one file, the trade-off for a small partition size is that an increased number of files (for example, UNIX i-nodes) is required to store the observations.
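As an illustration, here is a minimal SPD Engine sketch (the library name and all paths are hypothetical placeholders) that creates a data set with a fixed partition size:

   /* Hypothetical SPD Engine library; the paths are placeholders. */
   libname spdlib spde '/spde/meta'
      datapath=('/spde/data1' '/spde/data2');

   /* PARTSIZE=128 requests 128-megabyte partitions. The value is
      fixed when the data set is created. */
   data spdlib.sales (partsize=128);
      set work.sales;
   run;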
The partition size determines
a unit of work for many of the parallel operations that require full
data set scans. But having more partitions does not always mean faster
processing. The trade-offs involve balancing the increased number
of physical files (partitions) that are required to store the data
set against the amount of work that can be done in parallel by having
more partitions. More partitions mean more open files to process the data set, but fewer rows in each partition.
A general rule is to
have 10 or fewer partitions per data path, and 3 to 4 partitions per
CPU. (Some operating systems have a limit on the number of open files
that you can use.)
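For example, consider a hypothetical 64-gigabyte table stored across 4 data paths on a machine with 8 CPUs. The guideline of 10 or fewer partitions per data path caps the table at 40 partitions, and 3 to 4 partitions per CPU suggests 24 to 32 partitions. Setting PARTSIZE= to 2 gigabytes produces 32 partitions (64 GB / 2 GB), or 8 per data path and 4 per CPU, which satisfies both guidelines.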
To determine an adequate partition size for a new table, you should be aware of the following:
- the types of applications that run against the data
- how many CPUs are available to the applications
- which disks are available for storing the partitions
- the relationships of these disks to the CPUs
For example, if each
CPU controls only one disk, then an appropriate partition size would
be one in which each disk contains approximately the same amount of
data. If each CPU controls two disks, then an appropriate partition size would be one in which the load is balanced so that each CPU does approximately the same amount of work.
Ultimately, the scalability limits of PARTSIZE= depend on how the DATAPATH= storage is structured.
For information about the DATAPATH= for SPD Engine data sets, see SAS Scalable Performance Data Engine: Reference. For information about the DATAPATH= for SPD Server
tables, see SAS Scalable Performance Data Server: Administrator’s Guide. Specifically, the limits depend on how you configure
and spread the DATAPATH= file systems across striped volumes. You
should spread each individual volume's striping configuration
across multiple disk controllers or SCSI channels in the disk storage
array. The goal for the configuration is, at the hardware level, to
maximize parallelism during data retrieval.
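For example, a sketch of a LIBNAME statement (the paths are hypothetical) that spreads the data component files across four file systems, each ideally mapped to a separate controller or channel:

   /* Each data path should reside on a separately striped volume. */
   libname sales spde '/spde/meta'
      datapath=('/disk1/sales' '/disk2/sales'
                '/disk3/sales' '/disk4/sales')
      partsize=256;   /* 256-megabyte partitions for the library */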
The PARTSIZE= specification
is limited by the MINPARTSIZE= system option. MINPARTSIZE= ensures
that an over-zealous SAS user does not create arbitrarily small partitions,
thereby generating a large number of files. The default MINPARTSIZE=
value for both SPD Engine data sets and SPD Server tables is 16 megabytes
and probably should not be lowered much below this value. Many SPD
Server sites specify a higher MINPARTSIZE= value in their server parameter
file. For SPD Server sites that work with large tables, a MINPARTSIZE= setting of 1 gigabyte is recommended.
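In the server parameter file, such a setting is a single line among the site's other options. A sketch (the value follows the 1-gigabyte recommendation, expressed in megabytes):

   MINPARTSIZE=1024M;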
Note: The PARTSIZE= value for a data set cannot be changed after the data set is created. To change
the PARTSIZE=, you must delete the table with the DROP TABLE statement
and create it again with the CREATE TABLE statement.
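A minimal PROC SQL sketch of that workflow (the library and table names are hypothetical), copying the data into a table with a new partition size and then dropping the original:

   proc sql;
      /* Re-create the table with a 512-megabyte partition size. */
      create table spdlib.sales2 (partsize=512) as
         select * from spdlib.sales;
      /* Remove the original table. */
      drop table spdlib.sales;
   quit;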