When you specify a SAS LASR Analytic Server or SAS Data in HDFS library as an output library, you can specify a partition key for the table. Select the column to use from the Partition by menu.

The feature uses the formatted values of the partition key to group rows that have the same value for the key. All the rows that have the same value for the key are loaded to a single machine in the cluster. For SAS LASR Analytic Server libraries, this means that the rows with the same key are in memory on one machine. For SAS Data in HDFS libraries, all the rows with the same key are written to a single file block on one machine. (The block is replicated to other machines for redundancy.) When the table is loaded into a server, the partitioning is preserved in memory.

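
The grouping described above can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function name partition_rows, the fmt parameter, and the rule that places a partition on a machine (a CRC-32 of the formatted value modulo the machine count) are all assumptions made for the example.

```python
from collections import defaultdict
from zlib import crc32


def partition_rows(rows, key, fmt=str, n_machines=4):
    """Group rows by the formatted value of `key` and pick one machine per group.

    `fmt` must return a string. The placement rule is hypothetical.
    """
    partitions = defaultdict(list)
    for row in rows:
        # Rows are grouped by the formatted value, not the raw value.
        partitions[fmt(row[key])].append(row)
    # Assumed placement rule: stable hash of the formatted value.
    placement = {
        value: crc32(value.encode("utf-8")) % n_machines for value in partitions
    }
    return dict(partitions), placement


rows = [
    {"customer": "A17", "amount": 10.0},
    {"customer": "B02", "amount": 5.5},
    {"customer": "A17", "amount": 7.25},
    {"customer": "A99", "amount": 3.0},
]

# Partition by the value as written.
by_customer, placement = partition_rows(rows, "customer")

# Partition by a formatted value: only the first character of the ID.
by_prefix, _ = partition_rows(rows, "customer", fmt=lambda v: v[0])
```

In the second call, A17 and A99 have different raw values but the same formatted value, so their rows end up in one partition. That is the practical consequence of grouping by formatted values.
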
If you select a partition key and also specify sort options for columns on the Column Editor tab, the sort options are passed to the engine in an ORDERBY= option. This enhancement applies to SAS LASR Analytic Server and SAS Data in HDFS libraries and can improve performance once the data is in memory.

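
As a rough illustration of combining a partition key with sort options, the following Python sketch groups rows by key and then sorts the rows inside each group. It assumes that the ORDERBY= option orders rows within each partition; the function name partition_and_order and the sample columns are hypothetical.

```python
from collections import defaultdict
from operator import itemgetter


def partition_and_order(rows, partition_key, order_by):
    """Group rows by `partition_key`, then sort each group by the `order_by` columns.

    `order_by` is a non-empty list of column names.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[str(row[partition_key])].append(row)
    for group in partitions.values():
        # Assumption: ordering is applied inside a partition, not across partitions.
        group.sort(key=itemgetter(*order_by))
    return dict(partitions)


rows = [
    {"customer": "A17", "order_date": "2014-03-09", "amount": 10.0},
    {"customer": "B02", "order_date": "2014-01-15", "amount": 5.5},
    {"customer": "A17", "order_date": "2014-01-02", "amount": 7.25},
]

ordered = partition_and_order(rows, "customer", ["order_date"])
a17_dates = [row["order_date"] for row in ordered["A17"]]
```

With the rows for one key stored together and already in order, a request that reads one customer's orders by date does not need a separate sort.
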
When you specify a partition key, avoid using a variable that has few unique values. For example, partitioning by a Boolean flag column places all rows on at most two machines because only two values are available. At the other end of the spectrum, partitioning a large table by a nearly unique key results in many partitions that each have few rows.

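
One way to compare candidate keys before choosing is to count how many partitions each would produce and how large they would be. The Python sketch below is a generic profiling helper, not a SAS feature; the function name profile_partition_key and the sample data are made up for the example.

```python
from collections import Counter


def profile_partition_key(rows, key, fmt=str):
    """Summarize the partitions that `key` would produce for `rows`."""
    sizes = Counter(fmt(row[key]) for row in rows)
    n_rows = sum(sizes.values())
    return {
        "rows": n_rows,
        "partitions": len(sizes),
        "largest": max(sizes.values(), default=0),
        "mean_size": n_rows / len(sizes) if sizes else 0.0,
    }


# Sample table: a Boolean flag, a customer ID with 50 values, and a unique row ID.
rows = [
    {"id": i, "flag": i % 2 == 0, "customer": f"C{i % 50:03d}"}
    for i in range(1000)
]

profiles = {
    column: profile_partition_key(rows, column)
    for column in ("flag", "customer", "id")
}
for column, profile in profiles.items():
    print(column, profile)
```

For this sample data, the flag column yields 2 partitions of 500 rows each, the unique ID yields 1000 partitions of one row each, and the customer column falls between the two with 50 partitions of 20 rows.
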
Determining the optimal partition key can be a challenging task. As an example, if you tend to access data based on a customer ID, then you might improve performance by partitioning the data by customer.