When
you specify a SAS LASR Analytic Server
or SAS Data in HDFS
library as an output library, you can specify a partition key for the table. You can select
a column to use from the
Partition by menu.
Partitioning uses the
formatted values of the partition key to group rows that have the same value for the key. All of the
rows that have the same value for the key are loaded to a single machine in the
cluster. For SAS LASR Analytic Server
libraries, this means that the rows that have the same value for the key are in memory on one
machine. For SAS Data in HDFS
libraries, all of the rows that have the same value for the key are
written to a single file block on one machine. (The block is replicated
to other machines for redundancy.) When the partitioned table is loaded
onto a server, the partitioning remains when it is in memory.
If you select a partition key and also
specify sort options for columns on the
Column Editor tab, the sort options are passed to the current
engine in an ORDERBY= option. This enhancement applies to SAS LASR Analytic Server
and SAS Data in HDFS libraries and can improve performance once the data is in memory.
When you specify a partition
key, avoid using a variable that has few unique values. For example,
partitioning by a flag column that is Boolean results in all rows
on two machines because only two values are available. At the other
end of the spectrum, partitioning large tables by a nearly unique
key results in many partitions that have few rows.
Determining the optimal
partition key can be a challenging task. However, as an example, if
you tend to access data based on a customer ID, then you might improve
performance by partitioning the data by customer.