When
you specify a SAS LASR Analytic Server
or SAS Data in HDFS library as an output library,
you can specify a partition key for the table. You can select a column
to use from the
Partition by menu.
Partitioning
uses the formatted values of the partition key to group rows that
have the same value for the key. All of the rows that have the same
value for the key are loaded to a single machine in the cluster. For SAS LASR Analytic Server
libraries, this means that the rows that have the same value for the
key are in memory on one machine. For SAS Data in HDFS
libraries, all of the rows that have the same value for the key are
written to a single file block on one machine. (The block is replicated
to other machines for redundancy.) When the partitioned table is loaded
onto a server, the partitioning remains when it is in memory.
If you select a partition key and also
specify sort options for columns on the
Column Editor
tab, the sort options are passed to the engine in an ORDERBY= option.
This enhancement applies to SAS LASR Analytic Server
and SAS Data in HDFS libraries and can improve
performance once the data is in memory.
When you specify a partition
key, avoid using a variable that has few unique values. For example,
partitioning by a flag column that is Boolean results in all rows
on two machines because only two values are available. At the other
end of the spectrum, partitioning large tables by a nearly unique
key results in many partitions that have few rows.
Determining the optimal
partition key can be a challenging task. However, as an example, if
you tend to access data based on a customer ID, then you might improve
performance by partitioning the data by customer.