The
best performance for reading data into memory on the SAS LASR Analytic Server
occurs when the server is co-located with the distributed data and
the data is distributed evenly. The OLIPHANT procedure distributes
the data such that parallel read performance by the SAS LASR Analytic Server
is maximized. The distribution also ensures an even workload
for query activity performed by the SAS LASR Analytic Server.
In order to produce
an even distribution of data, it is important to understand that Hadoop
stores data in blocks and that any block that contains data occupies
the full size of the block on disk. The default block size is 32 megabytes
and blocks are padded to reach the block size after the data is written.
The data is distributed among the machines in the cluster in round-robin
fashion. To conserve disk space, you can specify a block
size that minimizes the padding.
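
The relationship between block size, padding, and round-robin placement can be
sketched in a few lines of Python. This is only an illustration under a
simplified model, not a description of how HDFS or the OLIPHANT procedure is
implemented: it assumes that data fills blocks contiguously, that only the final
block is padded, and that block i is placed on machine i modulo the number of
machines. The function name block_layout and its parameters are hypothetical.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def block_layout(data_bytes, block_bytes, machines):
    """Return (blocks, padding_bytes, blocks_per_machine) under a simplified model.

    Assumptions (not taken from the product documentation):
    data fills blocks contiguously, only the last block is padded,
    and blocks are assigned to machines in strict round-robin order.
    """
    # Integer ceiling division: every block that holds any data counts in full.
    blocks = (data_bytes + block_bytes - 1) // block_bytes
    # Space on disk that holds padding rather than data.
    padding_bytes = blocks * block_bytes - data_bytes
    # Round-robin placement: block i goes to machine i % machines.
    blocks_per_machine = [0] * machines
    for i in range(blocks):
        blocks_per_machine[i % machines] += 1
    return blocks, padding_bytes, blocks_per_machine


# Toy input: 100 MB of data, 32 MB blocks, 3 machines.
blocks, padding_bytes, blocks_per_machine = block_layout(100 * MB, 32 * MB, 3)
print(blocks, padding_bytes // MB, blocks_per_machine)
```

With these toy numbers, the data needs four blocks, the last of which is mostly
padding, and one machine holds one block more than the others. A block size that
divides the data more evenly reduces both effects.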
It is important to know
the size of the input data set, specifically the row count and the length
of a row. This information, along with the number of machines in the
cluster, can be used to set a block size that distributes the blocks
evenly on the machines in the cluster and uses the space in the blocks
efficiently.
For example, if the input data set is approximately 25 million rows with a row length
of 1300 bytes, then the data set is approximately 30 gigabytes. If the hardware is
a cluster of 16 machines, and 15 machines are used to provide HDFS storage, then storing
approximately 2 gigabytes (30 gigabytes divided by 15 machines) on each machine is optimal. In this case, a BLOCKSIZE=
setting of 32 megabytes or 64 megabytes would fill the overwhelming majority of blocks
with data and reduce the space that is wasted by padding.
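
The arithmetic in this example can be checked with a short Python sketch. It
assumes binary units (1 megabyte = 1024 * 1024 bytes), assumes that only the
final block of the data set is padded, and assumes strict round-robin placement
of blocks. The function name summarize and the dictionary keys are hypothetical
and are used only for illustration.

```python
MB = 1024 ** 2  # assumption: binary megabytes
GB = 1024 ** 3  # assumption: binary gigabytes

ROWS = 25_000_000        # approximate row count from the example
ROW_LENGTH = 1300        # bytes per row from the example
STORAGE_MACHINES = 15    # machines that provide HDFS storage

data_bytes = ROWS * ROW_LENGTH


def summarize(data_bytes, block_mb, machines):
    """Summarize block usage for one candidate block size (simplified model)."""
    block_bytes = block_mb * MB
    # Every block that holds any data occupies a full block on disk.
    blocks = (data_bytes + block_bytes - 1) // block_bytes
    padding_bytes = blocks * block_bytes - data_bytes
    # Round-robin: the first (blocks % machines) machines get one extra block.
    base, extra = divmod(blocks, machines)
    return {
        "blocks": blocks,
        "padding_bytes": padding_bytes,
        "min_blocks_per_machine": base,
        "max_blocks_per_machine": base + (1 if extra else 0),
    }


print(f"data set size: {data_bytes / GB:.2f} GB")
print(f"per machine:   {data_bytes / STORAGE_MACHINES / GB:.2f} GB")

results = {mb: summarize(data_bytes, mb, STORAGE_MACHINES) for mb in (32, 64)}
for mb, r in results.items():
    print(f"BLOCKSIZE {mb} MB: {r['blocks']} blocks, "
          f"{r['padding_bytes'] / MB:.1f} MB padding, "
          f"{r['min_blocks_per_machine']}-{r['max_blocks_per_machine']} blocks per machine")
```

Under these assumptions, a 32-megabyte block size yields 969 blocks and a
64-megabyte block size yields 485 blocks. In both cases only the final block
contains padding, and no machine holds more than one block beyond its neighbors.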