In addition to the performance advantage of co-located data, SAS Visual Analytics Hadoop provides data redundancy. By default, two copies of the data are stored in HDFS. If a machine in the cluster becomes unavailable, another machine in the cluster retrieves the data from a redundant block and loads it into memory.
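The failover behavior can be pictured with a small sketch. The following Python fragment is illustrative only: it assumes a block-to-replica map with two copies of each block, matching the default replication described above, and the names and data structures are hypothetical rather than a SAS or HDFS API.

    # Hypothetical replica map: two copies of each block (assumption
    # matching the default replication described above).
    replicas = {"block-0001": ["node3", "node7"]}
    available = {"node3": False, "node7": True}   # node3 is down

    def node_for_block(block_id):
        """Return a live node that still holds a copy of the block."""
        for node in replicas[block_id]:
            if available.get(node):
                return node
        raise RuntimeError("no live replica for " + block_id)

    print(node_for_block("block-0001"))   # node7 serves the redundant copy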
The data preparation interface distributes blocks evenly across the machines in the cluster so that all the machines acting as the server carry an even workload. The block size is also optimized based on the number of machines in the cluster and the size of the data that is being stored. Before the data is transferred to HDFS, SAS software determines the number of machines in the cluster, the row length, and the number of rows in the data. From this information, SAS software calculates an optimal block size that provides an even distribution of data. However, the block size is bounded by a minimum of 1 KB and a maximum of 64 MB.
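The calculation can be sketched as follows. SAS does not document the exact formula, so the heuristic below (roughly one block per machine, with whole rows per block) is an assumption; only the 1 KB and 64 MB bounds come from the text, and the function name is hypothetical.

    MIN_BLOCK = 1024            # 1 KB lower bound (from the text)
    MAX_BLOCK = 64 * 1024 ** 2  # 64 MB upper bound (from the text)

    def optimal_block_size(machine_count, row_length, row_count):
        """Pick a block size that spreads the data evenly (illustrative)."""
        total_bytes = row_length * row_count
        # Aim for roughly one block per machine (assumed heuristic).
        target = max(total_bytes // machine_count, 1)
        # Keep whole rows per block so no row straddles two blocks
        # (assumption; HDFS itself does not know about rows).
        rows_per_block = max(target // row_length, 1)
        return max(MIN_BLOCK, min(rows_per_block * row_length, MAX_BLOCK))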
For very small data sets, the data is not distributed evenly. Instead, it is transferred to the root node of the cluster and then inserted into SAS Visual Analytics Hadoop, which distributes the data in blocks according to the default block distribution algorithm.
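Continuing the sketch above, a tiny table shows why even distribution breaks down: the computed block size falls below the 1 KB floor and clamps to it, so the data occupies only a handful of blocks.

    # 100 rows of 80 bytes (8,000 bytes total) on a 16-machine cluster:
    # roughly 500 bytes per block, well below the 1 KB minimum, so the
    # result clamps up to 1,024 bytes.
    print(optimal_block_size(16, 80, 100))          # 1024

    # A large table clamps the other way, at the 64 MB ceiling.
    print(optimal_block_size(16, 200, 10_000_000))  # 67108864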