In addition to the performance
advantage of co-located data, SAS Visual Analytics Hadoop
provides data redundancy. By default, two copies of the
data are stored in HDFS. If a machine in the cluster becomes unavailable,
another machine in the cluster retrieves the data from a redundant
block and loads the data into memory.
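The two-copy default corresponds to an HDFS replication factor of 2. As an illustration only, the standard HDFS property `dfs.replication` in `hdfs-site.xml` expresses this setting; whether SAS Visual Analytics Hadoop configures it through this file is an assumption:

```xml
<!-- Sketch: a replication factor of 2 keeps two copies of each block.
     dfs.replication is a standard HDFS property; how SAS applies this
     default is not specified in this document. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```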
SAS Visual Analytics Administrator
distributes blocks evenly to the machines in the cluster so that all
the machines acting as the server share the workload evenly. The block
size is also optimized based on the number of machines in the cluster
and the size of the data that is being stored. Before the data is
transferred to HDFS, SAS software determines the number of machines
in the cluster, the row length, and the number of rows in the data. Using
this information, SAS software calculates an optimal block size to
provide an even distribution of data. However, the block size is bounded
by a minimum size of 1 KB and a maximum size of 64 MB.
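The calculation above can be sketched as follows. This is a hypothetical heuristic, not SAS's actual formula: only the inputs (machine count, row length, row count) and the 1 KB / 64 MB bounds come from the documentation; the one-block-per-machine target and the whole-row rounding are assumptions.

```python
import math

MIN_BLOCK = 1024               # 1 KB lower bound (from the documentation)
MAX_BLOCK = 64 * 1024 * 1024   # 64 MB upper bound (from the documentation)

def optimal_block_size(num_machines: int, row_length: int, num_rows: int) -> int:
    """Hypothetical block-size heuristic for an even distribution of data."""
    total_bytes = row_length * num_rows
    # Assumed target: roughly one block per machine for an even workload.
    candidate = math.ceil(total_bytes / num_machines)
    # Assumed refinement: round up to a whole number of rows per block.
    candidate = math.ceil(candidate / row_length) * row_length
    # Clamp to the documented bounds.
    return max(MIN_BLOCK, min(candidate, MAX_BLOCK))
```

For example, 1,000,000 rows of 200 bytes on a 16-machine cluster yields a 12.5 MB block, while a tiny table is pushed up to the 1 KB floor and a very large one is capped at 64 MB.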
For very small data
sets, the data is not distributed evenly in this way. Instead, it is transferred
to the root node of the cluster and then inserted into SAS Visual Analytics
Hadoop, which distributes the data in blocks
according to the default block distribution algorithm.
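The even block placement described above can be pictured as a simple round-robin assignment. This sketch is illustrative only: the real HDFS placement policy also accounts for replication and rack topology, which are ignored here, and the function and names are hypothetical.

```python
from itertools import cycle

def distribute_blocks(block_ids, machines):
    """Hypothetical round-robin placement: deal blocks to machines in turn
    so that no machine holds more than one block above any other."""
    assignment = {m: [] for m in machines}
    for block, machine in zip(block_ids, cycle(machines)):
        assignment[machine].append(block)
    return assignment

# Seven blocks across three machines differ by at most one block per machine.
placement = distribute_blocks(range(7), ["node1", "node2", "node3"])
```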