The unique value of the data preparation interface is that it enables administrators to add data to the Hadoop Distributed File System (HDFS) that is co-located with the SAS LASR Analytic Server on the machines in the grid. Adding the data to HDFS enables the server to read it in parallel, at very high rates, from a co-located data provider: each machine reads its local blocks directly instead of pulling data across the network.
In addition to the performance advantage of using co-located data, HDFS provides data redundancy. By default, two copies of the data are stored in HDFS. If a machine in the grid becomes unavailable, another machine retrieves the data from a redundant block and loads it into memory.
The blocks are distributed so that all the machines acting as the server carry an even workload. The block size is also optimized based on the number of machines in the grid and the size of the data that is being stored. Before the data is transferred to HDFS, SAS software determines the number of machines in the grid, the row length, and the number of rows in the data. From this information, it calculates an optimal block size that yields an even distribution of data, bounded by a minimum of 1 KB and a maximum of 64 MB.
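
As a concrete illustration, the following Python sketch computes a block size under these constraints. The 1 KB and 64 MB bounds come from the documentation above; the even-share heuristic and the function name block_size are assumptions for illustration only, not the actual SAS calculation.

    import math

    MIN_BLOCK = 1024          # 1 KB minimum, per the documentation
    MAX_BLOCK = 64 * 1024**2  # 64 MB maximum, per the documentation

    def block_size(n_machines: int, row_length: int, n_rows: int) -> int:
        """Hypothetical heuristic: choose a block size that gives each
        machine in the grid an even share of the data. This is a sketch
        of the idea, not the actual SAS algorithm."""
        total_bytes = row_length * n_rows
        # Even share of the data per machine.
        target = math.ceil(total_bytes / n_machines)
        # Round up to a whole number of rows so no row spans two blocks.
        rows_per_block = max(1, math.ceil(target / row_length))
        size = rows_per_block * row_length
        # Clamp to the documented 1 KB to 64 MB range.
        return min(max(size, MIN_BLOCK), MAX_BLOCK)
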
For very small data sets, the data is not distributed evenly. It is transferred to the root node of the grid and then inserted into SAS Visual Analytics Hadoop, which distributes the data in blocks according to the default block distribution algorithm.
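
Applying the hypothetical block_size sketch above to a very small table shows why even distribution breaks down: the computed size falls below the 1 KB floor and is clamped, leaving fewer blocks than machines.

    # A 16-machine grid and a tiny table: 200 rows of 40 bytes = 8,000 bytes.
    # An even split would need roughly 500-byte blocks, below the 1 KB floor,
    # so the size is clamped to 1 KB and only about eight blocks result,
    # too few to cover all 16 machines evenly.
    print(block_size(n_machines=16, row_length=40, n_rows=200))  # prints 1024
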