The unique value of the data preparation interface is that it enables administrators to add data to the Hadoop Distributed File System (HDFS) that is co-located with the SAS LASR Analytic Server on the machines in the grid. Adding the data to HDFS enables the server to read it in parallel, at very high rates, from a co-located data provider: each machine reads its local blocks directly instead of pulling data across the network.
In addition to the performance advantage of using co-located data, HDFS provides data redundancy. By default, two copies of the data are stored in HDFS. If a machine in the grid becomes unavailable, another machine retrieves the data from a redundant block and loads it into memory.
The blocks are distributed so that all the machines acting as the server carry an even workload. The block size is also optimized based on the number of machines in the grid and the size of the data that is being stored. Before the data is transferred to HDFS, SAS software determines the number of machines in the grid, the row length, and the number of rows in the data. From this information, it calculates an optimal block size that yields an even distribution of data, bounded by a minimum of 1 KB and a maximum of 64 MB.
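
As a concrete illustration, the following Python sketch computes a block size under these constraints. The 1 KB and 64 MB bounds come from the documentation above; the even-share heuristic and the function name block_size are assumptions for illustration only, not the actual SAS calculation.

    import math

    MIN_BLOCK = 1024          # 1 KB minimum, per the documentation
    MAX_BLOCK = 64 * 1024**2  # 64 MB maximum, per the documentation

    def block_size(n_machines: int, row_length: int, n_rows: int) -> int:
        """Hypothetical heuristic: choose a block size that gives each
        machine in the grid an even share of the data. This is a sketch
        of the idea, not the actual SAS algorithm."""
        total_bytes = row_length * n_rows
        # Even share of the data per machine.
        target = math.ceil(total_bytes / n_machines)
        # Round up to a whole number of rows so no row spans two blocks.
        rows_per_block = max(1, math.ceil(target / row_length))
        size = rows_per_block * row_length
        # Clamp to the documented 1 KB to 64 MB range.
        return min(max(size, MIN_BLOCK), MAX_BLOCK)
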
For very small data sets, the data is not distributed evenly. It is transferred to the root node of the grid and then inserted into SAS Visual Analytics Hadoop, which distributes the data in blocks according to the default block distribution algorithm.
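
Applying the hypothetical block_size sketch above to a very small table shows why even distribution breaks down: the computed size falls below the 1 KB floor and is clamped, leaving fewer blocks than machines.

    # A 16-machine grid and a tiny table: 200 rows of 40 bytes = 8,000 bytes.
    # An even split would need roughly 500-byte blocks, below the 1 KB floor,
    # so the size is clamped to 1 KB and only about eight blocks result,
    # too few to cover all 16 machines evenly.
    print(block_size(n_machines=16, row_length=40, n_rows=200))  # prints 1024
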