Concepts in Data Preparation

Source Data

The data preparation interface enables administrators to access a wide variety of data sources for input to the data preparation process. The SAS/ACCESS interfaces can be used to interact with database management systems (DBMS). SAS data sets and the SAS Scalable Performance Data Server can also be used as source data. All of these sources provide input for the analytic processing that is then performed by the SAS LASR Analytic Server. The source data must be registered in SAS metadata as libraries and tables. For information about registering libraries and tables, see the SAS Intelligence Platform: Data Administration Guide.
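As an illustration, a SAS/ACCESS LIBNAME statement such as the following sketch can make a DBMS schema available as a library that is then registered in metadata. The engine shown (Teradata) is one example, and the server, credentials, database, and libref are placeholders, not values from this document:

   /* Assign a library to a DBMS schema with a SAS/ACCESS engine.       */
   /* The server, user, password, and database values are placeholders. */
   libname srcdata teradata server="tdprod" user=sasdemo
                   password="xxxxxx" database=sales;

   /* The library and its tables can then be registered in SAS metadata */
   /* so that they are available as source data.                        */

The same pattern applies to any licensed SAS/ACCESS engine; only the engine name and connection options change.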

Data Output

The unique value of the data preparation interface is that it enables administrators to add data to a data provider that is co-located with the SAS LASR Analytic Server. SAS Visual Analytics Hadoop is a co-located data provider that is available from SAS. In this case, the Hadoop Distributed File System (HDFS) is used for data output. The advantage of adding data to a co-located data provider is that the server can read the data in parallel at very high rates. In addition to SAS Visual Analytics Hadoop, the Teradata Enterprise Data Warehouse and the EMC Greenplum Data Computing Appliance are supported as co-located data providers. When a third-party vendor appliance is used, the SAS/ACCESS interface for that database must be licensed and configured.
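The following sketch shows one way that output to co-located HDFS can be expressed in SAS code with the SASHDAT engine. Whether the data preparation interface generates exactly this code is not stated in this document; the host name, HDFS path, TKGrid installation directory, librefs, and table name are placeholders:

   /* Assign a library that writes SASHDAT files to HDFS on the cluster. */
   /* The host name, HDFS path, and install location are placeholders.   */
   libname hdfsout sashdat host="grid001.example.com"
                           install="/opt/TKGrid"
                           path="/dept/sales";

   /* Copy a source table to HDFS so that the SAS LASR Analytic Server   */
   /* can later read it in parallel.                                      */
   data hdfsout.orders;
      set srcdata.orders;
   run;

For a third-party appliance such as Teradata or Greenplum, the output library would instead be assigned with the corresponding SAS/ACCESS engine.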

Output to SAS Visual Analytics Hadoop

In addition to the performance advantage of co-located data, SAS Visual Analytics Hadoop provides data redundancy. By default, two copies of the data are stored in HDFS. If a machine in the cluster becomes unavailable, another machine in the cluster retrieves the data from a redundant block and loads it into memory.
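If a different level of redundancy is wanted, the SASHDAT engine accepts a COPIES= data set option that controls how many redundant copies of the blocks are written. The value below is illustrative only; omitting the option leaves the engine default in effect:

   /* Request redundant copies of the blocks when writing to HDFS.        */
   /* COPIES= controls replication beyond the original blocks; the value  */
   /* 1 is illustrative, and omitting the option uses the default.        */
   data hdfsout.orders (copies=1);
      set srcdata.orders;
   run;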
The data preparation interface distributes blocks evenly to the machines in the cluster so that all the machines acting as the server have an even workload. The block size is also optimized based on the number of machines in the cluster and the size of the data that is being stored. Before the data is transferred to HDFS, SAS software determines the number of machines in the cluster, row length, and the number of rows in the data. Using this information, SAS software calculates an optimal block size to provide an even distribution of data. However, the block size is bounded by a minimum size of 1 KB and a maximum size of 64 MB.
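The exact calculation is internal to SAS software, but the documented bounds can be sketched as follows. The even split of the total data volume across machines is an illustrative assumption, not the actual SAS algorithm, and the machine count, row length, and row count are example values:

   /* Sketch of a block-size calculation that respects the documented     */
   /* 1 KB minimum and 64 MB maximum. The even split across machines is   */
   /* an illustrative assumption, not the actual SAS algorithm.           */
   data _null_;
      machines   = 16;       /* machines in the cluster (example) */
      row_length = 200;      /* bytes per row (example)           */
      rows       = 5e7;      /* number of rows (example)          */

      total_bytes = row_length * rows;
      block_size  = ceil(total_bytes / machines);

      /* Clamp to the documented bounds. */
      block_size = max(block_size, 1024);           /* 1 KB minimum  */
      block_size = min(block_size, 64*1024*1024);   /* 64 MB maximum */

      put "Illustrative block size (bytes): " block_size comma20.;
   run;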
For very small data sets, the data is not distributed evenly. It is transferred to the root node of the cluster and then inserted into SAS Visual Analytics Hadoop. SAS Visual Analytics Hadoop then distributes the data in blocks according to the default block distribution algorithm.