Allocating the Library Space :: SAS® 9.4 Scalable Performance Data Engine: Reference, Fourth Edition

How to Allocate the Library Space

To realize performance gains through SPD Engine’s partitioned data read and threading capabilities, the SPD Engine libraries must be properly configured and managed. Optimally, a SAS system administrator performs these tasks.

An SPD Engine data set requires a file system with enough space to store the various component files. Often that file system includes multiple directories for these components. Usually, a single directory path (part of a file system) is constrained by a volume limit for the file system as a whole. This limit is the maximum amount of disk space configured for the file system to use.

Within this maximum disk space, you must allocate enough space for all of the SPD Engine component files. Understanding how each component file is handled is critical to estimating the amount of storage that you need in each library.

Configuring Space for All Components in a Single Path

In the simplest SPD Engine library configuration, all of the SPD Engine component files (data files, metadata files, and index files) can reside in a single path called the primary path. The primary path is the default path specification in the LIBNAME statement. The following LIBNAME statement sets up the primary file system for the MyLib library:

libname mylib spde '/disk1/spdedata';

Because there are no other path options specified, all component files are created in this primary path. Storing all component file types in the primary path is simple and works for very small data sets. It does not take advantage of the performance boost that storing components separately can achieve, nor does it take advantage of multiple CPUs.

Note: The SPD Engine requires complete pathnames to be specified.

Configuring Separate Library Space for Each Component File Type

Most sites use the SPD Engine to manage very large amounts of data, which can have thousands of variables, some of them indexed. At these sites, separate storage paths are usually defined for the various component types. In addition, using disk striping and RAID (Redundant Array of Independent Disks) can be very efficient. For more information, see “SPD Engine Disk I/O Setup” in Scalability and Performance at http://support.sas.com/rnd/scalability/spde/setup.html.

The metadata component files for all data sets in a library must reside in the primary path.

In addition, specifying separate paths for the data component files and index component files provides performance gains because the Read load is distributed across disk drives. Separating the data and index component files helps prevent disk contention and increases the level of parallelism that can be achieved, especially in complex WHERE evaluations. The following example code specifies a primary path for the metadata component files. The code uses the DATAPATH= option and the INDEXPATH= option to specify additional, separate paths for the data and index component files:

libname all_users spde '/disk1/metadata'  
   datapath= ('/disk2/userdata' '/disk3/userdata')  
   indexpath= ('/disk4/userindexes' '/disk5/userindexes');

The metadata component files are stored on disk1, which is the primary path. The data component files are on disk2 and disk3, and the index component files are on disk4 and disk5. For all path specifications, you must specify the complete pathname.

CAUTION:

The primary path must be unique for each library.

If two librefs are created with the same primary path but with differences in the other paths, data can be lost. You cannot use NFS in any path other than the primary path.

Note: If you are planning to store data in locally mounted drives and to access the data from a remote computer, use the remote pathname when you specify the LIBNAME statement. For example, if /data01 and /data02 are locally mounted drives on the localA computer, use the pathnames /nfs/localA/data01 and /nfs/localA/data02 in the LIBNAME statement.

Anticipating the Space for Each Component File

To properly configure the SPD Engine library space, you need to understand the relative sizes of the SPD Engine component files. The following information provides a general overview. For more information, see “SPD Engine Disk I/O Setup” in Scalability and Performance at http://support.sas.com/rnd/scalability/spde/setup.html.

Metadata component files are relatively small files, but the primary path that you specify must be large enough to contain all of the metadata component files for the library. Metadata component files cannot grow beyond the space available in the path.

Index component files (both .idx and .hbx) can be medium to large, depending on the number of distinct values in each index and whether the index is a single or composite index. When an index component file grows beyond the space available in the current file path, a new index component file is created in the next path.

Data component files can be numerous, depending on the amount of data and the partition size specified for the data set. Each data partition is stored as a separate data component file. The size of the data partition is specified in the PARTSIZE= LIBNAME statement option or in the PARTSIZE= data set option. Data component files are the only component files for which you can specify a partition size.
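The two ways of setting the partition size described above can be sketched as follows. This is a minimal illustration, not a sizing recommendation: the library path is reused from the earlier example, and the WORK.SALES input data set and the sizes 256 and 512 megabytes are hypothetical values chosen only to show the syntax.

```sas
/* Library-level setting: data sets created in MyLib use   */
/* 256-megabyte data partitions unless overridden.         */
libname mylib spde '/disk1/spdedata' partsize=256m;

/* Data set-level setting: this one data set uses          */
/* 512-megabyte partitions. WORK.SALES is a hypothetical   */
/* input data set.                                         */
data mylib.sales (partsize=512m);
   set work.sales;
run;
```

Because the partition size is set when a data set is created, changing either option later affects only data sets that are created afterward.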

Storage of the Metadata Component Files

Metadata Component Files

The metadata component file for an SPD Engine data set stores the descriptive information about the data set and the pathnames to its constituent data component files and index component files. This concept is very important to understand because it directly affects whether you can add data sets (with their associated metadata component files) to the library.

The metadata component files for all of the data sets in a library must reside in the same location specified in the primary path. In effect, the files in the primary path act like a directory to the entire library. When an SPD Engine data set is accessed, the SPD Engine first opens the data set’s metadata component file to determine its attributes and to determine whether it can access all of its other component files. If a new data set for a library is created, and the space in the primary path is full, the SPD Engine cannot begin creating the metadata component file in that path, and the Create operation fails with an appropriate error message. To successfully create a new data set in this case, you must either free space in the primary path or assign a new library and copy some or all of the data sets to the new library. Data component files and index component files do not have that limitation. You can specify additional space at a later time for data component files and index component files.

Certain actions cause metadata component files to grow beyond the file size or space limitations. In that case, the SPD Engine creates another partition of the metadata component file to accommodate the overflow. New metadata partitions can reside in the primary path or in the paths specified in the METAPATH= LIBNAME Statement Option. You cannot use the METAPATH= option to create space for a new data set’s first metadata partition. The METAPATH= option specifies space only for metadata partitions beyond the first one.
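As a sketch of how overflow space for metadata might be declared, the following LIBNAME statement adds a METAPATH= option to the earlier All_Users example. The /disk6/metaoverflow path is a hypothetical name introduced here for illustration.

```sas
/* The primary path (/disk1/metadata) still holds the first     */
/* metadata partition of every data set in the library.         */
/* METAPATH= supplies space only for later metadata partitions. */
/* /disk6/metaoverflow is a hypothetical path.                   */
libname all_users spde '/disk1/metadata'
   metapath=('/disk6/metaoverflow')
   datapath=('/disk2/userdata' '/disk3/userdata')
   indexpath=('/disk4/userindexes' '/disk5/userindexes');
```

If the primary path itself is full, this statement does not help with creating new data sets, because the first metadata partition of a new data set must be created in the primary path.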

Storage of the Index Component Files

An index component file is stored based on overflow space. When an index component file grows to exceed the file size or space limitations, the SPD Engine creates another partition of the index component file to accommodate the overflow. When several file paths are specified with the INDEXPATH= option, index component files are created in the first available space, and then they overflow to the next path when the previous path is filled. Unlike metadata component files, index component files do not have to be in the primary path.

Storage of the Data Component Files

The data component files are the only files for which you can specify the partition size. Partitioned data can be processed in threads easily, taking full advantage of multiple CPUs on your computer. The partition size for the data component file is fixed. It is set when the data set is created. The default is 128 megabytes, but you can specify a different partition size using the PARTSIZE= option. Performance depends on appropriate partition sizes. You are responsible for knowing the sizes and uses of the data. SPD Engine data sets can be created with a partition size that results in a balanced number of observations. (For more information, see PARTSIZE= Data Set Option and PARTSIZE= LIBNAME Statement Option.)

Many data partitions can be created in each data path for a given data set. The SPD Engine uses the file paths that you specify with the DATAPATH= option to distribute partitions in a cyclic fashion. The SPD Engine creates the first data partition in one of the specified paths, the second partition in the next path, and so on. The SPD Engine continues to cycle through the file paths as many times as needed until all data partitions for the data set are stored. The file path for the first partition is selected at random. Assume that you specify the following in your LIBNAME statement:

datapath=('/data1' '/data2')

If the SPD Engine stores the first partition in /data1, it stores the second partition in /data2, the third partition in /data1, and so on. Cyclical distribution of the data partitions creates disk striping, which can be highly efficient. Disk striping is discussed in detail in “SPD Engine Disk I/O Setup” in Scalability and Performance at http://support.sas.com/rnd/scalability/spde/setup.html.
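The arithmetic behind the cyclic distribution can be worked through in a short DATA step. This is only an illustrative model, not the engine's internal logic. It assumes a hypothetical 1000 megabytes of data, the default 128-megabyte partition size, and two data paths. It also assumes that the cycle starts at the first path, whereas the SPD Engine selects the starting path at random.

```sas
/* Illustrative model only: estimate the number of data    */
/* partitions and show a cyclic assignment to data paths.  */
data _null_;
   data_mb     = 1000;   /* hypothetical amount of data, in megabytes */
   partsize_mb = 128;    /* default partition size, in megabytes      */
   npaths      = 2;      /* number of paths listed in DATAPATH=       */

   /* Each partition holds at most PARTSIZE_MB of data. */
   nparts = ceil(data_mb / partsize_mb);
   put 'Estimated partitions: ' nparts;

   /* Cycle through the paths: 1, 2, 1, 2, and so on. */
   do part = 1 to nparts;
      path_index = mod(part - 1, npaths) + 1;
      put 'Partition ' part 'is assigned to path ' path_index;
   end;
run;
```

With these assumed values, the data occupies eight partitions, four in each path. An estimate like this can help you judge how much space each data path needs.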

Initial Set of Paths

In the following example, the LIBNAME statement specifies the MyLib directory for the primary path. This path is used to store the metadata partitions. Other devices and directories are specified to store the data and index partitions.

libname myref spde 'Mylib'
   datapath=('/mydisk30' '/mydisk31')
   indexpath=('/mydisk36');

Assuming that all of the data sets created in the MyLib library were large enough to have several data partitions, they will all have their metadata in MyLib, their data in /mydisk30 and /mydisk31, and any indexes in /mydisk36. This specifically means that the metadata component files for those data sets include those pathnames.

Adding Subsequent Paths

Later, if more space is needed (for example, for appending more data), additional devices can be added for the data and index partitions, as in the following example:

libname myref spde 'Mylib'
   datapath=('/mydisk30' '/mydisk31' '/mydisk32') 
   indexpath=('/mydisk36' '/mydisk37');

All of the data sets created with the MyLib library will have their metadata in MyLib, their data in one or more of the three paths, and any indexes in /mydisk36 or /mydisk37. If data is appended to an existing data set, the new data goes in one or more of the three paths and the metadata component file is updated accordingly.

If one or more of the paths for the data or index partitions do not have much free space, you can exclude those paths from the LIBNAME statement the next time you specify it:

libname myref spde 'Mylib'
   datapath=('/mydisk31' '/mydisk32' '/mydisk33') 
   indexpath=('/mydisk37' '/mydisk38');

The SPD Engine is still able to access data sets that use the excluded paths because the data sets’ metadata includes all of the used paths.

Omitting Paths

If you need only to read the data sets in a library, you can specify the LIBNAME statement without the DATAPATH= and INDEXPATH= options, because all of the necessary path information is already in the metadata component files:

libname myref spde 'Mylib';

Renaming, Copying, or Moving Component Files

CAUTION:

Do not rename, copy, or move an SPD Engine data set or its component files using operating system commands.

You should always use the COPY procedure to copy SPD Engine data sets from one location to another or the DATASETS procedure to rename or delete SPD Engine data sets.
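The following sketch shows how these procedures might be used in place of operating system commands. All library paths and data set names (OldLib, NewLib, Sales, Sales_Archive, and Scratch) are hypothetical and are introduced only for illustration.

```sas
/* Hypothetical source and target SPD Engine libraries. */
libname oldlib spde '/disk1/oldmeta';
libname newlib spde '/disk7/newmeta'
   datapath=('/disk8/newdata');

/* Copy one data set, with all of its component files,  */
/* from OldLib to NewLib.                               */
proc copy in=oldlib out=newlib;
   select sales;
run;

/* Rename and delete data sets through SAS so that the  */
/* metadata component files stay consistent.            */
proc datasets lib=newlib nolist;
   change sales=sales_archive;   /* rename Sales to Sales_Archive */
   delete scratch;               /* delete a hypothetical data set */
quit;
```

Because the metadata component file records the pathnames of the data and index component files, going through these procedures lets the SPD Engine keep those pathnames accurate.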