Efficient Indexing in the SPD Engine

Parallel Indexing

Indexes can improve the performance of WHERE expression processing and BY expression processing. The SPD Engine enables the rapid creation and update of indexes because it can process them in parallel.

The SPD Engine's indexes are especially appropriate for data sets of varying sizes and data distributions. These indexes contain both a segmented view and a global view of indexed variables' values. This feature enables the SPD Engine to optimally support both of the following queries:

queries that require global data views, such as BY expression processing
queries that require segmented views, such as parallel processing of WHERE expressions

Parallel Index Creation

You can create indexes on your SPD Engine data in parallel, asynchronously. To enable asynchronous parallel index creation, use the ASYNCINDEX= Data Set Option.

Use this option with the DATA step INDEX= option and with the PROC DATASETS MODIFY statement when creating a data set that has several indexes. Either method enables all of the declared indexes to be populated from a single scan of the data set.

The following example shows indexes created in parallel using the DATA step. A simple index is created on variable X and a composite index is created on variables A and B.

data foo.mine(index=(x y=(a b)) asyncindex=yes);   
   x=1;
   a="Doe";
   b=20;
run;

To create multiple indexes in parallel, you must allocate enough utility disk space to create all of the key sorts at the same time. You must also allocate enough memory space. Use the SPDEUTILLOC= system option to allocate disk space and SPDEINDEXSORTSIZE system option in the configuration file or at invocation to allocate additional memory.

The DATASETS procedure has the flexibility to enable batched parallel index creation by using multiple MODIFY groups. Instead of creating all of the indexes at once, which would require a significant amount of space, you can create the indexes in groups as shown in the following example:

proc datasets lib=main;
   modify patients(asyncindex=yes);
      index create number;
      index create class;
   run;
   modify patients(asyncindex=yes)'
      index create lastname firstname;
   run;
   modify patients(asyncindex=yes);
      index create fullname=(lastname firstname);
      index create class_sex=(class sex);
   run;
quit;

Indexes Number and Class are created in parallel, indexes LastName and FirstName are created in parallel, and indexes FullName and Class_Sex are created in parallel.

Parallel Index Updates

The SPD Engine also supports parallel index updating during data set Append operations. Multiple threads enable updates of the data store and index files. The SPD Engine decomposes a data set Append or Insert operation into a set of steps that can be performed in parallel. The level of parallelism attained depends on the number of indexes in the data set. As with parallel index creation, this operation uses memory and disk space for the key sorts that are part of the index append processing. Use system options SPDEINDEXSORTSIZE= to allocate memory and SPDEUTILLOC= to allocate disk space.

Note: The ASYNCINDEX option is not valid for parallel index updates.