Understanding SAS Indexes

Definition of SAS Indexes

An index is an optional file that you can create for a SAS data file in order to provide direct access to specific observations. The index stores values in ascending value order for a specific variable or variables and includes information as to the location of those values within observations in the data file. In other words, an index enables you to locate an observation by value.
For example, suppose that you want the observation with SSN (Social Security number) equal to 123-45-6789:
  • Without an index, SAS accesses observations sequentially in the order in which they are stored in the data file. SAS reads each observation, looking for SSN=123–45–6789 until all observations are read.
  • With an index on variable SSN, SAS accesses the observation directly. SAS satisfies the condition using the index and goes straight to the observation that contains the value without having to read each observation.
You can either create an index when you create a data file or create an index for an existing data file. The data file can be either compressed or uncompressed. For each data file, you can create one or multiple indexes. Once an index exists, SAS treats it as part of the data file. That is, if you add or delete observations or modify values, the index is automatically updated.

Benefits of an Index

In general, SAS can use an index to improve performance in the following situations:
  • For WHERE processing, an index can provide faster and more efficient access to a subset of data. To process a WHERE expression, SAS by default decides whether to use an index or to read the data file sequentially.
  • For BY processing, an index returns observations in the index order, which is in ascending value order, without using the SORT procedure even when the data file is not stored in that order.
    Note: If you use the SORT procedure, the index is not used.
  • For the SET and MODIFY statements, the KEY= option enables you to specify an index in a DATA step to retrieve particular observations in a data file.
In addition, an index can benefit other areas of SAS. In SCL (SAS Component Language), an index improves the performance of table lookup operations. For the SQL procedure, an index enables the software to process certain classes of queries more efficiently (for example, join queries). For the SAS/IML software, you can explicitly specify that an index be used for read, delete, list, or Append operations.
Even though an index can reduce the time required to locate a set of observations, especially for a large data file, there are costs associated with creating, storing, and maintaining the index. When deciding whether to create an index, you must consider increased resource usage, along with the performance improvement.
Note: An index is never used for the subsetting IF statement in a DATA step, or for the FIND and SEARCH commands in the FSEDIT procedure.

The Index File

The index file is a SAS file that has the same name as its associated data file, and that has a member type of INDEX. There is only one index file per data file. That is, all indexes for a data file are stored in a single file.
The index file might be a separate file, or be part of the data file, depending on the operating environment. In any case, the index file is stored in the same SAS library as its data file.
The index file consists of entries that are organized hierarchically and connected by pointers, all of which are maintained by SAS. The lowest level in the index file hierarchy consists of entries that represent each distinct value for an indexed variable, in ascending value order. Each entry contains this information:
  • a distinct value
  • one or more unique record identifiers (referred to as a RID) that identifies each observation containing the value. (Think of the RID as an internal observation number.)
That is, in an index file, each value is followed by one or more RIDs, which identify the observations in the data file that contains the value. (Multiple RIDs result from multiple occurrences of the same value.) For example, the following represents index file entries for the variable LASTNAME:
Index File Entries
Value
Record Identifier
Avery
10
Brown
6, 22, 43
Craig
5, 50
Dunn
1
When an index is used to process a request, such as a WHERE expression, SAS performs a binary search on the index file and positions the index to the first entry that contains a qualified value. SAS then uses the value's RID to read the observation that contains the value. If a value has more than one RID (such as in the value for Brown in the previous example), SAS reads the observation that is pointed to by the next RID in the list. The result is that SAS can quickly locate the observations that are associated with a value or range of values.
For example, using an index to process the WHERE expression, SAS positions the index to the index entry for the first value greater than 20 and uses the value's RID or RIDs to read the observation or observations where age > 20 and age < 35;. SAS then moves sequentially through the index entries reading observations until it reaches the index entry for the value that is equal to or greater than 35.
SAS automatically keeps the index file balanced as updates are made, which means that it ensures a uniform cost to access any index entry, and all space that is occupied by deleted values is recovered and reused.

Types of Indexes

Simple and Composite Indexes

When you create an index, you designate which variable or variables to index. An indexed variable is called a key variable. You can create two types of indexes:
  • a simple index, which consists of the values of one variable
  • a composite index, which consists of the values of more than one variable, with the values concatenated to form a single value
In addition to deciding whether you want a simple index or a composite index, you can also limit an index (and its data file) to unique values and exclude from the index missing values.

Simple Index

The most common index is a simple index, which is an index of values for one key variable. The variable can be numeric or character. When you create a simple index, SAS assigns to the index the name of the key variable.
The following example shows the DATASETS procedure statements that are used to create two simple indexes for variables CLASS and MAJOR in data file COLLEGE.SURVEY:
proc datasets library=college;
   modify survey;
      index create class;
      index create major;
run;
To process a WHERE expression using an index, SAS uses only one index. When the WHERE expression has multiple conditions using multiple key variables, SAS determines which condition qualifies the smallest subset. For example, suppose that COLLEGE.SURVEY contains the following data:
  • 42,000 observations contain CLASS=12
  • 6,000 observations contain MAJOR='Biology'
  • 350 observations contain both CLASS=12 and MAJOR='Biology'
With simple indexes on CLASS and MAJOR, SAS would select MAJOR to process the following WHERE expression:
where class=12 and major='Biology';

Composite Index

A composite index is an index of two or more key variables with their values concatenated to form a single value. The variables can be numeric, character, or a combination. An example is a composite index for the variables LASTNAME and FRSTNAME. A value for this index consists of the value for LASTNAME immediately followed by the value for FRSTNAME from the same observation. When you create a composite index, you must specify a unique index name.
The following example shows the DATASETS procedure statements that are used to create a composite index for the data file COLLEGE.MAILLIST, specifying two key variables: ZIPCODE and SCHOOLID.
proc datasets library=college;
   modify maillist;
      index create zipid=(zipcode schoolid);
run;
Often, only the first variable of a composite index is used. For example, for a composite index on ZIPCODE and SCHOOLID, the following WHERE expression can use the composite index for the variable ZIPCODE because it is the first key variable in the composite index:
where zipcode = 78753;
However, you can take advantage of all key variables in a composite index by how you construct the WHERE expression, which is referred to as compound optimization. Compound optimization is the process of optimizing multiple WHERE expression conditions using a single composite index. If you issue the following WHERE expression, the composite index is used to find all occurrences where the ZIP code is 78753 and the school identification number is 55. In this way, all of the conditions are satisfied with a single search of the index:
where zipcode = 78753 and schoolid = 55;
When you are deciding whether to create a simple index or a composite index, consider how you will access the data. If you often access data for a single variable, a simple index will do. But if you frequently access data for multiple variables, a composite index could be beneficial.

Unique Values

Often it is important to require that values for a variable be unique, like Social Security number and employee number. You can declare unique values for a variable by creating an index for the variable and including the UNIQUE option. A unique index guarantees that values for one variable or the combination of a composite group of variables remain unique for every observation in the data file. If an update tries to add a duplicate value to that variable, the update is rejected.
The following example creates a simple index for the variable IDNUM and requires that all values for IDNUM be unique:
proc datasets library=college;
   modify student;
      index create idnum / unique;
run;

Missing Values

If a variable has a large number of missing values, it might be desirable to keep them from using space in the index. Therefore, when you create an index, you can include the NOMISS option to specify that missing values are not maintained by the index.
The following example creates a simple index for the variable RELIGION and specifies that the index does not maintain missing values for the variable:
proc datasets library=college;
   modify student;
      index create religion / nomiss;
run;
In contrast to the UNIQUE option, observations with missing values for the key variable can be added to the data file, even though the missing values are not added to the index.
SAS does not use an index that was created with the NOMISS option to process a BY statement or to process a WHERE expression that qualifies observations that contain missing values. If no missing values are present, SAS considers using the index in processing the BY statement or WHERE expression.
In the following example, the index AGE was created with the NOMISS option and observations exist that contain missing values for the variable AGE. In this case, SAS does not use the index:
proc print data=mydata.employee;
   where age < 35;
run;

Deciding Whether to Create an Index

Costs of an Index

An index exists to improve performance. However, an index conserves some resources at the expense of others. Therefore, you must consider costs associated with creating, using, and maintaining an index. The following topics provide information about resource usage and give you some guidelines for creating indexes.
When you decide whether to create an index, you must consider CPU cost, I/O cost, buffer requirements, and disk space requirements.

CPU Cost

Additional CPU time is necessary to create an index as well as to maintain the index when the data file is modified. That is, for an indexed data file, when a value is added, deleted, or modified, it must also be added, deleted, or modified in the appropriate index(es).
When SAS uses an index to read an observation from a data file, there is also increased CPU usage. The increased usage results from SAS using a more complicated process than is used when SAS retrieves data sequentially. Although CPU usage is greater, you benefit from SAS reading only those observations that meet the conditions. Note that increased CPU usage is why using an index is more expensive when there is a larger number of observations that meet the conditions.
Note: To compare CPU usage with and without an index, for some operating environments, you can issue the STIMER or FULLSTIMER system options in order to write performance statistics to the SAS log.

I/O Cost

Using an index to read observations from a data file can increase the number of I/O (input/output) requests compared to reading the data file sequentially. For example, processing a BY statement with an index might increase I/O count, but you save in not having to issue the SORT procedure. For WHERE processing, SAS considers I/O count when deciding whether to use an index.
  1. SAS does a binary search on the index file and positions the index to the first entry that contains a qualified value.
  2. SAS uses the value's RID (identifier) to directly access the observation containing the value. SAS transfers the observation between external storage to a buffer, which is the memory into which data is read or from which data is written. The data is transferred in pages, which is the amount of data (the number of observations) that can be transferred for one I/O request; each data file has a specified page size.
  3. SAS then continues the process until the WHERE expression is satisfied. Each time SAS accesses an observation, the data file page containing the observation must be read into memory if it is not already there. Therefore, if the observations are on multiple data file pages, an I/O operation is performed for each observation.
The result is that the more random the data, the more I/Os are required to use the index. If the data is ordered more like the index, which is in ascending value order, a smaller number of I/Os are required to access the data.
The number of buffers determines how many pages of data can simultaneously be in memory. Frequently, the larger the number of buffers, the smaller the number of I/Os that are required. For example, if the page size is 4096 bytes and one buffer is allocated, then one I/O transfers 4096 bytes of data (or one page). To reduce I/Os, you can increase the page size but you need a larger buffer. To reduce the buffer size, you can decrease the page size but you use more I/Os.
For information about data file characteristics like the data file page size and the number of data file pages, issue the CONTENTS procedure (or use the CONTENTS statement in the DATASETS procedure). With this information, you can determine the data file page size and experiment with different sizes. Note that the information that is available from PROC CONTENTS depends on the operating environment.
The BUFSIZE= data set option (or system option) sets the permanent page size for a data file when it is created. The page size is the amount of data that can be transferred for an I/O operation to one buffer. The BUFNO= data set option (or system option) specifies how many buffers to allocate for a data file and for the overall system for a given execution of SAS. That is, BUFNO= is not stored as a data set attribute.

Buffer Requirements

In addition to the resources that are used to create and maintain an index, SAS also requires additional memory for buffers when an index is actually used. Opening the data file opens the index file but none of the indexes. The buffers are not required unless SAS uses the index but they must be allocated in preparation for the index that is being used.
The number of buffers that are allocated depends on the number of levels in the index tree and in the data file open mode. If the data file is open for input, the maximum number of buffers is three; for update, the maximum number is four. (Note that these buffers are available for other uses; they are not dedicated to indexes.)
The IBUFSIZE= system option specifies the page size on disk for an index file when it is created. The default setting causes SAS to use the minimum optimal page size for the operating environment. Typically, you do not need to specify an index page size. However, there are situations that could require a different page size. For more information, see the IBUFSIZE= System Option in SAS System Options: Reference.
The IBUFNO= system option specifies an optional number of extra buffers to be allocated when navigating an index file. SAS automatically allocates a minimal number of buffers. Typically, you do not need to specify extra buffers. However, using IBUFNO= to specify extra buffers could improve execution time by limiting the number of input/output operations that are required for a particular index file. The improvement in execution time, however, comes at the expense of increased memory consumption. For more information, see the IBUFNO= System Option in SAS System Options: Reference.

Disk Space Requirements

Additional disk space is required to store the index file. This file might show up as a separate file or appear to be part of the data file, depending on the operating environment.
For information about the index file size, issue the CONTENTS procedure (or the CONTENTS statement in the DATASETS procedure). Note that the available information from PROC CONTENTS depends on the operating environment.

Guidelines for Creating Indexes

Data File Considerations

  • For a small data file, sequential processing is often just as efficient as index processing. Do not create an index if the data file page count is less than three pages. It would be faster to access the data sequentially. To see how many pages are in a data file, use the CONTENTS procedure (or use the CONTENTS statement in the DATASETS procedure). Note that the information that is available from PROC CONTENTS depends on the operating environment.
  • Consider the cost of an index for a data file that is frequently changed. If you have a data file that changes often, the overhead associated with updating the index after each change can outweigh the processing advantages you gain from accessing the data with an index.
  • Create an index when you intend to retrieve a small subset of observations from a large data file (for example, less than 25% of all observations). When this occurs, the cost of processing data file pages is lower than the overhead of sequentially reading the entire data file. The smaller the subset, the larger the performance gains.
  • To reduce the number of I/Os performed when you create an index, first sort the data by the key variable. Then to improve performance, maintain the data file in sorted order by the key variable. This technique reduces the I/Os by grouping like values together. That is, the more ordered the data file is with respect to the key variable, the more efficient the use of the index. If the data file has more than one index, sort the data by the most frequently used key variable.
  • An index might not be necessary to optimize a WHERE expression if the data is sorted appropriately in order to satisfy the condition. To process a WHERE expression without an index, SAS first checks for the sort indicator that is stored with the file from a previous SORT procedure. If the sort indicator is appropriate, SAS stops reading the file once there are no more values that satisfy the WHERE expression. For example, consider a file that is sorted by Age, without an index. To process the expression where age le 25, SAS stops reading observations after it finds an observation that is greater than 25. Note that while SAS can determine when to stop reading observations, if there is no index, there is no indication where to begin. Without an index, SAS always begins with the first observation, which can require reading a lot of observations.

Index Use Considerations

  • Keep the number of indexes per data file to a minimum to reduce disk storage and to reduce update costs.
  • Consider how often your applications use an index. An index must be used often in order to make up for the resources that are used in creating and maintaining it. That is, do not rely solely on resource savings from processing a WHERE expression. Take into consideration the resources that it takes to actually create the index and to maintain it every time the data file is changed.
  • When you create an index to process a WHERE expression, do not try to create one index that is used to satisfy all queries. If there are several variables that appear in queries, those queries might be best satisfied with simple indexes on the most discriminating of those variables.

Key Variable Candidates

In most cases, multiple variables are used to query a data file. However, it probably would be a mistake to index all variables in a data file, as certain variables are better candidates than others:
  • The variables to be indexed should be variables that are used in queries. That is, your application should require selecting small subsets from a large file, and the most common selection variables should be considered as candidate key variables.
  • A variable is a good candidate for indexing when the variable can be used to precisely identify the observations that satisfy a WHERE expression. That is, the variable should be discriminating, which means that the index should select the fewest possible observations. For example, variables such as AGE, FRSTNAME, and GENDER are not discriminating because it is possible for a large representation of the data to have the same age, first name, and gender. However, a variable such as LASTNAME is a good choice because it is less likely that many employees share the same last name.
    For example, consider a data file with variables LASTNAME and GENDER.
    • If many queries against the data file include LASTNAME, then indexing LASTNAME could prove to be beneficial because the values are usually discriminating. However, the same reasoning would not apply if you issued a large number of queries that included GENDER. The GENDER variable is not discriminating (because perhaps half the population is male and half is female).
    • However, if queries against the data file most often include both LASTNAME and GENDER as shown in the following WHERE expression, then creating a composite index on LASTNAME and GENDER could improve performance.
      where lastname='LeVoux' and gender='F';
      Note that when you create a composite index, the first key variable should be the most discriminating.

Creating an Index

Overview of Creating Indexes

You can create one index for a data file, which can be either a simple index or a composite index, and you can create multiple indexes, which can be multiple simple indexes, multiple composite indexes, or a combination of both simple and composite.
  1. You request to create an index for one or multiple variables using a method such as the INDEX CREATE statement in the DATASETS procedure.
  2. SAS reads the data file one observation at a time, extracts values and RIDs for each key variable, and places them in the index file.
SAS ensures that the values that are placed in the index are successively the same or increasing. SAS determines whether the data is already sorted by the key variables in ascending order by checking the sort indicator in the data file, which is an attribute of the file that indicates how the data is sorted. The sort indicator is stored with the SAS data file descriptor information and is set from a previous SORT procedure or SORTEDBY= data set option.
If the values in the sort indicator are in ascending order, SAS does not sort the values for the index file and avoids the resource. Note that SAS always validates that the data is sorted as indicated. If not, the index is not created. For example, if the sort indicator was set from a SORTEDBY= data set option and the data is not sorted as indicated, an error occurs and a message is written to the SAS log stating that the index was not created because values are not sorted in ascending order.
If the values in the sort indicator are not in ascending order, SAS sorts the data that is included in the index file in ascending value order. To sort the data, SAS follows this procedure:
  1. SAS first attempts to sort the data using the thread-enabled sort. By dividing the sorting into separately executable processes, the time to sort the data can be reduced. To use the thread-enabled sort, the index must be sufficiently large (which is determined by SAS), the SAS system option CPUCOUNT= must be set to more than one processor, and the THREADS system option must be enabled. Adequate memory must be available for the thread-enabled sort. If not enough memory is available, SAS reduces the number of threads to one and begins the sort process again, which increases the time to create the index.
  2. If the thread-enabled sort cannot be done, SAS uses the unthreaded sort.
Note: To display messages regarding what type of sort is used, memory and resource information, and the status of the index being created, set the SAS system option MSGLEVEL=I; that is:
options msglevel=i;

Using the DATASETS Procedure

The DATASETS procedure provides statements that enable you to create and delete indexes. In the following example, the MODIFY statement identifies the data file, the INDEX DELETE statement deletes two indexes, and the two INDEX CREATE statements specify the variables to index, with the first INDEX CREATE statement specifying the options UNIQUE and NOMISS:
proc datasets library=mylib;
modify employee;
   index delete salary age;
   index create empnum / unique nomiss;
   index create names=(lastname frstname);
Note: If you delete and create indexes in the same step, place the INDEX DELETE statement before the INDEX CREATE statement so that space occupied by deleted indexes can be reused during index creation.

Using the INDEX= Data Set Option

To create indexes in a DATA step when you create the data file, use the INDEX= data set option. The INDEX= data set option also enables you to include the NOMISS and UNIQUE options. The following example creates a simple index on the variable STOCK and specifies UNIQUE:
data finances(index=(stock /unique));
The next example uses the variables SSN, CITY, and STATE to create a simple index named SSN and a composite index named CITYST:
data employee(index=(ssn cityst=(city state)));

Using the SQL Procedure

The SQL procedure supports index creation and deletion and the UNIQUE option. Note that the variable list requires that variable names be separated by commas (which is an SQL convention) instead of blanks (which is a SAS convention).
The DROP INDEX statement deletes indexes. The CREATE INDEX statement specifies the UNIQUE option, the name of the index, the target data file, and the variable or variables to be indexed. For example:
drop index salary from employee;
create unique index empnum on employee (empnum);
create index names on employee (lastname, frstname);

Using Other SAS Products

You can also create and delete indexes using other SAS utilities and products, such as SAS/CONNECT software, SAS/IML software, SAS Component Language, and SAS/Warehouse Administrator software.

Using an Index for WHERE Processing

Overview of Using an Index for WHERE Processing

WHERE processing conditionally selects observations for processing when you issue a WHERE expression. Using an index to process a WHERE expression improves performance and is referred to as optimizing the WHERE expression.
To process a WHERE expression, by default SAS decides whether to use an index or read all the observations in the data file sequentially. To make this decision, SAS does the following:
  1. Identifies an available index or indexes.
  2. Estimates the number of observations that would be qualified. If multiple indexes are available, SAS selects the index that returns the smallest subset of observations.
  3. Compares resource usage to decide whether it is more efficient to satisfy the WHERE expression by using the index or by reading all the observations sequentially.
    Note: SAS considers several factors when deciding whether to use an index. Therefore, experimentation is the best way to determine the optimal performance. If you have a WHERE expression that is used repeatedly, compare the results using an index and without an index in order to determine which method provides the best performance. You can control index usage with the IDXWHERE= and IDXNAME= data set options. See Controlling WHERE Processing Index Usage with Data Set Options.

Identifying an Available Index or Indexes

The first step for SAS in deciding whether to use an index to process a WHERE expression is to identify if the variable or variables included in the WHERE expression are key variables (that is, have an index). Even though a WHERE expression can consist of multiple conditions that specify different variables, SAS uses only one index to process the WHERE expression. SAS selects the index that satisfies the most conditions and qualifies the fewest observations:
  • Usually SAS selects one condition. The variable specified in the condition has either a simple index or is the first key variable in a composite index.
  • However, you can take advantage of multiple key variables in a composite index by constructing an appropriate WHERE expression, referred to as compound optimization. See Compound Optimization.
SAS attempts to use an index for the following types of conditions:
WHERE Conditions That Can Be Optimized
Condition
Valid for Compound Optimization
Examples
comparison operators, which include the EQ operator; directional comparisons like less than or greater than; and the IN operator
yes
where empnum eq 3374;
where empnum < 2000;
where state in ('NC','TX');
comparison operators with NOT
yes
where empnum ^= 3374;
where x not in (5,10);
comparison operators with the colon modifier
yes
where lastname gt: 'Sm';
CONTAINS operator
no
where lastname contains 'Sm';
fully bounded range conditions specifying both an upper and lower limit, which includes the BETWEEN-AND operator
yes
where 1 < x < 10;
where empnum between 500 and 1000;
pattern-matching operators LIKE and NOT LIKE
no
where frstname like '%Rob_%'
IS NULL or IS MISSING operator
no
where name is null;
where idnum is missing;
TRIM function
no
where trim(state)='Texas';
SUBSTR (left of =) function in the form of:
WHERE SUBSTR (variable, position <,length>)='string';
when the following conditions are met:
position specifies a numeric constant for the beginning character position that is less than or equal to the variable length.
length specifies a numeric constant for the length of string . The length plus position cannot be larger than the variable length plus 1.
no
where substr (month,4,5)='ember' and (city='Charleston' or city='Atlanta');
Note: Conditions are not optimized with an index for arithmetic operators, a variable-to-variable condition, and the sounds-like operator.
The following examples illustrate optimizing a single condition:
  • The following WHERE expressions could use a simple index on the variable MAJOR:
    where major in ('Biology', 'Chemistry', 'Agriculture');
    where class=11 and major in ('Biology', 'Agriculture');
  • With a composite index on variables ZIPCODE and SCHOOLID, SAS could use the composite index to satisfy the following conditions because ZIPCODE is the first key variable in the composite index:
    where zipcode = 78753;
    However, the following condition cannot use the composite index because the variable SCHOOLID is not the first key variable in the composite index:
    where schoolid gt 1000;

Compound Optimization

Compound optimization is the process of optimizing multiple WHERE expression conditions using a single composite index. Using a single index to optimize the conditions can greatly improve performance.
For example, suppose there is a composite index for LASTNAME and FRSTNAME. If you execute the following WHERE expression, SAS uses the concatenated values for the first two variables, then SAS further evaluates each qualified observation for the EMPID value:
where lastname eq 'Smith' and frstname eq 'John' and empid=3374;
For compound optimization to occur, all of the following must be true.
  • At least the first two key variables in the composite index must be used in valid WHERE expression conditions. For a list of conditions that are valid for compound optimization, see WHERE Conditions That Can Be Optimized.
  • At least one condition must use the EQ or IN operator. For example, you cannot have all range conditions.
  • The conditions must be connected with the AND or the OR logical operator:
    • When conditions are connected with AND, the conditions can occur in any order. For example:
      where lastname eq 'Smith'and frstname eq 'John';
    • When conditions are connected with OR, the conditions must specify the same variable. For example:
      where frstname eq 'John' and 
         (lastname eq 'Smith' or lastname eq 'Jones');
      Note: SAS transforms the OR conditions that specify the same variable into a single condition that uses the IN operator. For the above WHERE expression, SAS converts the two OR conditions into lastname IN ('Smith','Jones'), and then uses the composite index for the variables Frstname and Lastname in order to select the observations where Frstname is John and Lastname is Smith or Jones.
For the following examples, assume there is a composite index for variables I, J, and CH:
  • The following WHERE expression conditions are compound optimized because every condition specifies a variable that is in the composite index, and each condition uses one of the supported operators. SAS positions the composite index to the first entry that meets all three conditions and retrieves only observations that satisfy all three conditions.
    where I = 1 and J not in (3,4) and 'abc' < CH;
  • For the following WHERE expression, the first two conditions are compound optimized. After retrieving a subset of observations that satisfy the first two conditions, SAS examines the subset and eliminates any observations that fail to match the third condition.
    where I in (1,4) and J = 5 and K like '%c';
  • This WHERE expression can be compound optimized for variables I and J. After retrieving observations that satisfy the second and third conditions, SAS examines the subset and eliminates those observations that do not satisfy the first condition.
    where X < 5 and I = 1 and J = 2;
  • The following WHERE expression can be compound optimized on I and J:
    where X < Z and I = 1 and J = 2;
  • The following WHERE expression cannot be compound optimized neither J or K is the left-most variable in the composite index:
    where J = 1 and K = 2;
  • The following WHERE expression cannot be optimized because the comparison condition on the variable I is variable-to-variable, which is not supported for index processing:
    where I < K and J in (3,4) and CH = 'abc';
Compound optimization can occur for a NOMISS composite index as long as at least one condition does not qualify missing values. That is, compound optimization cannot occur on a NOMISS index, which is an index that does not maintain missing values, if every condition could result in a missing value. The following examples illustrate compound optimization with a NOMISS composite index for variables I, J, and K.
  • The following WHERE expression can be compound optimized, because the condition K = 1 cannot result in a missing value:
    where I in (.,5) and J < 4 and K = 1;
  • This WHERE expression cannot be compound optimized, because each condition could result in a missing value:
    where I in (.,5) and J < 4 and K <= 1;
  • The following WHERE expression cannot be compound optimized, because each condition could result in a missing value. The condition J < 4 qualifies observations as J = ., and those observations are not represented in the NOMISS composite index:
    where I = . and J < 4 and .A < K < .D;

Estimating the Number of Qualified Observations

Once SAS identifies the index or indexes that can satisfy the WHERE expression, the software estimates the number of observations that will be qualified by an available index. When multiple indexes exist, SAS selects the one that seems to produce the fewest qualified observations.
SAS estimates the number of observations that will be qualified by using stored statistics called cumulative percentiles (or centiles for short). Centiles information represents the distribution of values in an index so that SAS does not have to assume a uniform distribution. To print centiles information for an indexed data file, include the CENTILES option in PROC CONTENTS (or in the CONTENTS statement in the DATASETS procedure).
Note that, by default, SAS does not update centiles information after every data file change. When you create an index, you can include the UPDATECENTILES option to specify when centiles information is updated. That is, you can specify that centiles information be updated every time the data file is closed, when a certain percentage of values for the key variable have been changed, or never. In addition, you can also request that centiles information is updated immediately, regardless of the value of UPDATECENTILES, by issuing the INDEX CENTILES statement in PROC DATASETS.
As a general rule, SAS uses an index if it estimates that the WHERE expression will select approximately one-third or less of the total number of observations in the data file.
Note: For performance purposes, the following can occur when SAS estimates the number of qualified observations:
  • If the number of qualified observations is less than 3% of the data file (or if no observations are qualified), SAS automatically uses the index, and does not bother comparing resource usage.
  • If all of the observations are qualified, by default SAS does not use the index unless the IDXNAME= or IDXWHERE= data set option is specified.

Comparing Resource Usage

Once SAS estimates the number of qualified observations and selects the index that qualifies the fewest observations, SAS must then decide whether it is faster (cheaper) to satisfy the WHERE expression by using the index or by reading all of the observations sequentially. SAS makes this determination as follows:
  • If only a few observations are qualified, it is more efficient to use the index than to do a sequential search of the entire data file.
  • If most or all of the observations qualify, then it is more efficient to simply sequentially search the data file than to use the index.
This decision is much like a reader deciding whether to use an index at the back of a document. A document's index is designed to enable a reader to locate a topic along with the specific page number. Using the index, the reader would go to a specific page number and read only about a specific topic. If the document covers 42 topics and the reader is interested in only a couple of topics, then the index saves time by preventing the reader from reading other topics. However, if the reader is interested in 39 topics, searching the index for each topic would take more time than simply reading the entire document.
To compare resource usage, SAS does the following:
  1. SAS predicts the number of I/Os that it takes to satisfy the WHERE expression using the index. To do so, SAS positions the index to the first entry that contains a qualified value. In a buffer management simulation that takes into account the current number of available buffers, the RIDs (identifiers) on that index page are processed, indicating how many I/Os it takes to read the observations in the data file.
    If the observations are randomly distributed throughout the data file, the observations are located on multiple data file pages. This means that an I/O is needed for each page. Therefore, the more random the data in the data file, the more I/Os it takes to use the index. If the data in the data file is ordered more like the index, which is in ascending value order, a smaller number of I/Os are needed to use the index.
  2. SAS calculates the I/O cost of a sequential pass of the entire data file and compares the two resource costs.
Factors that affect the comparison include the size of the subset relative to the size of the data file, data file value order, data file page size, the number of allocated buffers, and the cost to uncompress a compressed data file for a sequential read.
Note: If comparing resource costs results in a tie, SAS chooses the index.

Controlling WHERE Processing Index Usage with Data Set Options

You can control index usage for WHERE processing with the IDXWHERE= and IDXNAME= data set options.
The IDXWHERE= data set option overrides the software's decision regarding whether to use an index to satisfy the conditions of a WHERE expression as follows:
  • IDXWHERE=YES tells SAS to decide which index is the best for optimizing a WHERE expression, disregarding the possibility that a sequential search of the data file might be more resource efficient.
  • IDXWHERE=NO tells SAS to ignore all indexes and satisfy the conditions of a WHERE expression by sequentially searching the data file.
  • Using an index to process a BY statement cannot be overridden with IDXWHERE=.
The following example tells SAS to decide which index is the best for optimizing the WHERE expression. SAS disregards the possibility that a sequential search of the data file might be more resource efficient.
data mydata.empnew;
   set mydata.employee (idxwhere=yes);
   where empnum < 2000;
For details, see the IDXWHERE data set option in SAS Data Set Options: Reference.
The IDXNAME= data set option directs SAS to use a specific index in order to satisfy the conditions of a WHERE expression.
By specifying IDXNAME=index-name, you are specifying the name of a simple or composite index for the data file.
The following example uses the IDXNAME= data set option to direct SAS to use a specific index to optimize the WHERE expression. SAS disregards the possibility that a sequential search of the data file might be more resource efficient. SAS does not attempt to determine whether the specified index is the best one. (Note that the EMPNUM index was not created with the NOMISS option.)
data mydata.empnew;
   set mydata.employee (idxname=empnum);
   where empnum < 2000;
For details, see the IDXNAME data set option in SAS Data Set Options: Reference.
Note: IDXWHERE= and IDXNAME= are mutually exclusive. Using both options results in an error.

Displaying Index Usage Information in the SAS Log

To display information in the SAS log regarding index usage, change the value of the MSGLEVEL= system option from its default value of N to I. When you issue options msglevel=i;, the following occurs:
  • If an index is used, a message displays the name of the index.
  • If an index is not used but one exists that could optimize at least one condition in the WHERE expression, messages provide suggestions as to what you can do to influence SAS to use the index. For example, a message could suggest sorting the data file into index order or specifying more buffers.
  • A message displays the IDXWHERE= or IDXNAME= data set option value if the setting can affect index processing.

Using an Index with SAS Views

You cannot create an index for a SAS view; it must be a data file. However, if a SAS view is created from an indexed data file, index usage is available. That is, if the view definition includes a WHERE expression using a key variable, then SAS attempts to use the index. , There are other ways to take advantage of a key variable when using a SAS view.
In this example, you create an SQL view named STAT from data file CRIME, which has the key variable STATE. In addition, the view definition includes a WHERE expression:
proc sql;
   create view stat as
   select * from crime
   where murder > 7;
quit;
If you issue the following PRINT procedure, which refers to the SQL view, along with a WHERE statement that specifies the key variable STATE, SAS cannot optimize the WHERE statement with the index. SQL views cannot join a WHERE expression that was defined in the view to a WHERE expression that was specified in another procedure, DATA step, or SCL:
proc print data=stat;
   where state > 42;
run;
However, if you issue PROC SQL with an SQL WHERE clause that specifies the key variable STATE, then the SQL view can join the two conditions, which enables SAS to use the index STATE:
proc sql;
select * from stat where state > 42;
quit;

Using an Index for BY Processing

BY processing enables you to process observations in a specific order according to the values of one or more variables that are specified in a BY statement. Indexing a data file enables you to use a BY statement without sorting the data file. By creating an index based on one or more variables, you ensure that observations are processed in ascending numeric or character order. Specify in the BY statement the variable or list of variables that are indexed.
For example, if an index exists for LASTNAME, the following BY statement would use the index to order the values by last names:
proc print;
   by lastname;
When you specify a BY statement, SAS looks for an appropriate index. If one exists, the software automatically retrieves the observations from the data file in indexed order.
A BY statement uses an index in the following situations:
  • The BY statement consists of one variable that is the key variable for a simple index or the first key variable in a composite index.
  • The BY statement consists of two or more variables and the first variable is the key variable for a simple index or the first key variable in a composite index.
For example, if the variable MAJOR has a simple index, the following BY statements use the index to order the values by MAJOR:
by major;
by major state;
If a composite index named ZIPID exists consisting of the variables ZIPCODE and SCHOOLID, the following BY statements use the index:
by zipcode;
by zipcode schoolid;
by zipcode schoolid name;
However, the composite index ZIPID is not used for these BY statements:
by schoolid;
by schoolid zipcode;
In addition, a BY statement does not use an index in these situations:
  • The BY statement includes the DESCENDING or NOTSORTED option.
  • The index was created with the NOMISS option.
  • The data file is physically stored in sorted order based on the variables specified in the BY statement.
Note: Using an index to process a BY statement might not always be more efficient than simply sorting the data file, particularly if the data file has a high blocking factor of observations per page. Therefore, using an index for a BY statement is generally for convenience, not performance.

Using an Index for Both WHERE and BY Processing

If both a WHERE expression and a BY statement are specified, SAS looks for one index that satisfies requirements for both. If such an index is not found, the BY statement takes precedence.
With a BY statement, SAS cannot use an index to optimize a WHERE expression if the optimization would invalidate the BY order. For example, the following statements could use an index on the variable LASTNAME to optimize the WHERE expression because the order of the observations returned by the index does not conflict with the order required by the BY statement:
proc print;
  by lastname;
  where lastname >= 'Smith';
run;
However, the following statements cannot use an index on LASTNAME to optimize the WHERE expression because the BY statement requires that the observations be returned in EMPID order:
proc print;
   by empid;
   where lastname = 'Smith';
run;

Specifying an Index with the KEY= Option for SET and MODIFY Statements

The SET and MODIFY statements provide the KEY= option, which enables you to specify an index in a DATA step to retrieve particular observations in a data file.
The following MODIFY statement shows how to use the KEY= option to take advantage of the fact that the data file INVTY.STOCK has an index on the variable PARTNO. Using the KEY= option tells SAS to use the index to directly access the correct observations to modify.
modify invty.stock key=partno;
Note: A BY statement is not allowed in the same DATA step with the KEY= option, and WHERE processing is not allowed for a data file with the KEY= option.

Taking Advantage of an Index

Applications that typically do not use indexes can be rewritten to take advantage of an index. For example:
  • Consider replacing a subsetting IF statement (which never uses an index) with a WHERE statement.
    CAUTION:
    However, be careful because IF and WHERE statements are processed differently and might produce different results in DATA steps that use the SET, MERGE, or UPDATE statements.
    This is because the WHERE statement selects observations before they are brought into the Program Data Vector (PDV), whereas the subsetting IF statement selects observations after they are read into the PDV.
  • Consider using the WHERE command in the FSEDIT procedure in place of the SEARCH and FIND commands.

Procedures and SAS Operations That Maintain Indexes

Displaying Data File Information

The CONTENTS procedure (or the CONTENTS statement in PROC DATASETS) reports the following types of information.
  • number and names of indexes for a data file
  • the names of key variables
  • the options in effect for each key variable
  • data file page size
  • number of data file pages
  • centiles information (using the CENTILES option)
  • amount of disk space used by the index file
Note: The available information depends on the operating environment.
Output of PROC CONTENTS
Output of PROC CONTENTS
Output of PROC CONTENTS
Output of PROC CONTENTS

Copying an Indexed Data File

When you copy an indexed data file with the COPY procedure (or the COPY statement of the DATASETS procedure), you can specify whether the procedure also recreates the index file for the new data file with the INDEX=YES|NO option; the default is YES, which recreates the index. However, recreating the index does increase the processing time for the PROC COPY step.
If you copy from disk to disk, the index is recreated. If you copy from disk to tape, the index is not recreated on tape. However, after copying from disk to tape, if you then copy back from tape to disk, the index can be recreated. Note that if you move a data file with the MOVE option in PROC COPY, the index file is deleted from IN= library and recreated in OUT= library.
The CPORT procedure also has INDEX=YES|NO to specify whether to export indexes with indexed data files. By default, PROC CPORT exports indexes with indexed data files. The CIMPORT procedure, however, does not handle the index file at all, and the index(es) must be recreated.

Updating an Indexed Data File

Each time that values in an indexed data file are added, modified, or deleted, SAS automatically updates the index. The following activities affect an index as indicated:
Maintenance Tasks and Index Results
Task
Result
delete a data set
index file is deleted
rename a data set
index file is renamed
rename key variable
simple index is renamed
delete key variable
simple index is deleted
add observation
index entries are added
delete observations
index entries are deleted and space is recovered for reuse
update observations
index entries are deleted and new ones are inserted
Note: Use SAS to perform additions, modifications, and deletions to your data sets. Using operating environment commands to perform these operations makes your files unusable.

Sorting an Indexed Data File

You can sort an indexed data file only if you direct the output of the SORT procedure to a new data file so that the original data file remains unchanged. However, the new data file is not automatically indexed.
Note: If you sort an indexed data file with the FORCE option, the index file is deleted.

Adding Observations to an Indexed Data File

Adding observations to an indexed data file requires additional processing. SAS automatically keeps the values in the index consistent with the values in the data file.

Multiple Occurrences of Values

An index that is created without the UNIQUE option can result in multiple occurrences of the same value, which results in multiple RIDs for one value. For large data files with many multiple occurrences, the list of RIDs for a given value might require several pages in the index file. Because the RIDs are stored in physical order, any new observation added to the data file with the given value is stored at the end of the list of RIDs. Navigating through the index to find the end of the RID list can cause many I/O operations.
SAS remembers the previous position in the index so that when inserting more occurrences of the same value, the end of the RID list is found quickly.

Appending Data to an Indexed Data File

SAS provides performance improvements when appending a data file to an indexed data file. SAS suspends index updates until all observations are added, and then updates the index with data from the newly added observations. See the APPEND statement in the DATASETS procedure in Base SAS Procedures Guide.

Recovering a Damaged Index

An index can become damaged for many of the same reasons that a data file or catalog can become damaged. If a data file becomes damaged, use the REPAIR statement in PROC DATASETS to repair the data file or recreate any missing indexes. For example:
proc datasets library=mylib;
   repair mydata;
run;

Indexes and Integrity Constraints

Integrity constraints can also use indexes. When an integrity constraint is created that uses an index, if a suitable index already exists, it is used. Otherwise, a new index is created. When an index is created, it is marked as being “owned” by the creator, which can be either the user or an integrity constraint.
If either the user or an integrity constraint requests creation of an index that already exists and is owned by the other, the requestor is also marked as an “owner” of the index. If an index is owned by both, then a request by either to delete the index results in removing only the requestor as owner. The index is deleted only after both the integrity constraint and the user have requested the index's deletion. A note in the log indicates when an index cannot be deleted.

Indexes and CEDA Processing

When processing a SAS data file with CEDA, indexes are not supported. For example, if you move a SAS data file with a defined index from one operating environment like Windows to a different operating environment like UNIX, CEDA translates the file for you, but the index is not available. Therefore, WHERE optimization for the file is not supported.