The hybrid index uses two data files. The .hbx file contains the global portion of the hybrid index. You can estimate space consumption approximately for the .hbx component of a hybrid index as follows:

If the index is NOT unique:

number_of_discrete_values_in_the_index * (22.5 + (length_of_columns_composing_the_index))

If the index IS unique:

number_of_discrete_values_in_the_index * (6 + (length_of_columns_composing_the_index))
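The two estimates above can be sketched as a small helper. This is a minimal illustration, not SPD Server code; the function and argument names are hypothetical, and the 22.5-byte and 6-byte per-value overheads come from the formulas above.

```python
def estimate_hbx_bytes(n_discrete_values, key_length, unique):
    """Estimate the size in bytes of the .hbx component of a hybrid index.

    n_discrete_values: number of discrete key values in the index
    key_length: total length of the columns composing the index
    unique: True for a unique index, False otherwise
    """
    # Per-value overhead: 6 bytes for a unique index, 22.5 otherwise.
    overhead = 6.0 if unique else 22.5
    return n_discrete_values * (overhead + key_length)
```

For instance, a nonunique index on an 8-byte column with 1024 discrete values would be estimated at 1024 * (22.5 + 8) = 31232 bytes.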
The .idx file contains the per-value segment lists and the bitmap portion of the hybrid index. Estimating disk space consumption for this file is much more difficult than for the .hbx file, because the .idx file size depends on the distribution of the key values across the rows of the table. The size also depends on the number of updates and appends performed on the index. The .idx files of an indexed table initially created with "n" rows consume considerably less space than the .idx files of an identical table that is created and then has several appends or updates performed afterward. The wasted space in the latter case can be reclaimed by reorganizing the index.
With the above in mind,
a worst case estimate for space consumption of the .idx component
of a hybrid index is:
8192 + (number_of_discrete_values_in_more_than_one_obs * (24 +
(avg_number_of_segments_per_value * (16 + (seg_size / 8)))))
This estimate does not
consider the compression factor for the bitmaps, which could be substantial.
The fewer occurrences of a value in a given segment, the more the
bitmap for that segment can be compressed. The uncompressed bitmap
size is the (seg_size/8) component of the algorithm.
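The worst-case arithmetic can be sketched as follows. The helper name and arguments are hypothetical; as noted above, the estimate deliberately ignores bitmap compression, so real sizes can be much smaller.

```python
def estimate_idx_bytes_worst_case(n_multi_obs_values,
                                  avg_segments_per_value,
                                  seg_size):
    """Worst-case estimate in bytes of the .idx component of a hybrid index.

    n_multi_obs_values: discrete values occurring in more than one observation
    avg_segments_per_value: average number of segments containing each value
    seg_size: rows per segment; seg_size / 8 is the uncompressed bitmap size
    """
    per_value = 24 + avg_segments_per_value * (16 + seg_size / 8)
    return 8192 + n_multi_obs_values * per_value
```

A unique index has no values in more than one observation, so the first argument is 0 and the estimate collapses to the fixed 8192 bytes.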
To estimate the disk
usage for a nonunique hybrid index on a column with a length of 8,
where the column contains 1024 discrete values, and each value exists
in an average of 4 segments, where each segment occupies 8192 rows,
the calculation would be:
.hbx_size = 1024 * (22.5 + 8) = 31232 bytes
.idx_size = 8192 + (1024 * (24 + (4 * (16 + (8192/8))))) = 4292608 bytes
To estimate the disk usage of a unique hybrid index on a column with a length of 8 that contains 100000 discrete values, the calculation would be:
.hbx_size = 100000 * (6 + 8) = 1400000 bytes
.idx_size = 8192 + (0 * (...)) = 8192 bytes
Note: The size of the .idx file
for a unique index will always be 8192 bytes because the unique index
contains no values that are in more than one observation.
There is a hidden workspace requirement when creating indexes or appending to indexed tables in SPD Server. It arises because SPD Server sorts the rows of the table by the key value before adding the key values to the hybrid index. Sorting greatly improves index create and append performance, but at the price of temporary disk space to hold the sorted keys while the index create or append is in progress. This workspace is controlled for SPD Server by the WORKPATH= parameter in the SPD Server host parameter file.
You can estimate workspace
requirements for index creation as follows for a given index "x":
SpaceBytes ~ #rows * SortLength(x)
where #rows =
Number of rows in the table if creating; number of rows in the append
if appending.
if KeyLength(x) >= 20 bytes
SortLength(x) = (4 + KeyLength(x))
if KeyLength(x) < 20 bytes
SortLength(x) = 4 + (4 * floor((KeyLength(x) + 3) / 4))
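The two SortLength cases can be written as a short sketch (hypothetical helper names). The floor((KeyLength + 3) / 4) term rounds a short key length up to the next multiple of 4 bytes.

```python
def sort_length(key_length):
    """Bytes per sorted key for index workspace estimation."""
    if key_length >= 20:
        return 4 + key_length
    # Round the key length up to a multiple of 4 bytes, then add 4.
    return 4 + 4 * ((key_length + 3) // 4)

def workspace_bytes(n_rows, key_length):
    """Approximate workspace: one sorted key per row created or appended."""
    return n_rows * sort_length(key_length)
```

For a composite index, key_length is the sum of the lengths of the columns composing the index.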
For example, consider
the following SAS code:
DATA foo.test ;
   length xc $15 ;
   do x=1 to 1000 ;
      xc = left(x) ;
      output ;
   end ;
run ;

PROC DATASETS lib=foo ;
   modify test ;
   index create x xc xxc=(x xc) ;
quit ;
For index X, space would
be:
SpaceBytes = 1000 * (4 + (4 * floor((8 + 3) / 4)))
= 1000 * (4 + (4 * floor(11 / 4)))
= 1000 * (4 + 4 * 2)
= 12,000
For index XC, space
would be:
SpaceBytes = 1000 * (4 + (4 * floor((15 + 3) / 4)))
= 1000 * (4 + (4 * floor(18 / 4)))
= 1000 * (4 + 4 * 4)
= 20,000
For index XXC, space
would be:
SpaceBytes = 1000 * (4 + 23)
= 1000 * 27
= 27,000
There is one other factor that plays into the workspace computation: whether you create the indexes in parallel or serially. If you create the indexes in parallel by using the ASYNCINDEX=YES data set option or by setting the SPDSIASY macro variable, you need to sum the space requirements of all the indexes that you create in the same create phase.
As is noted in the FAQ example about creating SPD Server indexes in parallel, "Can SPD Server create indexes in parallel?", the indexes X and Y constitute one create phase, as do COMP and COMP2. You would sum the space requirements for X and Y, and for COMP and COMP2, and take the maximum of these two sums to get the workspace needed to complete the PROC DATASETS indexes successfully.
The same applies to PROC APPEND runs when appending to a table with indexes. In this case all of the indexes are appended in parallel, so you would need to sum the workspace requirements across all of the indexes.
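The two rules above can be sketched as a pair of hypothetical helpers: for parallel creates, sum the per-index estimates within each create phase and take the maximum across phases; for appends, sum across all indexes. The per-index byte figures would come from the SpaceBytes formula given earlier.

```python
def datasets_workspace(phases):
    """Workspace for parallel index creation with PROC DATASETS.

    phases: list of create phases, each a list of per-index byte estimates.
    Indexes in the same phase are built together, so their needs are summed;
    phases run one after another, so only the largest sum matters.
    """
    return max(sum(phase) for phase in phases)

def append_workspace(per_index_bytes):
    """PROC APPEND appends all indexes in parallel, so sum them all."""
    return sum(per_index_bytes)
```

With the X, XC, and XXC estimates above, appending 1000 rows to the indexed table would need 12000 + 20000 + 27000 = 59000 bytes of workspace.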