Optimizing System Performance
The information that is presented in this section applies to reading and writing SAS data sets. In general, the larger your data sets, the greater the potential performance gain for your entire SAS job. The performance gains that are described here were observed on data sets of approximately 100,000 blocks.
Allocating Data Set Space Appropriately
SAS initially allocates enough space for 10 pages of data for a data set. Each time the data set is extended, another five pages of space are allocated on the disk. OpenVMS maintains a bitmap on each disk that identifies the blocks that are available for use. When a data set is written and then extended, OpenVMS alternates between scanning the bitmap to locate free blocks and actually writing the data set. However, if a data set is written with larger initial and extent allocations, write operations proceed uninterrupted for longer periods of time. At the hardware level, this means that disk activity is concentrated on the data set, and disk head seek operations that alternate between the bitmap and the data set are minimized. The user sees fewer I/Os and faster elapsed times.
Large initial and extent values can also reduce disk fragmentation. SAS data sets are written using the RMS algorithm "contiguous best try." With large preallocation, the space is reserved to the data set and does not become fragmented as it does when inappropriate ALQ= and DEQ= values are used.
SAS recommends setting ALQ= to the size of the data set to be written. If you are uncertain of the size, underestimate and use DEQ= for extents. Values of DEQ= larger than 5000 blocks are not recommended. For information about predicting data set size, see Estimating the Size of a SAS Data Set under OpenVMS.
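As a rough worked estimate (an illustration under stated assumptions, not a formula from this documentation): assuming 512-byte OpenVMS disk blocks, a 32,768-byte page size, and ignoring page header overhead, the 26-variable data set used in the examples in this section would need approximately:

```
observation length   26 variables x 200 bytes   =   5,200 bytes
observations/page    floor(32,768 / 5,200)      =       6
pages                ceil(13,000 / 6)           =   2,167
512-byte blocks      2,167 x (32,768 / 512)     = 138,688
```

An ALQ= value in this neighborhood would allow the data set to be written with no extents; underestimating and letting DEQ= cover the remainder, as recommended above, is also reasonable.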
The following is an example of using the ALQ= and DEQ= options:
libname x '[]';

/* Know this is a big data set. */
data x.big (alq=100000 deq=5000);
   length a b c d e f g h i j k l m
          n o p q r s t u v w x y z $200;
   do ii=1 to 13000;
      output;
   end;
run;
Note: If you do not want to specify an exact number of blocks for the data set, use the ALQMULT= and DEQMULT= options.
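If you prefer to size the allocation in data set pages rather than disk blocks, ALQMULT= and DEQMULT= scale the default allocations described earlier (10 pages initially, five pages per extent). A minimal sketch; the page counts here are illustrative values, not recommendations from this documentation:

```sas
libname x '[]';

/* Allocate 100 pages initially and extend by 20 pages at a time. */
/* (Illustrative page counts; scale them to your own data set.)   */
data x.big (alqmult=100 deqmult=20);
   length a b c d e f g h i j k l m
          n o p q r s t u v w x y z $200;
   do ii=1 to 13000;
      output;
   end;
run;
```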
Turning Off Disk Volume High-Water Marking
High-water marking is an OpenVMS security feature that is enabled by default. It forces prezeroing of disk blocks for files that are opened for random access. All SAS data sets are random access files and, therefore, pay the performance penalty of prezeroing, increased I/Os, and increased elapsed time.
Two DCL commands can be used independently to disable high-water marking on a disk. When initializing a new volume, use the /NOHIGHWATER_MARKING qualifier to disable the high-water function, as in the following example:
$ initialize/nohighwater $DKA470 mydisk
To disable volume high-water marking on an active disk, use a command similar to the following:
$ set volume/nohighwater $DKA200
Eliminating Disk Fragmentation
Any software that reads from and writes to disk benefits from a well-managed disk, and SAS data sets are no exception. On an unfragmented disk, files are kept contiguous; thus, after one I/O operation, the disk head is well positioned for the next.
Storing commonly accessed SAS data sets on a frequently defragmented disk can therefore improve performance. In some situations, adding an inexpensive SCSI drive to the configuration allows the system manager to maintain a clean, unfragmented environment more easily than managing a large disk farm. Data sets kept on an unfragmented SCSI disk might perform better than heavily fragmented data sets on larger disks.
To defragment a disk, use the OpenVMS backup utility to perform an image backup of the disk after regular business hours, when disk activity is likely to be minimal. The following command sequence creates a defragmented copy of the source disk on the destination disk:
$ mount/foreign 'destination-disk'
$ backup/image 'source-disk' 'destination-disk'
When the image backup operation is complete, dismount the destination disk and remount it using a normal mount operation (without the /FOREIGN qualifier) so that the disk can be used again for I/O operations. SAS does not recommend the use of dynamic defragmenting tools that run in the background of an active system because such programs can corrupt files.
For more information, see the HP OpenVMS System Manager's Manual and the HP OpenVMS DCL Dictionary.
Setting Larger Buffer Size for Sequential Write and Read Operations
The BUFSIZE= data set option sets the SAS internal page size for the data set. Once set, this becomes a permanent attribute of the file that cannot be changed. This option is meaningful only when you are creating a data set. If you do not specify a BUFSIZE= option, SAS selects a value that contains as many observations as possible with the least amount of wasted space.
An observation cannot span page boundaries, so unused space can occur at the end of a page unless the observations pack evenly into it. By default, SAS chooses a page size between 8192 and 32768 bytes if an explicit BUFSIZE= value has not been specified. If you increase the BUFSIZE= value, more observations can be stored on each page, and the same amount of data can be accessed with fewer I/Os. When choosing a BUFSIZE= value explicitly, pick one that leaves little unused space at the end of each data set page. The highest recommended value for BUFSIZE= is 65024.
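For the 26-variable data set used in the examples in this section, a quick check (ignoring SAS page header overhead; the arithmetic is illustrative) shows why the smaller of two large page sizes can waste less space:

```
observation length : 26 x 200 = 5,200 bytes
BUFSIZE=65024      : floor(65,024 / 5,200) = 12 observations, 2,624 bytes unused per page
BUFSIZE=63488      : floor(63,488 / 5,200) = 12 observations, 1,088 bytes unused per page
```

Both page sizes hold 12 observations, so the smaller BUFSIZE= value stores the same data with less unused space on every page.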
The following is an example of an efficiently written large data set, using the BUFSIZE= data set option. Note that in the following example, BUFSIZE=63488 becomes a permanent attribute of the data set:
libname buf '[]';

data buf.big (bufsize=63488);
   length a b c d e f g h i j k l m
          n o p q r s t u v w x y z $200;
   do ii=1 to 13000;
      output;
   end;
run;
For each SAS file that you open, SAS maintains a set of caches to buffer the data set pages. The size of each of these caches is controlled by the CACHESIZE= option. The number of caches used for each open file is controlled by the CACHENUM= option. The ability to maintain more data pages in memory potentially reduces the number of I/Os that are required to access the data. The number of caches that are used to access a file is a temporary attribute. It might be changed each time you access the file.
By default, up to 10 caches are used for each SAS file that is opened, and each cache is CACHESIZE= bytes in size. On a memory-constrained system, you might want to reduce the number of caches to conserve memory.
The following example uses the CACHENUM= option to specify that eight caches of 65024 bytes each are used to buffer data pages in memory:
proc sort data=cache.big (cachesize=65024 cachenum=8);
   by serial;
run;
SAS maintains a cache that is used to buffer multiple data set pages in memory. This reduces I/O operations by enabling SAS to read or write multiple pages in a single operation. SAS maintains multiple caches for each data set that is opened. The CACHESIZE= data set option specifies the size of each cache.
The CACHESIZE= value is a temporary attribute that applies only to the data set that is currently open. You can use different CACHESIZE= values at different times when accessing the same file. To conserve memory, a maximum of 65024 bytes is allocated for the cache by default. The default allows as many pages as can be completely contained in the 65024-byte cache to be buffered and accessed with a single I/O.
Here is an example that uses the CACHESIZE= data set option to write a large data set efficiently. Note that in the following example, the CACHESIZE= value is not a permanent attribute of the data set:
libname cache '[]';

data cache.big (cachesize=65024);
   length a b c d e f g h i j k l m
          n o p q r s t u v w x y z $200;
   do ii=1 to 13000;
      output;
   end;
run;
Using Asynchronous I/O When Processing SAS Data Sets
Job type: User
Usage: The BASE engine now performs asynchronous reading and writing by default. This allows overlap between SAS data set I/O and computation time. Note: Asynchronous reading and writing is enabled only if caching is turned on.
Benefit: Asynchronous I/O allows other processing to continue while SAS is waiting for I/O completion. If there is a large gap between the CPU time used and the elapsed time reported in the FULLSTIMER statistics, asynchronous I/O can help reduce that gap.
Cost: Because data page caching must be in effect, the memory usage of the I/O cache must be incurred. For more information about controlling the size and number of caches used for a particular SAS file, see CACHENUM= Data Set Option: OpenVMS and CACHESIZE= Data Set Option: OpenVMS.
Asynchronous I/O is enabled by default; no additional options need to be specified to use this feature. For all SAS files that use a data cache, SAS performs asynchronous I/O. Because multiple caches are available for each SAS file, SAS can continue processing with the other caches while an I/O operation is in progress on one of them. For example, when SAS writes to a file, an asynchronous write is initiated on the first cache once it becomes full, but SAS does not have to wait for that write to complete. While the transaction is in progress, SAS can continue processing new data pages and store them in one of the other available caches; when that cache fills, an asynchronous write can be initiated on it as well.
Similarly, when SAS reads a file, additional caches of data can be read from the file asynchronously in anticipation of those pages being requested by SAS. When those pages are required, they will have already been read from disk, and no I/O wait will occur.
Because asynchronous I/O depends on caching with multiple caches, no asynchronous I/O can occur if the cache is disabled with the CACHESIZE=0 option or the CACHENUM=0 option.
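For example, you can disable the cache, and with it asynchronous I/O, on a per-step basis. A minimal sketch, assuming the cache.big data set created in the earlier examples:

```sas
/* CACHESIZE=0 disables the data cache for this step, so no  */
/* caching, and therefore no asynchronous read-ahead, occurs */
/* while the file is read.                                   */
proc means data=cache.big (cachesize=0);
run;
```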
Copyright © 2009 by SAS Institute Inc., Cary, NC, USA. All rights reserved.