Tuning Tips for Applications That Use SAS/SHARE Software
Now that you understand how SAS and SAS/SHARE software use files and computer resources, it's time to apply that knowledge to the design and implementation of your applications.
The most productive way to optimize the performance of your application is to program it to work as efficiently as possible. You can almost always realize more performance improvement by coding your application to exploit features of SAS than by adjusting the operation of SAS.
When you decide to adjust SAS to operate differently, remember that tuning is a balancing act and invariably requires compromise. Of course, to effectively tune SAS you must understand what your application's bottlenecks are.
This section first lists some programming techniques that are based on the information presented earlier in this paper. It then describes the tuning options of SAS/SHARE software and SAS.
Programming Techniques
The most obvious way to reduce the amount of work done by a server is eliminating unused variables and observations from the files that are accessed through the server. To make sure that your files are no larger than they need to be, periodically remove or archive unused data.
As a SAS data file matures, users add new observations, update existing observations, and forget about old observations. In most cases the level of activity is greatest on the newest observations. If the users of your application do not frequently access older information, consider moving older observations from files that are concurrently updated to archive files that are accessed directly (instead of through a server).
Also as a SAS data file matures, new variables are added, some variables turn out to be larger than they need to be, and some variables lose their usefulness. Periodically check your application's SAS data files for variables that are longer than they need to be and for variables that are no longer used.
While compressing a SAS data file reduces the number of pages in it, compression cannot be as efficient at eliminating unused space as you can be by deleting unused observations and variables and by shortening variables that are longer than necessary.
Smaller data files improve the performance of all SAS sessions by reducing the amount of disk space required by each file, by reducing the number of I/O operations required to process the data in each file, and by reducing the number and size of messages required to transmit the data in a file between a server and its users.
Creating a subset of the observations in a SAS file can consume large amounts of the I/O and messages resources. There are several subsetting techniques available in SAS:
When an index is not used to directly locate the observations that satisfy a WHERE clause, the process in the server's session must read observations from the data file until it finds one that matches the WHERE clause. This can consume a very large amount of the I/O and CPU resources. Those resource requirements can be greatly reduced when the variables in the WHERE clause are indexed.
The subsetting IF statement in the DATA step and the FIND, SEARCH, and LOCATE commands in SAS/FSP procedures perform the specified comparison in the user's SAS session instead of in a process in a server. This requires that every observation in the SAS data set be transmitted from the server's session to the user's session, which can consume a very large amount of the messages resource, in addition to the I/O and CPU resources required to read the data file. Because the comparisons of a WHERE clause are performed in the server's session, only the selected observations are transmitted to the user's session and the messages resource is conserved.
The I/O resource consumption is the same for an unoptimized WHERE clause, the subsetting IF statement, and FSP's FIND, SEARCH, and LOCATE commands. Using WHERE clauses is recommended, however, because the subsetting IF statement and the FIND, SEARCH, and LOCATE commands consume more of the messages resource.
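To make the difference concrete, here is a sketch of the two approaches (MYLIB.SALES and AMOUNT are hypothetical names). The WHERE statement is evaluated in the server's session, so only matching observations are transmitted; the subsetting IF is evaluated in the user's session, so every observation must be transmitted first:

/* WHERE: evaluated in the server's session; only matching */
/* observations cross the network to the user's session.   */
DATA WORK.BIG;
   SET MYLIB.SALES;
   WHERE AMOUNT > 1000;
RUN;

/* Subsetting IF: every observation is transmitted to the */
/* user's session and tested there.                       */
DATA WORK.BIG;
   SET MYLIB.SALES;
   IF AMOUNT > 1000;
RUN;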
Indexing is a tool that optimizes WHERE clause selection of observations from SAS data sets. A WHERE clause without an index requires the process in the server to read every observation in a SAS data set to find the observations that match the WHERE selection criteria. An index often enables the server to locate the records that satisfy a WHERE clause without having to read the records that do not match.
Adding indexes might be a good idea if your application seems to be taking too long to execute WHERE clauses. However, indexes require extra memory and might present a problem for a server that is memory constrained.
A complete description of index usage can be found in the paper "Effective Use of Indexes in the SAS System," in the Proceedings of the Sixteenth Annual SAS Users Group International Conference.
When a SAS data file is accessed through a server, creating an index on it prevents the server from responding to other users' requests until the index is complete. While it can be convenient to create an index while a data file is accessed through a server, indexes on large data files should be created after hours, not while the server is expected to respond quickly to users' requests.
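For example, an index could be created during off hours with PROC DATASETS (MYLIB.SALES and CUSTID are hypothetical names):

PROC DATASETS LIBRARY=MYLIB;
   MODIFY SALES;
   INDEX CREATE CUSTID;
QUIT;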
Larger page sizes can be used to reduce the number of I/O operations required to process a SAS data file. But it takes longer to read a large page than it takes to read a small one and larger pages can increase the memory load on a server.
Large page sizes can be useful if most of the observations on each page are likely to be accessed each time the page is read into the server's memory, or if a large page size causes all or most of a SAS data file to be kept in the server's memory. Otherwise, large page sizes can increase the amount of time required to perform I/O activity in the server's SAS session to the detriment of the server's ability to provide timely response to users' requests.
It is often worth the effort to study the order in which the users of your application access the data in your application's files. That tells you how widely dispersed your users' patterns of reference are. Sometimes you can reduce the amount of dispersal by sorting one or more files by a variable (like date last updated) that happens to correlate (even to a small degree) with your users' pattern of access.
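As a sketch, assuming a hypothetical variable LASTUPD that records when each observation was last updated, a file could periodically be reorganized with PROC SORT so that recently updated observations are clustered together:

PROC SORT DATA=MYLIB.SALES;
   BY LASTUPD;
RUN;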
Here are the components of SAS that are used most frequently to access data in a random order:
the "n" (position to observation number) command of a SAS procedure
the POINT= option in the SET and MODIFY statements in the DATA step
the KEY= option in the SET and MODIFY statements in the DATA step
The SCL OPEN() function allows you to specify that a file will be sequentially accessed (the default is random access). Two types of sequential access can be requested with SCL OPEN(), as described below.
The server will by default transmit multiple observations per read for either 'IS' or 'IN' open modes.
If the application's use of data is predominantly sequential, but you occasionally need to reread a previously read observation, then use a mode of 'IN' or 'UN' in your SCL OPEN() function. If the application's use of data is strictly sequential (you will never revisit a previously read observation), then use the open mode 'IS' or 'US'. The 'IS' and 'US' open modes are the most efficient for SCL. An 'IS' or 'US' open mode, however, restricts an SCL application to the functions that access data sequentially; SCL functions that access data in a random pattern (for example, FETCHOBS() and POINT()) cannot be used with these open modes.
Specifying an access pattern in an SCL OPEN() function is documented in the OPEN function section in SAS Component Language: Reference. Here is an example of specifying a sequential access pattern in an SCL OPEN() function:
DSID = OPEN( 'MYLIB.A', 'IN' );
An open file consumes memory in both users' and servers' SAS sessions. If a server consumes too much memory, check the applications that access data through that server to see if any of them open files before they are needed or leave files open when they are not being used.
There are three strategies for using SAS data sets in an SCL program:
open during initialization of the application and leave open until the application terminates
open as needed, then leave open until the application terminates
open as needed, then close as soon as the data set is no longer needed
The initialization code of an application is the place to open the SAS data sets that will be used throughout the execution of the application. But if an application's initialization code must open a large number of files, the time it takes to get started can be long. By studying how an application is used, you might discover some SAS data sets that can instead be opened as functions are requested while the application executes, which can reduce the application's initialization time by spreading the cost of opening files across its execution.
Whether they are opened during initialization or later, lookup tables that are small should usually not be closed until an application terminates because the single I/O buffer required by such a lookup table does not require a large amount of memory. In such a case it is frequently economical to use a small amount of the memory resource to conserve the CPU resource that would be required to open and close the lookup table over and over.
Larger SAS data sets, and SAS data sets that are used extremely infrequently (for example, once during initialization) or during a seldom-used function (for example, a lookup table on a rarely updated field), should usually be left closed whenever they are not being used.
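Here is a minimal SCL sketch of the first strategy, assuming a small hypothetical lookup table MYLIB.CODES: the table is opened once in the application's INIT section and closed in its TERM section, so the open and close costs are paid only once while only a single I/O buffer of memory stays allocated.

INIT:
   /* open once; the lookup table stays open for the life of the application */
   codesid = open( 'mylib.codes', 'is' );
return;

TERM:
   /* close only when the application terminates */
   if codesid > 0 then rc = close( codesid );
return;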
Consider how frequently each of your application's reports is generated and how timely the data summarized by the report must be. If a report must be based on current information, it must be based on files that are concurrently updated. A report that does not require up-to-the-second information can be generated from files that are directly (and inexpensively) accessed instead of files that are accessed through a server.
For example, a travel agent making reservations or a stock broker making trades require every query to show up-to-the-second information. On the other hand, daily reports or analysis of long-term trends can use data that are out of date by several hours, several days, or even several weeks or months.
When copying data from a server, it can be subset horizontally with a WHERE clause, and it can be subset vertically with a DROP= or KEEP= data set option. (In relational terminology, the horizontal subsetting is selection and vertical subsetting is projection.) Be sure to take advantage of both methods when copying a file from a server to make the copy of the file as small as possible and, therefore, ensure that reports are generated as efficiently as possible.
Don't forget that files can be stored in users' WORK libraries. It can be most efficient to copy a file that is concurrently updated from a server to a user's WORK library and then use that file more than one time to generate reports. Such a copy of a file contains very timely data yet is not especially expensive to create or use.
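For example, the following sketch (with hypothetical names) copies only the needed rows and columns of a served file into the user's WORK library, where several reports can then read it directly:

/* selection via WHERE=, projection via KEEP= */
DATA WORK.SNAPSHOT;
   SET MYLIB.SALES( KEEP=REGION AMOUNT SALEDATE
                    WHERE=(SALEDATE >= '01JAN2007'D) );
RUN;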
A SAS data file that is accessed directly is almost always less costly to use than a file that is accessed through a server.
Many applications contain one or more query functions that use a lookup file to offer a user a set of values that are valid to enter into a field. Such a file is read, but never updated, by the majority of the users of the application. Occasionally, values must be added to and removed from the lookup files as the valid input data for the application changes.
A lookup file that is used frequently and updated less often than once a week is likely to be a good candidate for not being accessed through a server, if it would be easy to find some time during the week when the files can be updated because the application is not being used. On the other hand, a lookup file that is updated many times each day should, in many cases, be accessed through a server because updating the file will be convenient: the lookup file can be updated while users use it to perform queries.
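The two access methods differ only in the LIBNAME statement. Here is a sketch, assuming a hypothetical server ID SHR1:

/* direct access: cheapest, but the file cannot be updated while in use */
LIBNAME LOOKUP 'lookup-library-location';

/* access through a server: the file can be updated while users query it */
LIBNAME LOOKUP 'lookup-library-location' SERVER=SHR1;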
SAS catalog entries illustrate another way that update frequency can affect how data should be accessed.
An application might use only a few or many catalog entries. Like lookup files, catalog entries that are updated frequently are likely candidates for access through a server. But catalog entries that are never changed, or only changed very infrequently, should not be accessed through a server.
The update frequency might change for some of an application's catalog entries over time. For example, while an application is under development and being tested, it can be extremely convenient for the developers of the application to be able to update any catalog entry while those testing the application continue with their work. During this phase, the convenience of accessing the catalog entries through a server can more than pay for the cost of the overhead of server access. After the testing is completed and the application has stabilized, some or all of the application's catalogs can be moved to a SAS library that is accessed directly by the users of the application; in this phase efficient access by the users is more important than being able to update the catalog entries conveniently.
Remember that not all of an application's catalog entries must be accessed the same way. Catalog entries that must be frequently updated can continue to be accessed through a server, while other catalog entries that change very seldom can be stored in SAS catalogs that are accessed directly by the users of the application.
While it is creating each observation, a process in a server's session that is interpreting a DATA step view does not yield control to allow other processes in the server to execute other users' requests. DATA step views can be useful in a server, but they must be used carefully. A DATA step view that requires only a small amount of processing to create each observation will not prevent other processes in a server's SAS session from responding to other users' requests in a timely way. But a DATA step view that contains many DO loops with many calculations, or that reads (or even writes) many records in external files or SAS data sets, can take a very long time to create each observation; such a DATA step view should not be interpreted in a server's session.
If it is advantageous to your application for its DATA step views to be interpreted in a server's session, be sure that any external files read by the DATA step view are available to the server's SAS session.
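As an illustration, a lightweight view like the following sketch (hypothetical names) does little work per observation and is a reasonable candidate for interpretation in a server's session:

/* a simple DATA step view: minimal processing per observation */
DATA MYLIB.ACTIVE / VIEW=MYLIB.ACTIVE;
   SET MYLIB.MASTER;
   WHERE STATUS = 'A';
RUN;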
Tuning Options in SAS/SHARE Software
SAS/SHARE software makes some assumptions about the relative values of resources. For example, SAS/SHARE software considers messages to be more expensive than memory so it attempts to use more memory to reduce the number of messages. The default values and behavior might not be optimum for your application, so you have the opportunity to adjust the following:
when and in what amounts observations are transmitted in groups instead of individually
which SAS data views are interpreted in users' SAS sessions and which are interpreted in the server's SAS session
how frequently a long-running process in a server's SAS session yields to allow other users' requests to be processed
SAS/SHARE software automatically attempts to conserve the message resource by transmitting observations in groups whenever possible. Observations can always be transmitted in groups when a data set is being created or replaced, but when a data set is opened for update access it is never appropriate to transmit more than one observation at a time. The grouping of observations when a data set is opened for input depends on the situation; you control whether observations are transmitted in groups according to these factors:
whether the data set is opened for random or sequential access
the use of the TOBSNO= data set option to override the default behavior
Here are the factors that control how many observations are transmitted in each group:
The TBUFSIZE= option in the PROC SERVER statement specifies the suggested size of a buffer that the server uses for transmitting information to or receiving information from a client. When this option is not specified in the PROC SERVER statement, the value of the TBUFSIZE= SAS system option, if specified, is used. The default value is 32K.
A key use of these transmission buffers is in transmitting observations. The server uses the TBUFSIZE= value when computing the number of observations to transmit in each multi-observation transfer between the server and the client sessions. However, if the observation size, plus overhead, exceeds the TBUFSIZE= value, only single-observation transfers are done. Specifying an excessive value for TBUFSIZE= might cause your server or clients to run out of memory and terminate abnormally.
You cannot calculate the number of observations per transfer by dividing the observation length into the value that you specify for the TBUFSIZE= option. To determine the effect of this option on your data sets, use the PROC SERVER options LOG=MESSAGE and ACCTLVL=DATA and compare the number of messages exchanged between the server and the client sessions as a function of the value of the TBUFSIZE= option and the number of observations in the data set.
Here is an example of using the TBUFSIZE= option:
PROC SERVER TBUFSIZE=128K <other PROC SERVER options>;
Independently of the TBUFSIZE= option's effect on a server's overall behavior, you can control the number of observations per group for individual data sets that are accessed through the server. For example, if you specify TOBSNO=3, three observations will be sent in each message.
The TOBSNO= option can be specified wherever SAS data set options are accepted: as an argument to the OPEN() function of SAS Component Language, in the DATA= option in a SAS procedure, and in the SET, MERGE, UPDATE, and MODIFY statements in the DATA step. It must be specified for each data set for which you want grouping behavior that is different from the default.
When a data set is opened for input with a sequential access pattern, a server calculates the number of observations per group from the transmission buffer size (TBUFSIZE=) and the observation length, unless the TOBSNO= data set option specifies the group size explicitly.
When a SAS data set is opened for input with a random access pattern, the default behavior is transmitting observations individually (the group size is one). This ensures that a user always receives up-to-date data when they position to an observation, and it reduces wasted communications bandwidth because no observations are transmitted to a user's session except the specific observations requested.
At other times, the TOBSNO= data set option can be used to increase the number of observations transferred in each group. For example, consider an SCL program in which a data set identifier (DSID) is passed to the DATALISTC() or DATALISTN() function. The data set is read from beginning to end by the function, and then the observation chosen by the user is reread. Because the OPEN() function of SCL specifies a random access pattern by default, observations for that DSID are transmitted individually. But the access pattern of the DATALISTC() and DATALISTN() functions is really skip sequential, so transmitting observations individually is not optimum. TOBSNO=4 could be specified in a case like this to reduce the number of messages by three-quarters. (Note that the user could change the open mode from 'I' to 'IN' as an alternative to specifying the TOBSNO= data set option.)
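In SCL, the TOBSNO= option can be supplied with the data set name in the OPEN() call itself. Here is a sketch of the scenario above (MYLIB.A and the variable NAME are hypothetical):

/* group four observations per message for a skip-sequential function */
DSID = OPEN( 'MYLIB.A(TOBSNO=4)', 'I' );
SELECTION = DATALISTC( DSID, 'NAME' );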
Here is an example of using the TOBSNO= data set option:
PROC FSVIEW DATA=MYLIB.A(TOBSNO=10);
Consider each SAS data view used by your application and determine whether the view should be interpreted in the server's SAS session or the users' SAS sessions. You decide where to have a view interpreted according to these considerations:
Some PROC SQL views are especially good candidates for interpretation in a server's SAS session because the number of observations produced by the view is much smaller than the number of observations read by the view, the data sets read by the view are available to the server, and the amount of processing necessary to build each observation is not large.
Other PROC SQL views should be interpreted in users' SAS sessions because the number of observations produced by the view is not appreciably smaller than the number of observations read by the view, some of the data sets read by the view can be directly accessed by the users' SAS sessions, and the amount of processing done by the view is considerable.
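For example, a summary view like this sketch (hypothetical names) reads many observations but produces only a few, so it is a good candidate for interpretation in the server's session:

PROC SQL;
   CREATE VIEW MYLIB.REGTOT AS
      SELECT REGION, SUM(AMOUNT) AS TOTAL
         FROM MYLIB.SALES
         GROUP BY REGION;
QUIT;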
By default, SAS data views are interpreted in a server's SAS session, but the RMTVIEW= option in the LIBNAME statement enables you to have the views in a library interpreted in users' SAS sessions instead. The NORMTVIEW option in the PROC SERVER statement enables you to prevent all SAS data views from being interpreted in the server's session.
SAS/ACCESS views do not provide update access to the underlying database when they are interpreted in a server's session, so it is often more practical to interpret SAS/ACCESS views in users' SAS sessions.
If it is useful for your application to have a SAS/ACCESS view interpreted in a server's session, ensure that all of the necessary database interface components are available to the server's session.
If a user's SAS session is capable of using a SAS/ACCESS interface engine to access the underlying database, it is more efficient to execute the SAS/ACCESS interface engine in the user's SAS session. Note that in this case it might be convenient to store the view file in a SAS library that is accessed through a server if the view will be updated frequently and used by more than one user.
Like SAS/ACCESS views, DATA step views are very often most useful when interpreted in users' SAS sessions. For more information about interpreting DATA step views in a server's session, see Know Your Application's DATA Step Views.
For a complete description of the RMTVIEW= option in the LIBNAME statement, see Remote Library Services.
Here are some examples of specifying the RMTVIEW= and NORMTVIEW options:
LIBNAME MYLIB 'my SAS library' RMTVIEW=YES <other LIBNAME options>;
PROC SERVER NORMTVIEW <other PROC SERVER options>;
Some components of SAS yield control periodically and can be directed to do so more or less frequently than their default rate. These components are called long-running processes and include evaluating WHERE clauses and interpreting PROC SQL views.
Changing the rate at which control is yielded is delicate because the act of yielding control consumes some CPU resource: increasing the frequency at which control is yielded increases a server's CPU consumption all by itself. You can change the rate at which the processes in a server yield control by varying the value of the PROC SERVER option LRPYIELD=. The default value of this option is 10,000; the option has no units.
To make long-running processes yield relatively more frequently, specify a value greater than 10,000. While a higher value might have the effect of providing more even response time to a server's users, this comes at the expense of increased consumption of the server's CPU resource. Also, the processes that run for a long time run even longer when they are asked to yield more frequently.
To make a long-running process yield less frequently, specify a value smaller than 10,000. A lower LRPYIELD= value might make some individual user requests (like an SQL join with a sort) complete sooner, but the server's other users are forced to wait as the long-running process does more work before it yields control. Response time can become more uneven when long-running processes are allowed to yield less frequently.
This option is documented in The SERVER Procedure.
Here is an example of specifying the LRPYIELD= option:
PROC SERVER LRPYIELD=5000 <other PROC SERVER options>;
Running multiple servers is not an option that you specify in a SAS program statement; rather, it is a method of managing the workload of concurrent access to SAS data sets.
If you determine that a server is consuming too much of a resource and you cannot reduce the server's consumption of that resource any further, creating an additional server allows you to divide your applications' workload among several servers.
SAS/SHARE software includes a family of SAS macros that help you manage SAS file access through multiple servers. Those macros are documented in SAS/SHARE Macros for Server Access.
SAS System Options
SAS provides several I/O tuning options. These options are most relevant to applications that access data through a server:
When a file is created, use the BUFSIZE= data set option to specify the size of the pages of the file. The SAS default page size is optimum for files that are processed sequentially, but it might not be optimum when the observations of a file are accessed in random order. PROC CONTENTS shows the page size of a SAS data file.
You might find it useful to balance the pattern in which a file is randomly accessed against the number of observations stored on each page of the file. If most random access sequences access observations in very different locations in the file, then a small page size will improve performance because most of the observations on each page are not used. On the other hand, if most random access sequences are likely to be to observations that are physically near each other in the file, you might be able to take advantage of a large page size to have many of the observations read from the file into the server's memory at once.
If you want to keep all or most of a SAS data file in memory, you can choose a very large page size. Of course, this can consume a lot of the server's memory, so choose such a large page size only when keeping the file in memory is worth that cost. If you expect that not much data from a large file will need to be in memory at one time, choose a small page size to make reading and writing each page as fast as possible.
If you find that your server is spending a significant amount of time waiting for I/O operations to complete, consider recreating the files that are not used for sequential access with a smaller page size.
Here is an example of using the BUFSIZE= data set option:
DATA MYLIB.A(BUFSIZE=6K);
   SET MYLIB.A;
RUN;
The COMPRESS= data set option causes a SAS data file to be stored in compressed format. Compressing files usually makes them smaller, so a server is able to read more observations per I/O request when a file is compressed. The reduction in file size for a compressed file (which conserves the I/O resource) is paid for, though, by an increase in the consumption of the CPU resource (which is used to decompress the observations as the file is read). If your server is CPU-bound, compression could do more harm than good, but if your server is I/O-bound, compression could reduce its consumption of the I/O resource.
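Here is an example of recreating a file in compressed format, following the pattern of the BUFSIZE= example above:

DATA MYLIB.A(COMPRESS=YES);
   SET MYLIB.A;
RUN;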
Copyright © 2007 by SAS Institute Inc., Cary, NC, USA. All rights reserved.