Usage Note 57630: Overwriting a large SAS® file can require more time than writing a copy of the same file
Noticeable latency can occur when you attempt to overwrite a very large data set (hundreds of gigabytes).
The problem occurs when:
- the system file cache (the portion of RAM where SAS pages all of its Read and Write operations to and from storage) is so large that the entire data set can fit into the cache, and
- the system is quiet, so that there is no contention from other processes.
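To judge whether a data set is in the size range described above, you can look up its file size from within SAS. The following is a minimal sketch, assuming the data set is WORK.NEWDS as in the example in this note; it queries the DICTIONARY.TABLES view. Comparing the result with the size of the system file cache has to be done with operating system tools, which this sketch does not cover.

```sas
/* Sketch: report the on-disk size of WORK.NEWDS.                     */
/* The library and member names are assumptions for illustration;    */
/* DICTIONARY.TABLES stores both in uppercase.                        */
proc sql;
   select libname,
          memname,
          filesize format=sizekmg10.2 label='File size'
      from dictionary.tables
      where libname = 'WORK'
        and memname = 'NEWDS';
quit;
```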
This example code demonstrates the scenario:
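/* Step 1: create newds as a copy of the large data set */
data newds;
   set multi_gig_ds;
run;

/* Step 2: overwrite newds with itself ("in place") */
data newds;
   set newds;
run;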
In this example, multi_gig_ds is a data set whose size is 200GB,
and the system file cache is larger than 200GB.
The first DATA step creates a new data set (newds) by
setting an existing data set (the 200GB multi_gig_ds).
The second DATA step updates the data set newds "in place."
That is, newds is overwritten.
Although the two DATA steps appear to be performing the same type of operation,
there is a critical difference.
Because the second DATA step is updating a file in place, it must create a
.lck file called newds.sas7bdat.lck in the WORK directory. That file will
eventually be renamed newds.sas7bdat.
The file newds.sas7bdat.lck cannot be committed to disk and closed as
newds.sas7bdat until all the data associated with the original SAS data set
newds.sas7bdat (referenced in the SET statement) has been completely flushed
from the file cache (that is, from RAM).
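To watch the intermediate .lck file while the second DATA step runs, it helps to know where the WORK library is located. A minimal sketch using the PATHNAME function follows. Listing that directory from an operating system shell in a separate session is an assumption about how you might inspect it, not a procedure this note prescribes.

```sas
/* Write the physical path of the WORK library to the SAS log. */
%put NOTE: WORK directory is %sysfunc(pathname(work));
```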
It is important to understand that when a data set is overwritten, as in the
second DATA step, the exact same pages of the original newds.sas7bdat residing
in the host cache are not simply being updated.
Rather, those pages are used to create a copy of the file: a new locked file
called newds.sas7bdat.lck.
Only after the pages from the old file with the same name are completely
flushed from the system file cache, and the original file is deleted from
storage, can the new newds.sas7bdat file be committed to disk.
When the file is hundreds of gigabytes in size, as in this example scenario, and most of its pages reside in the host system file cache, this flush can take a considerable amount of time.
The page flush daemons must empty the original file from cache, and on storage
the original file must be replaced by the intermediate file newds.sas7bdat.lck
before the new file can be closed and committed to storage.
The delay is commensurate with the size of the file and how many of its pages reside in the host cache.
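One way to see the difference on your own system is to compare the elapsed time that SAS reports for a copy and for an overwrite of the same data set. The following is a minimal sketch, assuming newds already exists as in the example in this note; newds_copy is a hypothetical name chosen for illustration. The FULLSTIMER system option writes real time and CPU time for each step to the log.

```sas
/* Report detailed timing (real time, CPU time) for each step. */
options fullstimer;

/* Copy: write the contents of newds to a file with a new name. */
/* newds_copy is a hypothetical name.                           */
data newds_copy;
   set newds;
run;

/* Overwrite: write the contents of newds back to the same name. */
data newds;
   set newds;
run;
```

Compare the "real time" values in the log for the two steps. A noticeable gap is expected only under the conditions listed at the start of this note.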
If your situation matches the one described above, you can avoid this problem by giving extremely large
files a new name, rather than overwriting them.
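Applied to the example in this note, the workaround might look like the following minimal sketch. The name newds_v2 is hypothetical; any name other than that of the input data set serves the same purpose.

```sas
/* Write the result under a new name instead of overwriting newds. */
/* newds_v2 is a hypothetical name chosen for illustration.        */
data newds_v2;
   set newds;
run;
```

Later steps would then read newds_v2 rather than newds.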
Operating System and Release Information
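Product Family | Product | System | Reported Release | Fixed Release*
SAS System | Base SAS | 64-bit Enabled AIX | 9.4 TS1M1 |
SAS System | Base SAS | 64-bit Enabled Solaris | 9.4 TS1M1 |
SAS System | Base SAS | HP-UX IPF | 9.4 TS1M1 |
SAS System | Base SAS | Linux for x64 | 9.4 TS1M1 |
SAS System | Base SAS | Solaris for x64 | 9.4 TS1M1 |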
* For software releases that are not yet generally available, the Fixed
Release is the software release in which the problem is planned to be fixed.
If you attempt to overwrite a very large data set, and that data set fits into the system file cache, the step might require more time than writing a copy of that same large file.
Date Modified: | 2016-02-10 19:40:21 |
Date Created: | 2016-02-10 17:56:16 |