Understanding the Compression Attribute of a SAS Data Set

Overview

The compression attribute of a SAS data set is set with the COMPRESS= option when you create the file, and it is a permanent attribute of the data set. To change the compression attribute of a file, you have to re-create the file. When compression is used for a data set, SAS compresses only the observations, not the entire file. The attribute affects only the observations of the data set; it does not affect procedures.
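
As a minimal sketch of the point above, the following code sets the attribute at creation time and then re-creates the data set to change it. The data set names WORK.CLASS_C and WORK.CLASS_U are hypothetical, and the example assumes that the SASHELP.CLASS sample data set is available in your installation.

```sas
/* The attribute is set when the data set is created. */
data work.class_c (compress=yes);
   set sashelp.class;
run;

/* To change the attribute, re-create the data set   */
/* with a different COMPRESS= value.                 */
data work.class_u (compress=no);
   set work.class_c;
run;
```

Running PROC CONTENTS on each data set should show the compression setting that was requested for it, unless the engine disables compression as described later in "Factoring in Overhead".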

When the I/O engine delivers an observation from disk to a procedure such as APPEND or SQL, the observation is delivered in an uncompressed format by default. After the procedure is finished with the observation and is ready to update or add to the data set, the procedure returns the observation to the engine. If the compression attribute of the data set is set to YES, then the engine compresses the observation again before placing it back into the file.

You can add to or update an existing compressed file from an uncompressed file and vice versa. The original setting of the compression attribute for the target file determines whether the I/O engine compresses the observations of the source file before the engine places them into the target file.
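
The following sketch illustrates that behavior with the APPEND procedure. The data set names WORK.TARGET and WORK.SOURCE and the variable NAME are hypothetical and are used only for illustration.

```sas
/* Target data set: created with compression. */
data work.target (compress=yes);
   length name $ 200;
   name = 'first';
run;

/* Source data set: created without compression. */
data work.source (compress=no);
   length name $ 200;
   name = 'second';
run;

/* The target keeps its own compression attribute, so the     */
/* engine compresses the appended observation as it is added. */
proc append base=work.target data=work.source;
run;

/* Check the compression attribute that is reported for the target. */
proc contents data=work.target;
run;
```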

Determining the Effectiveness of Compression

After the data set is initially created, you can determine whether compression is effective for your data set. For example, given that original-compressed-data-set is your compressed data set, follow these steps to see the difference in the total byte size between the compressed file and an uncompressed version of the file:

  1. Run the following code to create an uncompressed file from the compressed file:
       data new_data (compress=no);
          set original-compressed-data-set;
       run;
  2. Run the CONTENTS procedure for both data sets:
       proc contents data=data-set-name;
       run;
  3. In the output of each data set, multiply the Data Set Page Size that is displayed by the Number of Data Set Pages that is displayed. This number provides a close approximation of the total number of bytes required for each data set. Compare these two numbers to determine whether compression is effective for your situation.
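
The multiplication in step 3 can also be sketched in code. This example assumes that the BUFSIZE and NPAGE columns of DICTIONARY.TABLES correspond to the Data Set Page Size and Number of Data Set Pages values that PROC CONTENTS displays; verify this in your release before relying on it. WORK.ORIGINAL_DATA and the variables ID and COMMENT are hypothetical stand-ins for your own compressed data set.

```sas
/* Hypothetical compressed data set that stands in for your own. */
data work.original_data (compress=yes);
   length comment $ 200;
   do id = 1 to 10000;
      comment = 'short text';
      output;
   end;
run;

/* Uncompressed copy, as in step 1. */
data work.new_data (compress=no);
   set work.original_data;
run;

/* Approximate total bytes = page size * number of pages. */
proc sql;
   select memname,
          compress,
          bufsize,
          npage,
          bufsize * npage as approx_bytes format=comma20.
      from dictionary.tables
      where libname = 'WORK'
        and memname in ('ORIGINAL_DATA', 'NEW_DATA');
quit;
```

Compare the APPROX_BYTES values for the two rows in the same way that you would compare the hand-calculated numbers.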

Alternatively, you can use a file-listing command of your operating environment (LISTFILE, for example) to generate the total byte size for the compressed data set and the uncompressed data set.

Factoring in Overhead

Each observation in a compressed data set requires a minimum amount of overhead. The overhead varies for each operating environment, but is typically between 8 and 12 bytes per observation.

In the following code example, the I/O engine flips the compression setting back to COMPRESS=NO because the size of the record (8 bytes for X, a double-precision numeric variable) plus 8 bytes of overhead is 16 bytes, which is twice the size of the uncompressed record:

   data a(compress=yes);
      x=1;
      output;
   run;

In this instance, the compressed data set might require more space than the same data set left uncompressed.

For compression to be effective, you need a sufficiently large record length. From the point of view of the I/O engine, determining whether to compress a file is strictly a disk space trade-off. The engine does not account for the CPU cost because it cannot weigh the user's CPU cost against the disk cost.
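
To see the effect of record length, you can contrast a narrow record with a wide one, as in the following sketch. The data set names WORK.NARROW and WORK.WIDE and the variables in them are hypothetical. The SAS log normally notes whether compression was disabled or how the size of the data set changed, but the exact wording and numbers depend on your release and operating environment.

```sas
/* Narrow record: a single 8-byte numeric variable. */
data work.narrow (compress=yes);
   do x = 1 to 1000;
      output;
   end;
run;

/* Wide record: a 500-byte character variable that is mostly blanks. */
data work.wide (compress=yes);
   length note $ 500;
   do id = 1 to 1000;
      note = 'mostly blank';
      output;
   end;
run;

/* Compare the compression setting and the page counts that are reported. */
proc contents data=work.narrow;
run;

proc contents data=work.wide;
run;
```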

Considering Costs

Some organizations charge their users for computer usage based on CPU time, disk storage, or both. These charges vary among organizations even when the same hardware is being used, and each hardware environment has its own balance of CPU costs and disk costs. Thus, the proper setting for the COMPRESS= option cannot be predicted programmatically for all cases. Because hardware costs are continually being reduced, the multitude of variables associated with cost cannot be maintained through a program. You must decide whether to compress your data sets based on the costs in your individual circumstances.

Along with the trade-off between CPU costs and disk costs described previously, cache has also become a factor to consider when you determine costs. Cache has increased in size over time, and operating environments have improved to handle cache better. If your CPU costs and disk costs are equal, then you might compress your files so that your jobs run faster.

Compressing and uncompressing records adds some CPU overhead. However, if your cache can hold the entire compressed data set, then that overhead is minimal. This cost can be much less than the cost of rolling parts of an uncompressed file into and out of the cache when the data set is too large to fit.

Summary

The compression attribute of a SAS data set can help you save space, time, and money as you work with your data sets. When you consider whether to compress your files, you should factor in the length of your observations, the costs of computer usage, and the running time of your jobs. Your individual circumstances will help you to decide whether compression is appropriate for your SAS data sets.


Your Turn

The developers, testers, and writers who bring you Base SAS software are very excited about the potential of these capabilities of the SAS System. You can send email to Base.Research@sas.com with your comments.