The COMPRESS= option is both a SAS system option and a data set option that can be used to minimize the disk space required to store a SAS data set. The COMPRESS= option performs a sophisticated check for patterns among data elements and then uses run-length encoding (RLE) or Ross Data Compression (RDC) to represent those patterns in less space than was required by the original data.
COMPRESS=YES or COMPRESS=CHAR uses RLE to encode repeating single characters in an observation. COMPRESS=BINARY uses RDC to encode repeating patterns of multiple characters in an observation. You should be aware that, regardless of the value used in the COMPRESS= option, all compressed numeric variables are expanded into 8 bytes in the program data vector prior to being used in computations in the DATA or PROC steps. So, although compression saves disk space, it requires additional CPU time.
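As a brief illustration of the two ways to specify the option (the library and data set names here are hypothetical), COMPRESS= can be set globally as a system option or applied to a single data set as a data set option:

```sas
/* Set compression globally for every data set created in this session */
options compress=binary ;

/* Or compress only one data set; MYLIB.PACKED and MYLIB.UNPACKED
   are hypothetical names used for illustration */
data mylib.packed ( compress=char ) ;
   set mylib.unpacked ;
run ;
```

The data set option overrides the system option for that data set, so the two forms can be mixed in one program.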
The lengths of the data elements are not changed when compression is specified. The default length of a numeric variable is 8 bytes. To further reduce disk space, you can specify minimum lengths for numeric variables. For example, in SAS for Windows, a numeric variable with a length of 3 bytes can represent integer values up to 8,192 exactly. Let's assume that the maximum value of the variable is less than 8,192. If the default length of 8 bytes is used, 5 of the 8 bytes are filled with zeros and are unused. This wastes disk space in every observation.
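A LENGTH statement placed before the SET statement stores such a variable in fewer bytes. This minimal sketch assumes a hypothetical input data set ORIGINAL whose variable SMALL_INT is known never to exceed 8,192:

```sas
data squeezed ;
   length small_int 3 ;   /* 3 bytes suffice for integers up to 8,192 */
   set original ;         /* ORIGINAL is a hypothetical input data set */
run ;
```

The LENGTH statement must precede the SET statement; otherwise the variable keeps the 8-byte length it has in the input data set.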
For large data sets, substantial savings of disk space might accrue if numeric variables are specified in the LENGTH statement to be only as long as necessary for the maximum value that will be stored in them. Judicious application of the TRUNC function can determine the minimum number of bytes required to represent a variable's value without loss of accuracy. The following table shows the relationship between the length of a numeric variable and the largest integer value that it can represent exactly at that length.
Significant Digits and Largest Integer by Length for SAS Variables under Windows

    Length in Bytes   Largest Integer Represented Exactly   Exponential Notation
    3                 8,192                                 2^13
    4                 2,097,152                             2^21
    5                 536,870,912                           2^29
    6                 137,438,953,472                       2^37
    7                 35,184,372,088,832                    2^45
    8                 9,007,199,254,740,992                 2^53
It is important to note that only integer values are represented exactly. A real number with a fractional part can be represented in fewer than 8 bytes, but it might lose some of its accuracy due to truncation. Therefore, the "squeezing" technique discussed here may be used on integer-valued variables, but should not be used on real-valued variables, or your data might be rendered inaccurate by the truncation process. In the code fragment that follows, the integer numeric variable a is squeezed to the minimum number of bytes required to exactly represent all the values contained in it by using the TRUNC function.
    data test ;
       a = 2001 ;
       if trunc( a, 7 ) ne a then length_a = 8 ;
       else if trunc( a, 6 ) ne a then length_a = 7 ;
       else if trunc( a, 5 ) ne a then length_a = 6 ;
       else if trunc( a, 4 ) ne a then length_a = 5 ;
       else if trunc( a, 3 ) ne a then length_a = 4 ;
       else length_a = 3 ;
       put a= length_a= ;
    run ;
The result posted to the log file is:
    a=2001 length_a=3
Also, if a <= 8192, then the minimum number of bytes needed to represent a exactly is 3, but if a = 8193, then length_a = 4. What a difference a small change makes!
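That boundary can be seen directly by running the same TRUNC test on both values. This is a minimal sketch that checks only the 3-byte case:

```sas
data _null_ ;
   do a = 8192, 8193 ;
      /* 8192 = 2^13 survives truncation to 3 bytes; 8193 does not */
      if trunc( a, 3 ) = a then length_a = 3 ;
      else length_a = 4 ;
      put a= length_a= ;
   end ;
run ;
```

The log should show length_a=3 for 8192 and length_a=4 for 8193, matching the table above.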
The code to compute the minimum number of bytes that are required for an arbitrary SAS data set is contained in the %SQUEEZE macro. %SQUEEZE finds the minimum length required to preserve the precision of a SAS integer numeric variable.
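The SAS-supplied %SQUEEZE macro is more elaborate than what is shown here, but its core idea can be sketched as follows. This is a simplified illustration, not the actual macro; the macro name, parameters, and the assumption of at most 200 numeric variables are all hypothetical:

```sas
%macro squeeze_sketch( in=, out= ) ;
   /* Pass 1: for each numeric variable, find the smallest length
      (3 to 8 bytes) that represents every value exactly */
   data _null_ ;
      set &in end=last ;
      array nums {*} _numeric_ ;
      array minlen {200} _temporary_ ;     /* assumes <= 200 numeric variables */
      do i = 1 to dim(nums) ;
         if minlen{i} = . then minlen{i} = 3 ;
         do while ( minlen{i} < 8 and trunc( nums{i}, minlen{i} ) ne nums{i} ) ;
            minlen{i} = minlen{i} + 1 ;
         end ;
      end ;
      if last then do ;
         /* Build a LENGTH statement for all numeric variables */
         length stmt $ 32767 ;
         stmt = 'length' ;
         do i = 1 to dim(nums) ;
            stmt = catx( ' ', stmt, vname(nums{i}), put(minlen{i}, 1.) ) ;
         end ;
         call symputx( 'lenstmt', stmt ) ;
      end ;
   run ;

   /* Pass 2: rewrite the data set with the squeezed lengths */
   data &out ;
      &lenstmt ;
      set &in ;
   run ;
%mend squeeze_sketch ;
```

A hypothetical invocation would be %squeeze_sketch( in=sample.dmdcens, out=work.squeezed ). Real-valued variables naturally end up at 8 bytes because they fail the TRUNC test at every shorter length, so no precision is lost.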
The following code snippet assigns a libref to the sample library and runs PROC CONTENTS on the census data set:

    libname sample 'C:\Program Files\SAS Institute\SAS\V8\dmine\sample' ;

    proc contents data=sample.dmdcens ;
    run ;
The output from PROC CONTENTS before using the %SQUEEZE macro is:
    -----Alphabetic List of Variables and Attributes-----

     #   Variable   Type   Len   Pos   Format
     1   age        Num      8     0
     4   cap_gain   Num      8    24
     5   cap_loss   Num      8    32
     7   class      Num      6    48   9.
    16   country    Num      6   102   9.
    17   country2   Num      6   108   9.
    10   educ       Num      6    66   9.
     3   educ_num   Num      8    16
     2   fnlwgt     Num      8     8
     6   hourweek   Num      8    40
    11   marital    Num      6    72   9.
    12   occupatn   Num      6    78   9.
    14   race       Num      6    90   9.
    13   relation   Num      6    84   9.
    15   sex        Num      6    96   9.
     9   workcla2   Num      6    60   9.
     8   workclas   Num      6    54   9.
The output from PROC CONTENTS after using the %SQUEEZE macro is:
    -----Alphabetic List of Variables and Attributes-----

     #   Variable   Type   Len   Pos   Format
     1   age        Num      3     8
     4   cap_gain   Num      4     4
     5   cap_loss   Num      3    14
     7   class      Num      3    20   9.
    16   country    Num      3    47   9.
    17   country2   Num      3    50   9.
    10   educ       Num      3    29   9.
     3   educ_num   Num      3    11
     2   fnlwgt     Num      4     0
     6   hourweek   Num      3    17
    11   marital    Num      3    32   9.
    12   occupatn   Num      3    35   9.
    14   race       Num      3    41   9.
    13   relation   Num      3    38   9.
    15   sex        Num      3    44   9.
     9   workcla2   Num      3    26   9.
     8   workclas   Num      3    23   9.
The effects of compression are shown in the next table. The compression ratio is the ratio of the size of the compressed data set (in bytes) to the size of the uncompressed data set; the smaller the ratio, the greater the reduction in the disk space required to store the compressed data set.
For example, the baseline census data set, before being compressed and before using the %SQUEEZE macro, occupies 320 pages of 12,288 bytes each, for a total of 3,932,160 bytes.
The statistic (1 - compression ratio) represents the fraction of disk space that is gained by compressing the census data set. Using the %SQUEEZE macro alone, with no compression, reduces the data set to 47% of its original size, a savings of 53%.
Census Data

    Compress Option   Squeezed   Pages   Page Size   Size (Bytes)   Compression Ratio
    None              No           320      12,288      3,932,160   N.A.
    None              Yes          225       8,192      1,843,200   47%
    Binary            No           225       8,192      1,843,200   47%
    Binary            Yes          529       4,096      2,166,784   55%
    Char              No           225       8,192      1,843,200   47%
    Char              Yes          394       4,096      1,613,824   41%
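As a quick check of the reported ratios, the compression ratio for the squeezed, uncompressed census data set can be computed directly from the table:

```sas
data _null_ ;
   ratio = 1843200 / 3932160 ;   /* squeezed size / baseline size = 0.46875 */
   put ratio= percent8.1 ;       /* about 47%, as reported in the table */
run ;
```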
Because there are no character variables in the data, you might not consider the census data to be representative. Therefore, the data set that contains leukemia experimental data, leukv, was used. This data set has 7,129 observations and 105 variables (69 integer numeric and 36 character). The results are shown in the following table.
Leukemia Data

    Compress Option   Squeezed   Pages   Page Size   Size (Bytes)   Compression Ratio
    None              No           286      16,384      4,685,824   N.A.
    None              Yes          147      16,384      2,408,448   51%
    Binary            No           129      16,384      2,113,536   45%
    Binary            Yes          119      16,384      1,949,696   42%
    Char              No           159      16,384      2,605,056   56%
    Char              Yes          139      16,384      2,277,376   49%
A final example, which uses generated data, might be considered more representative of some data sets. Fifty numeric variables were created, each containing random values. Half of these variables held integer values in the range [0, 1,000,000] and the other half held real values in approximately the same range. Limiting the range is justified because many numeric variables are represented to the nearest thousand with sufficient accuracy, and doing so saves space on a printed report. The following code creates the SAS data set A.
    data a ;
       array a {*} a1-a50 ;
       do i = 1 to 1000000 ;
          x = ranuni(2001) ;
          y = min(6, round(10*x, 1)) ;
          do j = 1 to 50 ;
             a{j} = ranuni(2002) * 10 ** y ;
             if mod(j, 2) = 1 then a{j} = floor(a{j}) ;
          end ;
          output ;
       end ;
    run ;
There are 28 integer and 26 real numeric variables in the data set, including the loop indices i and j, and the function results x and y. The compression results are shown below.
Generated Data

    Compress Option   Squeezed   Pages    Page Size   Size (Bytes)   Compression Ratio
    None              No         27,028      16,384   442,826,752    N.A.
    None              Yes        19,609      16,384   321,273,856    73%
    Binary            No         23,182      16,384   379,813,888    86%
    Binary            Yes        20,230      16,384   331,448,320    75%
    Char              No         22,520      16,384   368,967,680    83%
    Char              Yes        20,043      16,384   328,384,512    74%
This final set of compression ratios helps to confirm the pattern that we have seen: the best compression is obtained with data sets that contain small integer numeric variables. Data sets that contain large numbers of real numeric variables are limited in the amount of compression that can be obtained.
In general, compression is a valuable technique to optimize disk space requirements for a SAS data set. The cost of compression is the additional CPU cycles required to compress the data set initially, and then to decompress a compressed data set prior to its use. However, the amount of time that is used might be justified because I/O speed is (usually) orders of magnitude slower than CPU speed, and the less I/O performed, the better.
Additional savings in disk space might be obtained by combining the %SQUEEZE macro with the COMPRESS= option, although the tables above show that the most effective combination depends on the data.
About the author:
Ross Bettinger is a SAS Analytical Consultant. He provides support for Enterprise Miner and has been involved with data mining projects for 7 years. He has been a SAS user for 15 years. His professional interests are related to data mining, statistical analysis of data, feature selection and transformation, model building, and algorithm development.
These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.
Type: Sample
Date Modified: 2005-09-08 03:03:13
Date Created: 2004-10-07 12:53:41

Product Family: SAS System
Product: Base SAS
Host: All
SAS Release: n/a (starting) to n/a (ending)