The COMPRESS= option is both a SAS system option and a data set option that can be used to minimize the disk space required to store a SAS data set. The COMPRESS= option performs a sophisticated check for patterns among data elements and then uses run-length encoding (RLE) or Ross Data Compression (RDC) to represent those patterns in less space than was required by the original data.
COMPRESS=YES or COMPRESS=CHAR uses RLE to encode repeating single characters in an observation. COMPRESS=BINARY uses RDC to encode repeating patterns of multiple characters in an observation. You should be aware that, regardless of the value used in the COMPRESS= option, all compressed numeric variables are expanded into 8 bytes in the program data vector prior to being used in computations in the DATA or PROC steps. So, although compression saves disk space, it requires additional CPU time.
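As a brief illustration of the two ways to specify the option (the library and data set names here are hypothetical), COMPRESS= can be set globally as a system option or applied to a single data set as a data set option:

```sas
/* Set compression globally for every data set created in this session */
options compress=binary ;

/* Or compress only one data set; MYLIB.PACKED and MYLIB.UNPACKED
   are hypothetical names used for illustration */
data mylib.packed ( compress=char ) ;
   set mylib.unpacked ;
run ;
```

The data set option overrides the system option for that data set, so the two forms can be mixed in one program.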
The lengths of the data elements are not changed when compression is specified. The default length of a numeric variable is 8 bytes. To further reduce disk space, you can specify minimum lengths for numeric variables. For example, in SAS for Windows, a numeric variable with a length of 3 bytes can represent integer values up to 8,192 exactly. Let's assume that the maximum value of the variable is less than 8,192. If the default length of 8 bytes is used, 5 of the 8 bytes are filled with zeros and are unused. This wastes disk space in every observation.
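A LENGTH statement placed before the SET statement stores such a variable in fewer bytes. This minimal sketch assumes a hypothetical input data set ORIGINAL whose variable SMALL_INT is known never to exceed 8,192:

```sas
data squeezed ;
   length small_int 3 ;   /* 3 bytes suffice for integers up to 8,192 */
   set original ;         /* ORIGINAL is a hypothetical input data set */
run ;
```

The LENGTH statement must precede the SET statement; otherwise the variable keeps the 8-byte length it has in the input data set.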
For large data sets, substantial savings of disk space might accrue if numeric variables are specified in the LENGTH statement to be only as long as necessary for the maximum value that will be stored in them. Judicious application of the TRUNC function can determine the minimum number of bytes required to represent a variable's value without loss of accuracy. The following table shows the relationship between the length of a numeric variable and the largest integer value that it can represent exactly at that length.
Significant Digits and Largest Integer by Length for SAS Variables under Windows

    Length in Bytes   Largest Integer Represented Exactly   Exponential Notation
    3                 8,192                                 2^13
    4                 2,097,152                             2^21
    5                 536,870,912                           2^29
    6                 137,438,953,472                       2^37
    7                 35,184,372,088,832                    2^45
    8                 9,007,199,254,740,992                 2^53
It is important to note that only integer values are represented exactly. A real number with a fractional part can be represented in fewer than 8 bytes, but it might lose some of its accuracy due to truncation. Therefore, the "squeezing" technique discussed here may be used on integer-valued variables, but should not be used on real-valued variables, or your data might be rendered inaccurate by the truncation process. In the code fragment that follows, the integer numeric variable a is squeezed to the minimum number of bytes required to exactly represent all the values contained in it by using the TRUNC function.
    data test ;
       a = 2001 ;
       if trunc( a, 7 ) ne a then length_a = 8 ;
       else if trunc( a, 6 ) ne a then length_a = 7 ;
       else if trunc( a, 5 ) ne a then length_a = 6 ;
       else if trunc( a, 4 ) ne a then length_a = 5 ;
       else if trunc( a, 3 ) ne a then length_a = 4 ;
       else length_a = 3 ;
       put a= length_a= ;
    run ;
The result posted to the log file is:
    a=2001 length_a=3
Also, if a <= 8192, then the minimum number of bytes needed to represent a exactly is 3, but if a = 8193, then length_a = 4. What a difference a small change makes!
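That boundary can be seen directly by running the same TRUNC test on both values. This is a minimal sketch that checks only the 3-byte case:

```sas
data _null_ ;
   do a = 8192, 8193 ;
      /* 8192 = 2^13 survives truncation to 3 bytes; 8193 does not */
      if trunc( a, 3 ) = a then length_a = 3 ;
      else length_a = 4 ;
      put a= length_a= ;
   end ;
run ;
```

The log should show length_a=3 for 8192 and length_a=4 for 8193, matching the table above.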
The code to compute the minimum number of bytes that are required for an arbitrary SAS data set is contained in the %SQUEEZE macro. %SQUEEZE finds the minimum length required to preserve the precision of a SAS integer numeric variable.
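The SAS-supplied %SQUEEZE macro is more elaborate than what is shown here, but its core idea can be sketched as follows. This is a simplified illustration, not the actual macro; the macro name, parameters, and the assumption of at most 200 numeric variables are all hypothetical:

```sas
%macro squeeze_sketch( in=, out= ) ;
   /* Pass 1: for each numeric variable, find the smallest length
      (3 to 8 bytes) that represents every value exactly */
   data _null_ ;
      set &in end=last ;
      array nums {*} _numeric_ ;
      array minlen {200} _temporary_ ;     /* assumes <= 200 numeric variables */
      do i = 1 to dim(nums) ;
         if minlen{i} = . then minlen{i} = 3 ;
         do while ( minlen{i} < 8 and trunc( nums{i}, minlen{i} ) ne nums{i} ) ;
            minlen{i} = minlen{i} + 1 ;
         end ;
      end ;
      if last then do ;
         /* Build a LENGTH statement for all numeric variables */
         length stmt $ 32767 ;
         stmt = 'length' ;
         do i = 1 to dim(nums) ;
            stmt = catx( ' ', stmt, vname(nums{i}), put(minlen{i}, 1.) ) ;
         end ;
         call symputx( 'lenstmt', stmt ) ;
      end ;
   run ;

   /* Pass 2: rewrite the data set with the squeezed lengths */
   data &out ;
      &lenstmt ;
      set &in ;
   run ;
%mend squeeze_sketch ;
```

A hypothetical invocation would be %squeeze_sketch( in=sample.dmdcens, out=work.squeezed ). Real-valued variables naturally end up at 8 bytes because they fail the TRUNC test at every shorter length, so no precision is lost.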
The following code snippet assigns a libref to the sample library and runs PROC CONTENTS on the census data set:

    libname sample 'C:\Program Files\SAS Institute\SAS\V8\dmine\sample' ;

    proc contents data=sample.dmdcens ;
    run ;
The output from PROC CONTENTS before using the %SQUEEZE macro is:
    -----Alphabetic List of Variables and Attributes-----

     #   Variable   Type   Len   Pos   Format
     1   age        Num      8     0
     4   cap_gain   Num      8    24
     5   cap_loss   Num      8    32
     7   class      Num      6    48   9.
    16   country    Num      6   102   9.
    17   country2   Num      6   108   9.
    10   educ       Num      6    66   9.
     3   educ_num   Num      8    16
     2   fnlwgt     Num      8     8
     6   hourweek   Num      8    40
    11   marital    Num      6    72   9.
    12   occupatn   Num      6    78   9.
    14   race       Num      6    90   9.
    13   relation   Num      6    84   9.
    15   sex        Num      6    96   9.
     9   workcla2   Num      6    60   9.
     8   workclas   Num      6    54   9.
The output from PROC CONTENTS after using the %SQUEEZE macro is:
    -----Alphabetic List of Variables and Attributes-----

     #   Variable   Type   Len   Pos   Format
     1   age        Num      3     8
     4   cap_gain   Num      4     4
     5   cap_loss   Num      3    14
     7   class      Num      3    20   9.
    16   country    Num      3    47   9.
    17   country2   Num      3    50   9.
    10   educ       Num      3    29   9.
     3   educ_num   Num      3    11
     2   fnlwgt     Num      4     0
     6   hourweek   Num      3    17
    11   marital    Num      3    32   9.
    12   occupatn   Num      3    35   9.
    14   race       Num      3    41   9.
    13   relation   Num      3    38   9.
    15   sex        Num      3    44   9.
     9   workcla2   Num      3    26   9.
     8   workclas   Num      3    23   9.
The effects of compression are shown in the next table. The compression ratio is the ratio of the size of the compressed data set (in bytes) to the size of the uncompressed data set; the smaller the ratio, the greater the reduction in the disk space required to store the compressed data set.
For example, the baseline census data set, before being compressed and before using the %SQUEEZE macro, occupies 320 pages of 12,288 bytes each, for a total of 3,932,160 bytes.
The statistic (1 - compression ratio) represents the fraction of disk space that is gained by compressing the census data set. Using the %SQUEEZE macro alone, with no compression, reduces the data set to 47% of its original size, a savings of 53%.
Census Data

    Compress Option   Squeezed   Pages   Page Size   Size (Bytes)   Compression Ratio
    None              No           320      12,288      3,932,160   N.A.
    None              Yes          225       8,192      1,843,200   47%
    Binary            No           225       8,192      1,843,200   47%
    Binary            Yes          529       4,096      2,166,784   55%
    Char              No           225       8,192      1,843,200   47%
    Char              Yes          394       4,096      1,613,824   41%
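As a quick check of the reported ratios, the compression ratio for the squeezed, uncompressed census data set can be computed directly from the table:

```sas
data _null_ ;
   ratio = 1843200 / 3932160 ;   /* squeezed size / baseline size = 0.46875 */
   put ratio= percent8.1 ;       /* about 47%, as reported in the table */
run ;
```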
Because there are no character variables in the data, you might not consider the census data to be representative. Therefore, the data set that contains leukemia experimental data, leukv, was used. This data set has 7,129 observations and 105 variables (69 integer numeric and 36 character). The results are shown in the following table.
Leukemia Data

    Compress Option   Squeezed   Pages   Page Size   Size (Bytes)   Compression Ratio
    None              No           286      16,384      4,685,824   N.A.
    None              Yes          147      16,384      2,408,448   51%
    Binary            No           129      16,384      2,113,536   45%
    Binary            Yes          119      16,384      1,949,696   42%
    Char              No           159      16,384      2,605,056   56%
    Char              Yes          139      16,384      2,277,376   49%
A final example, which uses generated data, might be considered more representative of some data sets. Fifty numeric variables were created, each containing random values. Half of these variables held integer values in the range [0, 1,000,000] and the other half held real values in approximately the same range. Limiting the range is justified because many numeric variables are represented to the nearest thousand with sufficient accuracy, and doing so saves space on a printed report. The following code creates the SAS data set A.
    data a ;
       array a {*} a1-a50 ;
       do i = 1 to 1000000 ;
          x = ranuni(2001) ;
          y = min(6, round(10*x, 1)) ;
          do j = 1 to 50 ;
             a{j} = ranuni(2002) * 10 ** y ;
             if mod(j, 2) = 1 then a{j} = floor(a{j}) ;
          end ;
          output ;
       end ;
    run ;
There are 28 integer and 26 real numeric variables in the data set, including the loop indices i and j, and the function results x and y. The compression results are shown below.
Generated Data

    Compress Option   Squeezed   Pages    Page Size   Size (Bytes)   Compression Ratio
    None              No         27,028      16,384   442,826,752    N.A.
    None              Yes        19,609      16,384   321,273,856    73%
    Binary            No         23,182      16,384   379,813,888    86%
    Binary            Yes        20,230      16,384   331,448,320    75%
    Char              No         22,520      16,384   368,967,680    83%
    Char              Yes        20,043      16,384   328,384,512    74%
This final set of compression ratios helps to confirm the pattern that we have seen: the best compression is obtained with data sets that contain small integer numeric variables. Data sets that contain large numbers of real numeric variables are limited in the amount of compression that can be obtained.
In general, compression is a valuable technique to optimize disk space requirements for a SAS data set. The cost of compression is the additional CPU cycles required to compress the data set initially, and then to decompress a compressed data set prior to its use. However, the amount of time that is used might be justified because I/O speed is (usually) orders of magnitude slower than CPU speed, and the less I/O performed, the better.
Additional savings in disk space might be obtained by combining the %SQUEEZE macro with the COMPRESS= option, although the tables above show that the most effective combination depends on the data.
About the author:
Ross Bettinger is a SAS Analytical Consultant. He provides support for Enterprise Miner and has been involved with data mining projects for 7 years. He has been a SAS user for 15 years. His professional interests are related to data mining, statistical analysis of data, feature selection and transformation, model building, and algorithm development.
These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.
Type: Sample
Date Modified: 2005-09-08 03:03:13
Date Created: 2004-10-07 12:53:41

Product Family: SAS System
Product: Base SAS
Host: All
SAS Release: n/a (starting) to n/a (ending)