Data Compression

Overview of Data Compression

SAS LASR Analytic Server supports compression for in-memory tables. All the analytic statements, such as PERCENTILES, LOGISTIC, and so on, in the IMSTAT procedure are supported for compressed tables as well as regular, uncompressed, tables. Clients like SAS Visual Analytics can also operate on compressed tables as well.
All compression is performed by the server. In other words, when you transfer a table to the server in a DATA step and specify the SQUEEZE= data set option, the rows are sent to the server as is, and the server compresses the rows. The server uses the zlib compression algorithm that is described in RFC 1950, "ZLIB Compressed Data Format Specification."
All data in a row, both character and number variables, are compressed. Every row in a table is compressed, the server does not support some rows in compressed form and others as uncompressed. The server can report the uncompressed size of the table, the compressed size, and the compression ratio.
For matrices of computed doubles (with lots of decimal places), compression might not reduce the storage requirements at all. For rows with many long character variables that consist mostly of blanks, the compression ratio can be very high. For rows with mixed variables where most doubles do not have fractional parts and most character variables have a small amount of blank padding, the compression ratio is typically moderate. As with most cases of using compression, character variables tend to compress the most and the ratio depends on your data.

Compressed Tables and the DATA Step

The following example shows how to use the SQUEEZE= data set option for the SAS LASR Analytic Server.
Creating a Compressed Table with a DATA Step
libname example sasiola host="grid001.example.com" port=10010 tag=hps;

data example.prdsale(squeeze=yes);
    set sashelp.prdsale;
run;
After the table is loaded to memory, you can access the compressed table with the Example.Prdsale libref.
The server supports the APPEND= option for compressed tables. The following example shows how to add new rows (uncompressed) to the compressed table:
Appending Rows to a Compressed Table
data example.prdsale(append=yes);
    somelib.newrows;
run;
Because the Example.Prdsale table is already compressed, the new rows are automatically compressed as they are appended to the table. Specifying SQUEEZE= with APPEND= has no effect. If the table is compressed, the server compresses the new rows. If the table is not compressed, the server does not compress the new rows (even if SQUEEZE=YES is specified). The compressed or uncompressed state of the table determines how the rows are appended.
Partitioning and compression are supported together. The following example creates a new in-memory table that is partitioned and compressed:
Creating a Partitioned and Compressed Table
data example.iris(partition=(species) squeeze=yes);
    set sashelp.iris;
run;

data example.iris(append=yes);
    set somelib.moreirises;
run;
In the first DATA statement, the Iris data set is loaded to memory on the server and is partitioned by the formatted values of the Species variable. The table is also compressed. In the second DATA statement, the table is appended to with more rows. Because the in-memory table is already partitioned and compressed, the new rows are automatically partitioned and compressed when they are appended.

Compressed Tables and the LASR Procedure

The LASR procedure is used for loading data to memory on distributed SAS LASR Analytic Server.
The following example shows how to read a SAS data set and compress it in memory on the server:
proc lasr add data=sashelp.prdsale port=10010 squeeze;
    performance host="grid001.example.com";
run;
The example uses the SQUEEZE option to read the Prdsale data set from the Sashelp library and compress it in-memory on the server. Be aware that you must specify the SQUEEZE option for each table that you want to load in compressed form. You cannot specify SQUEEZE with the CREATE option when you start a server and have the server automatically compress all tables.
The SQUEEZE option works when reading data from SAS/ACCESS engines, too. The resulting in-memory table is compressed whether you read data serially from a standard database or you read data in parallel from a distributed database like Greenplum Database.
When you read SASHDAT tables into memory, the compression for the resulting in-memory tables depends on the following:
  • whether a WHERE clause is used
  • whether the SASHDAT table is compressed on disk
If you specify a WHERE clause and the SQUEEZE option, then the server evaluates the WHERE clause as it reads data from HDFS and compresses the rows that meet the WHERE clause criteria. The memory efficiencies of the SASHDAT table format are forfeited in this scenario because the server had to apply the WHERE clause.
If you do not specify a WHERE clause, then the server ignores the SQUEEZE option and relies on whether the SASHDAT table is compressed. If the SASHDAT table is compressed, then the in-memory representation of the table is also compressed. If the SASHDAT table is not compressed, then the in-memory representation is not compressed either. The server ignores the option so that it can keep the memory efficiencies of the SASHDAT table format—when a SASHDAT table is loaded to memory, the in-memory representation is identical to the on-disk representation.

Performance Considerations

Compression exchanges less memory use for more CPU use. It slows down any request that processes the data. An in-memory table consists of blocks of rows. When the server works with a compressed table, the blocks of rows must be uncompressed before the server can work with the variables. In some cases, a request can require five times longer to run with a compressed table rather than an uncompressed table.
For example, if you want to summarize two variables in a table that has 100 variables, all 100 columns must be uncompressed in order to locate the data for the two variables of interest. If you specify a WHERE clause, then the server must uncompress the data before the WHERE clause can be applied. Like the example where only two of 100 variables are used, if the WHERE clause is very restrictive, then there is a substantial performance penalty to filter out most of the rows.
Working with SASHDAT tables that are loaded from HDFS is the most memory-efficient way to use the server. Using compressed SASHDAT tables preserves the memory efficiencies, but still incurs the performance penalty of uncompressing the rows as the server operates on each row..

Interactions

The interactions for compressed tables and SAS programs are as follows:
  • You can use a compressed table in programs like any other table.
  • You can define calculated columns for compressed tables with the COMPUTE statement or with the TEMPNAMES= and TEMPEXPRESS= options.
  • You can use SIGNER= security with compressed tables.
  • You can append to compressed tables with the SET statement or the APPEND= data set option. This is also supported for compressed tables that have partitioning. However, you cannot append to a compressed table that is partitioned and has an ORDERBY specification.
  • You can use the UPDATE statement with a compressed table.
  • You can use a compressed table with a statement in the IMSTAT procedure that produces a temporary table. Whether the resulting temporary table is compressed depends on whether you specify the TEMPSQUEEZE option in the IMSTAT procedure. The following statements in the IMSTAT procedure support creating temporary tables:
    • ACCESS
    • AGGREGATE
    • BALANCE
    • CLUSTER
    • CROSSTAB
    • DECISIONTREE
    • DISTINCT
    • GENMODEL
    • GLM
    • GROUPBY
    • LOGISTIC
    • MDSUMMARY
    • PARTITION
    • PERCENTILE
    • RANDOMWOODS
    • SCHEMA (does not support creating compressed temporary tables)
    • SCORE
    • SUMMARY
  • You can use the BALANCE and PARTITION statements to create a compressed table when the TEMPSQUEEZE option is used. Applying orderby and compression at the same time has a significant performance penalty.
  • You can use the UNCOMPRESS statement with a compressed and partitioned table. The temporary table that is created is partitioned according to the original table.
  • You can use DELETEROWS and the PURGE option with a compressed table. There is a significant performance penalty for the PURGE option.
  • You can create an empty table in compressed form, using either a DATA step or the CREATETABLE statement of the IMSTAT procedure. Although the table has no rows, rows that are appended to it later are compressed.
  • You can use the SCORE statement with a compressed table.
  • You can use a WHERE clause with the COMPRESS and UNCOMPRESS statements to move a subset of the rows to the temporary table.
  • You can use the SAVE statement to save a compressed table to a SASHDAT table that is either compressed or uncompressed.
  • You can use the SAVE statement to save an uncompressed table to a SASHDAT table that is in either compressed or uncompressed.
  • You can specify the ORDERBY= option with the FETCH statement to read from a compressed table. The ORDERBY= option has a performance penalty.
  • You can specify a sort order for a compressed table. Applying the sort has a performance penalty.
  • You can use the SET statement to append a compressed table to an uncompressed table. The new rows are uncompressed because the base table is uncompressed.
  • You can use the SET statement to append an uncompressed table to a compressed table. The new rows are compressed because the base table is compressed.
  • You can use the UNCOMPRESS statement to create an uncompressed temporary table. The uncompressed table provides better performance.

Limitations

The limitations for working with compressed tables are as follows:
  • You cannot apply compression to views that are created with the SCHEMA statement.
  • You cannot use compressed tables with the SCHEMA statement, even if you specify MODE=TABLE.
  • You cannot append to compressed tables that have an ORDERBY specification within the partitions.
  • The SAS LASR Analytic Server engine and the SAS Data in HDFS engine engine do not support the appending to compressed tables when the ORDERBY= data set option is used. However, you can load the table to memory in uncompressed form with partitioning and ordering and then use the COMPRESS statement. This creates a compressed and partitioned temporary table in which the rows are ordered.