When you specify the ENCODING=
data set option, the encoding for the output data set might require
more space than the original data set. For example, when writing DBCS
data in a Windows environment using the UTF-8 encoding, each DBCS character
might require three bytes instead of two. To avoid data truncation, each
character variable must have a width that is at least 1.5 times the width
of the original data.
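As a minimal sketch of widening a variable before writing with the ENCODING= data set option, assume a hypothetical libref `mylib`, a data set `source`, and a character variable `name` that was originally defined with a length of 20 bytes. None of these names come from the original text.

```sas
/* Hypothetical names: the path, MYLIB.SOURCE, and NAME are placeholders. */
libname mylib 'SAS data-library';

data work.out_utf8 (encoding='utf-8');
   /* Declare the wider length before SET so that it defines the variable. */
   /* 30 bytes is 1.5 times the assumed original length of 20 bytes.       */
   length name $ 30;
   set mylib.source;
run;
```

The LENGTH statement is placed before SET because the first definition of a variable in a DATA step determines its length.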
When you process a SAS
data file that requires transcoding, you can request that the CVP
(character variable padding) engine expand character variable lengths
so that character data truncation does not occur. (A variable's length
is the number of bytes used to store each of the variable's values.)
Character data truncation
can occur when the number of bytes for a character in one encoding
is different from the number of bytes for the same character in another
encoding, such as when a single-byte character set (SBCS) is transcoded
to a double-byte character set (DBCS) or to a multi-byte character
set (MBCS). An SBCS represents each character in one byte, and a DBCS
represents each character in two bytes. An MBCS represents characters
in a varying length from one to four bytes. For example, when transcoding
from Wlatin2 to a Unicode encoding, such as UTF-8, the variable lengths
(in bytes) might not be sufficient to hold the values, and the result
is character data truncation.
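The difference between bytes and characters can be inspected with a short DATA step. This is only an illustrative sketch: the sample value is made up, and the result depends on the encoding of the SAS session. In a UTF-8 session the accented character occupies two bytes, so the byte count would be expected to exceed the character count; in a single-byte session the two counts would match.

```sas
/* Illustrative value only; results depend on the SAS session encoding. */
data _null_;
   s = 'café';
   bytes = length(s);    /* LENGTH reports the length in bytes      */
   chars = klength(s);   /* KLENGTH reports the length in characters */
   put bytes= chars=;    /* writes both counts to the SAS log        */
run;
```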
Using the CVP engine,
you specify an expansion amount so that variable lengths are expanded
before transcoding, and then the data is processed. Think of the CVP engine
as an intermediate engine that is used to prepare the data for transcoding.
After the lengths are increased, the primary engine, such as the default
base engine, is used to do the actual file processing.
The CVP engine is a
read-only engine for SAS data files only. You can request character
variable expansion (for example, with the LIBNAME statement) in either
of the following ways:
- explicitly specifying the CVP engine
  and using the default expansion of 1.5 times the variable lengths.
- implicitly specifying the CVP engine
  with the LIBNAME option CVPBYTES= or CVPMULTIPLIER=. These options
  specify the expansion amount. In addition, you can use the CVPENGINE=
  option to specify the primary engine to use for processing the SAS
  file; the default is the default SAS engine.
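The implicit form might look like the following sketch. The librefs `exp1`, `exp2`, and `exp3`, the path, and the expansion amounts are placeholders chosen for illustration; each statement is an independent alternative, not a sequence.

```sas
/* Implicit CVP: no engine name is given, the option requests expansion. */
/* The path and the amounts below are placeholders.                      */

/* Multiply each character variable length by 2.5. */
libname exp1 'SAS data-library' cvpmultiplier=2.5;

/* Add 4 bytes to each character variable length. */
libname exp2 'SAS data-library' cvpbytes=4;

/* Same expansion, and name the primary engine explicitly. */
libname exp3 'SAS data-library' cvpbytes=4 cvpengine=v9;
```

CVPMULTIPLIER= scales with the original length, while CVPBYTES= adds a fixed number of bytes, so the better choice depends on how the values are expected to grow.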
For example, the following
LIBNAME statement explicitly assigns the CVP engine. Character variable
lengths are increased using the default expansion, which multiplies
the lengths by 1.5. A character variable with a length
of 10 will have a new length of 15, and a character variable with
a length of 100 will have a new length of 150:
libname expand cvp 'SAS data-library';
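Because the CVP engine is read-only, the expanded data is typically written to a second library. The following sketch assumes a hypothetical output libref `target`, a placeholder output path, and a data set named `mydata`. It uses the OUTENCODING= LIBNAME option and the NOCLONE option of PROC COPY so that the output takes the encoding of the target library instead of inheriting the encoding of the source.

```sas
/* Read through the CVP engine so that character lengths are expanded. */
libname expand cvp 'SAS data-library';

/* Hypothetical output library; new files are written in UTF-8. */
libname target 'output-data-library' outencoding='utf-8';

/* NOCLONE lets the copy take its encoding from the target library. */
proc copy in=expand out=target noclone;
   select mydata;   /* placeholder data set name */
run;
```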
Note: The expansion amount must
be large enough to accommodate the longest transcoded value. Otherwise,
truncation will still occur.
Note: For processing that conditionally
selects a subset of observations by using a WHERE expression, using
the CVP engine might affect performance. Processing the file without
the CVP engine might be faster than processing it with
the CVP engine. For example, if the data set has indexes and you use
the CVP engine, the indexes are not used to optimize the WHERE
expression.
For more information
and examples, see the CVP options in the
LIBNAME Statement in SAS Statements: Reference.