Multibyte Character Functions

SAS/C Implementation of Multibyte Character Sequences

The ISO/ANSI C Standard defines a multibyte character as consisting of 1 or more bytes, but it leaves the implementation of multibyte sequences up to individual vendors. The SAS/C Library supports both single-byte and multibyte characters. Characters consisting of more than 1 byte are supported in the context of the EBCDIC Double-Byte Character Set (DBCS). "Multibyte Character Support" in Chapter 4, "Compiler Processing and Code Generation Conventions," of SAS/C Compiler and Library User's Guide discusses the SAS/C implementation of multibyte characters in more detail.

There are two kinds of DBCS sequences, mixed and pure (in Standard terminology, multibyte and wide; this discussion uses DBCS terms). Mixed sequences may contain both single- and double-byte characters, while pure sequences contain only double-byte characters.

Mixed DBCS Sequences

Several methods exist for handling mixed DBCS sequences. For example, an encoding scheme may set aside a subrange of values to signal multibyte sequences. Another popular encoding scheme sets aside a single byte value to indicate a shift out from a normal interpretation of character codes to an alternate interpretation, where groups of bytes represent certain characters. This method is referred to as shift-out/shift-in encoding and is the method the SAS/C Compiler uses to handle multibyte sequences. This encoding scheme uses shift states, which indicate how a byte value or set of byte values will be interpreted. The SAS/C Compiler uses shift-out/shift-in encoding because it is the DBCS encoding defined for the EBCDIC character set.

A mixed DBCS sequence must follow these rules:

DBCS sequences must begin and end in the initial shift state, that is, 1 byte per character.

Any subsequence of double-byte characters must be preceded by SO (shift-out) and followed by SI (shift-in), the state-dependent shift codes.

SO	indicates a shift out from the normal single-byte interpretation to an alternative interpretation of characters.
SI	indicates a shift in, that is, a return to the usual single-byte interpretation.

The hexadecimal value for SO is \x0E and the value for SI is \x0F. For example, the following is a mixed DBCS string in hex:

\x81\x82\x83\x0E\x41\x52\x0F\x81

The \x41\x52 between the \x0E and \x0F is a double-byte character. The other characters are single-byte.

SO/SIs must be paired.
SO/SIs cannot be nested.
An SO/SI pair must surround an even number of bytes.
The Standard requires that a null character terminate a multibyte sequence, even in the double-byte shift state. This is a departure from DBCS sequences in other languages, which always require an explicit shift back into the single-byte shift state before the end of a sequence.

In the single-byte state, each character is represented by 1 byte and has its EBCDIC value. For example, the character constant 'a' is \x81 in hex.

In the double-byte state, each character is represented by 2 bytes. Double-byte characters must conform to the following constraints:

All first bytes must have values between \x41 and \xFE, except for the encoding of the blank space.
All second bytes must have values between \x41 and \xFE, except for the encoding of the blank space.
The blank space is represented by \x40\x40.

The SAS/C implementation of multibyte characters does not allow empty SO/SI pairs (\x0E\x0F). For example, the following sequence (in hex), which might be construed as a single multibyte character, is not valid:

\x0E\x0F\x0E\x0F\x0E\x0F\x0E\x41\x81\x0F

This restriction is imposed because, if empty SO/SI pairs were allowed, the number of bytes used to represent a single multibyte character would, in theory, be unbounded, whereas the Standard requires an implementation to define a maximum byte-length for a multibyte character.

On the other hand, consecutive SI/SO pairs (\x0F\x0E) are permitted because they may result from string concatenation. For example, the following sequence (in hex) is valid:

\x0E\x41\x81\x0F\x0E\x41\x83\x0F
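
The rules above can be sketched as a small checker. This is only an illustration, not part of the SAS/C Library: the names dbcs_pair_ok and dbcs_mixed_is_valid are hypothetical, and the sketch assumes the caller passes the byte count without the terminating null. It applies the rule that a sequence must end in the initial shift state, so it does not model a null terminator that appears while still in the double-byte state.

```c
#include <stddef.h>

#define DBCS_SO 0x0E  /* shift out: enter double-byte state */
#define DBCS_SI 0x0F  /* shift in: return to single-byte state */

/* Hypothetical helper: a double-byte character is either the blank
   (\x40\x40) or two bytes that each lie between \x41 and \xFE. */
static int dbcs_pair_ok(unsigned char b1, unsigned char b2)
{
    if (b1 == 0x40 && b2 == 0x40)
        return 1;
    return b1 >= 0x41 && b1 <= 0xFE && b2 >= 0x41 && b2 <= 0xFE;
}

/* Hypothetical checker: returns 1 if the n bytes at s follow the
   mixed DBCS rules described in the text, otherwise 0. */
int dbcs_mixed_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;
    size_t chars_in_pair = 0;  /* double-byte characters since the last SO */
    int in_double = 0;         /* 0 = single-byte state, 1 = double-byte state */

    while (i < n) {
        unsigned char b = s[i];

        if (!in_double) {
            if (b == DBCS_SI)
                return 0;              /* SI without a matching SO */
            if (b == DBCS_SO) {
                in_double = 1;
                chars_in_pair = 0;
            }
            i++;
        } else if (b == DBCS_SO) {
            return 0;                  /* nested SO */
        } else if (b == DBCS_SI) {
            if (chars_in_pair == 0)
                return 0;              /* empty SO/SI pair */
            in_double = 0;
            i++;
        } else {
            if (i + 1 >= n)
                return 0;              /* odd number of bytes before the end */
            if (!dbcs_pair_ok(b, s[i + 1]))
                return 0;              /* bad byte, or SI after an odd byte count */
            chars_in_pair++;
            i += 2;
        }
    }
    return !in_double;                 /* must end in the initial shift state */
}
```

With this sketch, the two valid example sequences shown earlier are accepted, and the sequence containing empty SO/SI pairs is rejected.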

Pure DBCS Sequences

Pure DBCS sequences contain only double-byte characters. Thus, no SO/SI pairs are needed. The Standard supports pure sequences by providing a type capable of holding wide characters. This type, wchar_t, is implementation-defined as an integer type capable of representing all the codes for the largest character set in locales supported by the implementation. wchar_t is implemented by the SAS/C Library in <stddef.h> as follows:

typedef unsigned short wchar_t;

Converting Sequences

When converting from mixed to pure, all SO/SI pairs are removed from the sequence, and the double-byte characters are moved into corresponding wchar_t elements. When a mixed character sequence contains characters that require only a single byte, these characters are converted to wchar_t, but their values are unchanged. For example, the mixed string "abc" is represented as follows, including its terminating null:

\x81\x82\x83\x00

When converted to a pure DBCS sequence, the string will become the following:

\x00\x81\x00\x82\x00\x83\x00\x00
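
The mixed-to-pure conversion just described can be sketched by hand as follows. This is an illustration only and is not how the library implements mbstowcs: mixed_to_pure is a hypothetical name, the input is assumed to be a valid null-terminated mixed sequence, and each double-byte character is assumed to be stored with its first byte in the high-order half of the 16-bit element.

```c
#include <stddef.h>

#define DBCS_SO 0x0E
#define DBCS_SI 0x0F

/* Hypothetical helper, not a SAS/C library function. Converts a
   null-terminated mixed sequence into 16-bit elements. Returns the
   number of elements written, not counting the terminating zero, or
   (size_t)-1 if out is too small or a double-byte character is cut
   off by the terminator. */
size_t mixed_to_pure(const unsigned char *in, unsigned short *out,
                     size_t out_cap)
{
    size_t n = 0;
    int in_double = 0;

    while (*in != 0x00) {
        unsigned short w;

        if (*in == DBCS_SO) { in_double = 1; in++; continue; }  /* dropped */
        if (*in == DBCS_SI) { in_double = 0; in++; continue; }  /* dropped */

        if (in_double) {
            if (in[1] == 0x00)
                return (size_t)-1;          /* truncated double-byte character */
            /* Assumption: first byte becomes the high-order byte. */
            w = (unsigned short)((in[0] << 8) | in[1]);
            in += 2;
        } else {
            w = *in++;                      /* value unchanged, just widened */
        }

        if (n + 1 >= out_cap)
            return (size_t)-1;              /* no room for element plus terminator */
        out[n++] = w;
    }

    if (out_cap == 0)
        return (size_t)-1;
    out[n] = 0;                             /* null wide character */
    return n;
}
```

For the "abc" example, the sketch produces the element values 0x0081, 0x0082, 0x0083, and 0x0000. On a big-endian machine, these correspond to the byte layout shown above.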

Use the mbtowc function to convert one multibyte character to a double-byte character. Use the mbstowcs function to convert a sequence of multibyte characters to a double-byte sequence. Note that mbstowcs assumes the sequence is terminated by the null character, \x00. You can also use regular string-handling functions with mixed DBCS sequences. For example, you can use strlen to determine the byte-length of a sequence, as long as the sequence is null-terminated.
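
As a rough illustration of the difference between byte-length and character count, the following sketch calls the standard strlen and mbstowcs functions on the same null-terminated sequence. The struct seq_lengths type and the measure_sequence function are hypothetical names introduced for this example. In a single-byte locale the two counts are equal. With a double-byte locale in effect, a sequence containing SO/SI pairs would be expected to report more bytes than characters, although that case is not exercised here.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this example. */
struct seq_lengths {
    size_t bytes;  /* byte-length reported by strlen */
    size_t chars;  /* characters converted by mbstowcs */
};

/* Hypothetical helper: converts s into buf (capacity buf_cap elements)
   and records both lengths. Returns 0 on success, or -1 if mbstowcs
   reports an invalid multibyte sequence. If the character count equals
   buf_cap, buf is not null-terminated. */
int measure_sequence(const char *s, wchar_t *buf, size_t buf_cap,
                     struct seq_lengths *out)
{
    size_t n = mbstowcs(buf, s, buf_cap);

    if (n == (size_t)-1)
        return -1;

    out->bytes = strlen(s);
    out->chars = n;
    return 0;
}
```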

When converting from pure to mixed, SO/SI pairs are added to the sequence as necessary. Use the wctomb function to convert one double-byte character to a multibyte character. Use the wcstombs function to convert a sequence of double-byte characters to a multibyte sequence. Note that wcstombs assumes the sequence is terminated by the null wide character, \x00\x00.
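
The reverse direction can be sketched in the same hand-rolled style. Again, this is an illustration and not the library's wcstombs: pure_to_mixed is a hypothetical name, elements with values of \xFF or less are assumed to be single-byte characters, and the sketch assumes a final SI should be emitted so that the result ends in the initial shift state.

```c
#include <stddef.h>

#define DBCS_SO 0x0E
#define DBCS_SI 0x0F

/* Hypothetical helper, not a SAS/C library function. Converts a
   zero-terminated sequence of 16-bit elements to a mixed sequence,
   inserting SO/SI where the shift state changes. Returns the number
   of bytes written, not counting the terminating null, or (size_t)-1
   if out is too small. */
size_t pure_to_mixed(const unsigned short *in, unsigned char *out,
                     size_t out_cap)
{
    size_t n = 0;
    int in_double = 0;

    for (; *in != 0; in++) {
        int is_double = (*in > 0xFF);   /* assumption: > 0xFF means double-byte */

        if (is_double != in_double) {   /* shift state changes here */
            if (n >= out_cap)
                return (size_t)-1;
            out[n++] = is_double ? DBCS_SO : DBCS_SI;
            in_double = is_double;
        }

        if (is_double) {
            if (n + 2 > out_cap)
                return (size_t)-1;
            out[n++] = (unsigned char)(*in >> 8);    /* first byte */
            out[n++] = (unsigned char)(*in & 0xFF);  /* second byte */
        } else {
            if (n >= out_cap)
                return (size_t)-1;
            out[n++] = (unsigned char)*in;
        }
    }

    if (in_double) {                    /* return to the initial shift state */
        if (n >= out_cap)
            return (size_t)-1;
        out[n++] = DBCS_SI;
    }

    if (n >= out_cap)
        return (size_t)-1;
    out[n] = 0x00;
    return n;
}
```

Because SO is emitted only when the state actually changes, adjacent double-byte characters share one SO/SI pair instead of producing consecutive SI/SO pairs.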

DBCS Support with SPE

The multibyte character functions can be used with the SPE framework. Normally this framework does not support locales, and by default DBCS support is not enabled. To enable DBCS support with SPE, turn on the CRABDBCS bit in CRABFLGM in your start-up routine or in L$UMAIN.

Formatted I/O Functions and Multibyte Character Sequences

Mixed DBCS sequences are supported in the format string for the formatted I/O functions such as printf, sprintf, scanf, sscanf, and strftime, as required by the Standard. Recognition of a mixed sequence within a format requires that a double-byte locale such as "DBCS" be in effect. Mixed sequences are treated like any other character sequence in the format string: they are copied unchanged to output or matched on scanf input. The one exception is that invalid sequences may cause premature termination of the function. The conversion specifier % and the specifications associated with it, which are embedded within the format string, are recognized only while in single-byte mode, which is the initial shift state at the beginning of the format string.
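
The single-byte-mode rule for % can be illustrated with a small scanner that counts occurrences of a marker byte only while outside an SO/SI pair. This is a sketch and not the library's format parser: count_single_byte_marks is a hypothetical name, the marker byte is passed in by the caller, and the value \x6C used in the usage note is the assumed EBCDIC encoding of %.

```c
#include <stddef.h>

#define DBCS_SO 0x0E
#define DBCS_SI 0x0F

/* Hypothetical helper: counts bytes equal to mark that appear in
   single-byte mode in a null-terminated format string. Bytes between
   SO and SI belong to double-byte characters and are ignored, even
   when they happen to have the same value as mark. */
size_t count_single_byte_marks(const unsigned char *fmt, unsigned char mark)
{
    size_t count = 0;
    int in_double = 0;  /* format strings start in the initial shift state */

    for (; *fmt != 0x00; fmt++) {
        if (*fmt == DBCS_SO)
            in_double = 1;
        else if (*fmt == DBCS_SI)
            in_double = 0;
        else if (!in_double && *fmt == mark)
            count++;
    }
    return count;
}
```

For example, in the byte string \x6C\x84\x0E\x6C\x6C\x0F\x6C\xA2, the sketch counts two markers: the \x6C\x6C inside the SO/SI pair is a single double-byte character and is skipped.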
