Multibyte Character Functions |
Mixed DBCS Sequences |
A mixed DBCS sequence must follow these rules:
SO | indicates a shift out from the normal single-byte interpretation to an alternative interpretation of characters. |
SI | indicates a shift in, that is, a return to the usual single-byte interpretation. |
The hexadecimal value for SO is \x0E and the value for SI is \x0F. For example, the following is a mixed DBCS string in hex:
\x81\x82\x83\x0E\x41\x52\x0F\x81
The \x41\x52 between the \x0E and \x0F is a double-byte character. The other characters are single-byte.
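The shift-state rule above can be sketched in portable C. This is a minimal illustration, not SAS/C library code: the helper name count_mixed and the DBCS_SO and DBCS_SI macros are hypothetical, and the input is assumed to be a well-formed, null-terminated mixed sequence.

```c
#include <assert.h>
#include <stddef.h>

#define DBCS_SO 0x0E  /* shift out: enter double-byte state */
#define DBCS_SI 0x0F  /* shift in: return to single-byte state */

/*
 * Count the single-byte and double-byte characters in a
 * null-terminated mixed DBCS string. Hypothetical helper that
 * assumes well-formed input; a trailing half character in
 * double-byte state stops the scan.
 */
void count_mixed(const unsigned char *s, size_t *singles, size_t *doubles)
{
    int in_double = 0;  /* a sequence starts in single-byte state */

    assert(s != NULL && singles != NULL && doubles != NULL);
    *singles = 0;
    *doubles = 0;

    while (*s != 0x00) {
        if (*s == DBCS_SO) {
            in_double = 1;
            s++;
        } else if (*s == DBCS_SI) {
            in_double = 0;
            s++;
        } else if (in_double) {
            if (s[1] == 0x00)
                break;          /* truncated pair: stop scanning */
            (*doubles)++;
            s += 2;             /* one character occupies 2 bytes */
        } else {
            (*singles)++;
            s++;
        }
    }
}
```

Applied to the example string above, this sketch reports four single-byte characters and one double-byte character.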
In the single-byte state, each character is represented by 1 byte and has its EBCDIC value. For example, the character constant 'a' is \x81 in hex.
In the double-byte state, each character is represented by 2 bytes, each with a value between \x41 and \xFE, except for the encoding of the blank space, which is \x40\x40.
The SAS/C implementation of multibyte characters does not allow empty SO/SI pairs (\x0E\x0F). For example, the following sequence (in hex), which might be construed as a single multibyte character, is not valid:
\x0E\x0F\x0E\x0F\x0E\x0F\x0E\x41\x81\x0F
Without this restriction, the number of bytes used to represent a multibyte character would, in theory, be unbounded, but the Standard requires an implementation to define a maximum byte length for a multibyte character.
On the other hand, consecutive SI/SO pairs (\x0F\x0E) are permitted because they may result from string concatenation. For example, the following sequence (in hex) is valid:
\x0E\x41\x81\x0F\x0E\x41\x83\x0F
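The two rules above (empty SO/SI pairs rejected, back-to-back SI/SO accepted) can be checked with a small validator. This is a sketch under stated assumptions, not the SAS/C implementation: the function name mixed_is_valid is hypothetical, and the extra checks (SO and SI must alternate, a double-byte section must hold an even number of bytes, the sequence must end in single-byte state) are assumptions that the text does not spell out.

```c
#include <assert.h>
#include <stddef.h>

#define DBCS_SO 0x0E  /* shift out */
#define DBCS_SI 0x0F  /* shift in */

/*
 * Return 1 if the null-terminated mixed sequence passes the checks
 * below, otherwise 0. Hypothetical helper.
 *   From the text: an empty SO/SI pair (\x0E\x0F) is invalid;
 *                  SI followed directly by SO (\x0F\x0E) is valid.
 *   Assumed here:  SO and SI alternate, each double-byte section has
 *                  an even byte count, and the sequence ends in
 *                  single-byte state.
 */
int mixed_is_valid(const unsigned char *s)
{
    int in_double = 0;
    size_t run = 0;     /* bytes seen since the most recent SO */

    assert(s != NULL);

    for (; *s != 0x00; s++) {
        if (*s == DBCS_SO) {
            if (in_double)
                return 0;       /* SO while already shifted out */
            in_double = 1;
            run = 0;
        } else if (*s == DBCS_SI) {
            if (!in_double)
                return 0;       /* SI without a matching SO */
            if (run == 0 || run % 2 != 0)
                return 0;       /* empty pair or half a character */
            in_double = 0;
        } else if (in_double) {
            run++;
        }
    }
    return !in_double;          /* must end in single-byte state */
}
```

With these checks, the invalid sequence shown earlier fails at its first \x0E\x0F, while the concatenated sequence above passes.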
Pure DBCS Sequences |
Pure DBCS sequences contain only double-byte characters. Thus, no SO/SI pairs are needed. The Standard supports pure sequences by providing a type capable of holding wide characters. This type, wchar_t, is implementation-defined as an integer type capable of representing all the codes for the largest character set in locales supported by the implementation. wchar_t is implemented by the SAS/C Library in <stddef.h> as follows:
typedef unsigned short wchar_t;
Converting Sequences |
When converting from mixed to pure, all SO/SI pairs are removed from the sequence, and the double-byte characters are moved into corresponding wchar_t elements. When a mixed character sequence contains characters that require only a single byte, these characters are converted to wchar_t, but their values are unchanged. For example, the mixed string "abc" is represented as follows:
\x81\x82\x83\x00
When converted to a pure DBCS sequence, the string will become the following:
\x00\x81\x00\x82\x00\x83\x00\x00
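The mixed-to-pure conversion described above can be illustrated by hand, without relying on a DBCS locale. The following is a sketch only: mixed_to_pure is a hypothetical name, unsigned short stands in for the SAS/C wchar_t, and packing the first byte of a double-byte character into the high-order half of the element is an assumption consistent with the big-endian byte dump shown above.

```c
#include <assert.h>
#include <stddef.h>

#define DBCS_SO 0x0E  /* shift out */
#define DBCS_SI 0x0F  /* shift in */

/*
 * Convert a well-formed, null-terminated mixed sequence into a pure
 * sequence of 16-bit elements. Hypothetical helper.
 * Writes at most max elements, including the terminating zero.
 * Returns the number of characters stored (terminator excluded),
 * or (size_t)-1 if the output is too small or the input ends in
 * the middle of a double-byte character.
 */
size_t mixed_to_pure(const unsigned char *in, unsigned short *out, size_t max)
{
    int in_double = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (max == 0)
        return (size_t)-1;

    while (*in != 0x00) {
        unsigned short wc;

        if (*in == DBCS_SO) { in_double = 1; in++; continue; }
        if (*in == DBCS_SI) { in_double = 0; in++; continue; }

        if (in_double) {
            if (in[1] == 0x00)
                return (size_t)-1;      /* half a character */
            /* Assumption: first byte is the high-order byte. */
            wc = (unsigned short)((in[0] << 8) | in[1]);
            in += 2;
        } else {
            wc = *in;                   /* value is unchanged */
            in++;
        }
        if (n + 1 >= max)
            return (size_t)-1;          /* keep room for the terminator */
        out[n++] = wc;
    }
    out[n] = 0;
    return n;
}
```

For the "abc" example, this sketch produces the elements 0x0081, 0x0082, 0x0083, and a terminating 0x0000, which matches the byte dump above on a big-endian machine.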
Use the mbtowc function to convert one multibyte character to a double-byte character. Use the mbstowcs function to convert a sequence of multibyte characters to a double-byte sequence. Note that this function assumes the sequence is terminated by the null character, \x00. You can also use regular string-handling functions with mixed DBCS sequences. For example, you can use strlen to determine the byte length of a sequence, as long as the sequence is null-terminated.
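The difference between byte length and character count can be seen by walking a string with mbtowc. This is a minimal sketch using only Standard C calls; count_mb_chars is a hypothetical helper, and it assumes the caller has already selected a suitable locale with setlocale. In the default "C" locale every character is a single byte, so the result equals strlen.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Count the multibyte characters in a null-terminated string by
 * calling mbtowc repeatedly. Hypothetical helper.
 * Returns (size_t)-1 if mbtowc reports an invalid sequence.
 * The result depends on the locale in effect.
 */
size_t count_mb_chars(const char *s)
{
    size_t count = 0;
    size_t left;

    assert(s != NULL);
    left = strlen(s) + 1;           /* include the terminating null */
    (void)mbtowc(NULL, NULL, 0);    /* reset to the initial shift state */

    for (;;) {
        wchar_t wc;
        int len = mbtowc(&wc, s, left);

        if (len < 0)
            return (size_t)-1;      /* invalid multibyte sequence */
        if (len == 0)
            break;                  /* reached the null character */
        s += len;
        left -= (size_t)len;
        count++;
    }
    return count;
}
```

To convert a whole string in one call, mbstowcs(wcs, s, n) fills up to n elements of a wchar_t array wcs and returns the number of wide characters stored, not counting the terminator.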
When converting from pure to mixed, SO/SI pairs are added to the sequence as necessary. Use the wctomb function to convert 1 double-byte character to a multibyte character. Use the wcstombs function to convert a sequence of double-byte characters to a multibyte sequence. Note that this function assumes the sequence is terminated by the null wide character, \x00\x00.
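Adding SO/SI pairs "as necessary" can be sketched by hand as follows. This is an illustration, not the SAS/C wcstombs: pure_to_mixed is a hypothetical name, unsigned short stands in for the SAS/C wchar_t, and treating any element above 0xFF as a double-byte character is an assumption based on the statement that single-byte values are unchanged when widened.

```c
#include <assert.h>
#include <stddef.h>

#define DBCS_SO 0x0E  /* shift out */
#define DBCS_SI 0x0F  /* shift in */

/*
 * Convert a zero-terminated pure sequence into a mixed sequence,
 * inserting SO and SI only where the state changes.
 * Hypothetical helper. Writes at most max bytes, including the
 * terminating null. Returns the byte length (terminator excluded),
 * or (size_t)-1 if the output buffer is too small.
 */
size_t pure_to_mixed(const unsigned short *in, unsigned char *out, size_t max)
{
    int in_double = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (; *in != 0; in++) {
        /* Assumption: values above 0xFF are double-byte characters. */
        int need_double = (*in > 0xFF);

        if (need_double != in_double) {
            if (n >= max)
                return (size_t)-1;
            out[n++] = need_double ? DBCS_SO : DBCS_SI;
            in_double = need_double;
        }
        if (need_double) {
            if (n + 2 > max)
                return (size_t)-1;
            out[n++] = (unsigned char)(*in >> 8);    /* high-order byte */
            out[n++] = (unsigned char)(*in & 0xFF);  /* low-order byte */
        } else {
            if (n >= max)
                return (size_t)-1;
            out[n++] = (unsigned char)*in;
        }
    }
    if (in_double) {                /* return to single-byte state */
        if (n >= max)
            return (size_t)-1;
        out[n++] = DBCS_SI;
    }
    if (n >= max)
        return (size_t)-1;
    out[n] = 0x00;
    return n;
}
```

Because the shift characters are emitted only on a state change, adjacent double-byte characters share one SO/SI pair, and no empty pair is ever produced.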
DBCS Support with SPE |
The multibyte character functions can be used with the SPE framework. Normally this framework does not support locales, and by default DBCS support is not enabled. To enable DBCS support with SPE, turn on the CRABDBCS bit in CRABFLGM in your start-up routine or in L$UMAIN.
Formatted I/O Functions and Multibyte Character Sequences |
Mixed DBCS sequences are supported in the format string for the formatted I/O functions such as printf, sprintf, scanf, sscanf, and strftime, as required by the Standard. Recognition of a mixed sequence within a format requires that a double-byte locale such as "DBCS" be in effect. Mixed sequences are treated like any other character sequence in the format string, with one exception: although they are copied unchanged to output or matched on scanf input, invalid sequences may cause premature termination of the function. The conversion specifier % and the specifications associated with it, which are embedded within the format string, are recognized only while in single-byte mode, which is the initial shift state at the beginning of the format string.
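The rule that % is recognized only in single-byte mode can be illustrated with a small scanner. This is a sketch of the idea, not how the SAS/C formatted I/O functions are implemented: count_percent_single_byte is a hypothetical name, the percent byte is passed as a parameter so the example does not depend on the compiler's character set (in EBCDIC, % is \x6C), and each % byte is counted individually, so a literal %% counts twice.

```c
#include <assert.h>
#include <stddef.h>

#define DBCS_SO 0x0E  /* shift out */
#define DBCS_SI 0x0F  /* shift in */

/*
 * Count the occurrences of the percent byte that appear while the
 * format string is in single-byte state. Bytes with the same value
 * inside a double-byte section are part of a double-byte character
 * and are skipped. Hypothetical helper.
 */
size_t count_percent_single_byte(const unsigned char *fmt,
                                 unsigned char percent)
{
    int in_double = 0;  /* the format starts in single-byte state */
    size_t count = 0;

    assert(fmt != NULL);

    for (; *fmt != 0x00; fmt++) {
        if (*fmt == DBCS_SO)
            in_double = 1;
        else if (*fmt == DBCS_SI)
            in_double = 0;
        else if (!in_double && *fmt == percent)
            count++;
    }
    return count;
}
```

For example, in the EBCDIC byte string \x6C\x84\x0E\x41\x6C\x0F\x6C\x84, only the first and last \x6C bytes introduce a conversion specification; the one between \x0E and \x0F is the second half of a double-byte character.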
Copyright © 2001 by SAS Institute Inc., Cary, NC, USA. All rights reserved.