mblen
mbstowcs
mbtowc
wcstombs
wctomb
There are two kinds of DBCS sequences, mixed and pure (in Standard terminology, multibyte and wide; this discussion uses DBCS terms). Mixed sequences may contain both single- and double-byte characters, while pure sequences contain only double-byte characters.
A mixed DBCS sequence must follow these rules:
The value for SO is \x0E and the value for SI is \x0F. For example, the following is a mixed DBCS string in hex:
\x81\x82\x83\x0E\x41\x52\x0F\x81
The \x41\x52 between the \x0E and \x0F is a double-byte character. The other characters are single-byte.
'a' is \x81 in hex.
In the double-byte state, each character is represented by 2 bytes. Double-byte characters must conform to the following constraints:
The first byte must be between \x41 and \xFE, except for the encoding of the blank space.
The second byte must be between \x41 and \xFE, except for the encoding of the blank space.
The encoding of the blank space is \x40\x40.
A mixed sequence must not contain empty SO/SI pairs (\x0E\x0F). For example, the following sequence (in hex), which might be construed as a single multibyte character, is not valid:
\x0E\x0F\x0E\x0F\x0E\x0F\x0E\x41\x81\x0F
This restriction is imposed because the number of bytes used to represent a multibyte character would otherwise, in theory, be unbounded, whereas the Standard requires an implementation to define a maximum byte length for a multibyte character.
On the other hand, consecutive SI/SO pairs (\x0F\x0E) are permitted because they may result from string concatenation. For example, the following sequence (in hex) is valid:
\x0E\x41\x81\x0F\x0E\x41\x83\x0F
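The rules above can be sketched as a small stand-alone checker. This is only an illustration and not part of the SAS/C Library: the names mixed_dbcs_valid, dbcs_byte_ok, DBCS_SO, and DBCS_SI are hypothetical, and the sketch assumes that a valid sequence both begins and ends in single-byte state and that an SI without a preceding SO is an error.

```c
#include <stddef.h>

#define DBCS_SO 0x0E   /* shift out: enter double-byte state     */
#define DBCS_SI 0x0F   /* shift in: return to single-byte state  */

/* Returns 1 if b lies in the range allowed for either byte of a */
/* double-byte character (the blank is handled separately).      */
static int dbcs_byte_ok(unsigned char b)
{
    return b >= 0x41 && b <= 0xFE;
}

/* Hypothetical helper: returns 1 if the null-terminated string  */
/* s obeys the mixed DBCS rules described above, otherwise 0.    */
int mixed_dbcs_valid(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        if (*p == DBCS_SI)
            return 0;              /* SI with no preceding SO (assumed invalid) */
        if (*p != DBCS_SO) {
            p++;                   /* ordinary single-byte character */
            continue;
        }
        p++;                       /* skip the SO */
        if (*p == DBCS_SI)
            return 0;              /* empty SO/SI pair is not permitted */
        while (*p != DBCS_SI) {
            if (p[0] == '\0' || p[1] == '\0')
                return 0;          /* ended in double-byte state (assumed invalid) */
            if (!(p[0] == 0x40 && p[1] == 0x40) &&
                !(dbcs_byte_ok(p[0]) && dbcs_byte_ok(p[1])))
                return 0;          /* byte outside \x41..\xFE and not the blank */
            p += 2;
        }
        p++;                       /* skip the SI */
    }
    return 1;
}
```

With this sketch, the concatenated sequence \x0E\x41\x81\x0F\x0E\x41\x83\x0F is accepted, while the sequence that starts with repeated \x0E\x0F pairs is rejected.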
The wide character type, wchar_t, is implementation-defined as an integer type capable of representing all the codes for the largest character set in locales supported by the implementation. wchar_t is implemented by the SAS/C Library in <stddef.h> as follows:
typedef unsigned short wchar_t;
wchar_t
elements. When a mixed character sequence contains characters that require only a single byte, these characters are converted to wchar_t, but their values are unchanged. For example, the mixed string "abc" is represented as follows:
\x81\x82\x83\x00
When converted to a pure DBCS sequence, the string will become the following:
\x00\x81\x00\x82\x00\x83\x00\x00
Use the mbtowc function to convert one multibyte character to a double-byte character. Use the mbstowcs function to convert a sequence of multibyte characters to a double-byte sequence. Note that this function assumes the sequence is terminated by the null character, \x00. You can also use regular string-handling functions with mixed DBCS sequences. For example, you can use strlen to determine the byte length of a sequence, as long as the sequence is null-terminated.
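As a rough illustration of the mixed-to-pure conversion just described, the following sketch wraps mbstowcs in a hypothetical helper named mixed_to_pure (the name and the long return type are assumptions made for this example). It relies only on Standard behavior. The width of wchar_t and the character codes differ between implementations, so the byte values shown earlier apply to the SAS/C Library specifically.

```c
#include <stdlib.h>

/* Hypothetical helper: converts the null-terminated mixed string */
/* "mixed" into at most "max" wide characters stored in "pure".   */
/* Returns the number of wide characters stored, not counting any */
/* terminating null, or -1 if the mixed sequence is invalid.      */
long mixed_to_pure(wchar_t *pure, const char *mixed, size_t max)
{
    size_t n = mbstowcs(pure, mixed, max);

    if (n == (size_t)-1)       /* mbstowcs reports errors as (size_t)-1 */
        return -1;
    return (long)n;
}
```

For a string made up only of single-byte characters, such as "abc", the helper returns 3, which matches what strlen reports for the same string, and each wide character holds the same value as the original byte.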
When converting from pure to mixed, SO/SI pairs are added to the sequence as necessary. Use the wctomb function to convert one double-byte character to a multibyte character. Use the wcstombs function to convert a sequence of double-byte characters to a multibyte sequence. Note that this function assumes the sequence is terminated by the null wide character, \x00\x00.
CRABDBCS bit in CRABFLGM in your start-up routine or in L$UMAIN.
printf, sprintf, scanf, sscanf, and strftime as required by the Standard. Recognition of a mixed sequence within a format requires that a double-byte locale such as "DBCS" be in effect.
Mixed sequences are treated like any other character sequence in the format string: they are copied unchanged to output or matched on scanf input. The one exception is that invalid sequences may cause premature termination of the function. The conversion specifier % and the specifications associated with it, which are embedded within the format string, are recognized only in single-byte mode, which is the initial shift state at the beginning of the format string.
"S370"
and "POSIX" do not support DBCS sequences. The default locale, "", may or may not support DBCS sequences, depending on the values of locale-related environment variables. Of the three locales supplied by the SAS/C Library, "DBCS" and "DBEX" support DBCS sequences, while "SAMP" does not.
The macro MB_CUR_MAX, defined in <stdlib.h>, defines the longest sequence of bytes needed to represent a single multibyte character in the current locale. The macro MB_LEN_MAX, on the other hand, is not locale-dependent and defines the longest multibyte character permitted across all locales.
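One practical consequence is that a buffer of MB_LEN_MAX bytes can hold any single multibyte character regardless of the locale in effect. The sketch below shows the idea with a hypothetical helper, fits_one_mbchar, which is not a library function. It assumes MB_LEN_MAX comes from <limits.h>, as the Standard specifies.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if a buffer of "size" bytes can */
/* hold any single multibyte character of the current locale.     */
int fits_one_mbchar(size_t size)
{
    return size >= (size_t)MB_CUR_MAX;
}
```

Because MB_CUR_MAX never exceeds MB_LEN_MAX, an array declared as char buf[MB_LEN_MAX] passes this check in every locale, which is why the examples in this section pass MB_LEN_MAX as the byte limit.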
#include <stdlib.h>
int mblen(const char *s, size_t n);
mblen determines how many bytes are needed to represent the multibyte character pointed to by s. n specifies the maximum number of bytes of the multibyte character sequence to examine.
If s is not NULL, the return value is as follows:
0    s points to the null character.
the number of bytes in the character    n or fewer bytes constitute a valid multibyte character.
-1    n or fewer bytes do not constitute a valid multibyte character.
If s is NULL, the return value is as follows:
0
A diagnostic is not issued if mblen encounters invalid data; a return value of -1 is the only indication of an error.
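The return values can be exercised with two small wrappers. Both names, encoding_has_shift_states and first_char_length, are hypothetical. The first wrapper relies on the Standard's rule that calling mblen with a NULL pointer returns nonzero when the encoding is state-dependent and zero otherwise; whether the SAS/C "DBCS" locale reports nonzero is not confirmed here.

```c
#include <stdlib.h>

/* Hypothetical helper: nonzero if the multibyte encoding of the  */
/* current locale uses shift states. The call also resets mblen   */
/* to the initial shift state.                                    */
int encoding_has_shift_states(void)
{
    return mblen(NULL, 0) != 0;
}

/* Hypothetical helper: byte length of the first character of s,  */
/* 0 if s points to the null character, or -1 if the bytes do not */
/* form a valid multibyte character.                              */
int first_char_length(const char *s)
{
    return mblen(s, MB_CUR_MAX);
}
```

In a locale without shift states, such as the default "C" locale on most hosted implementations, the first helper returns 0 and the second returns 1 for any ordinary single-byte character.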
/* This example counts multibyte characters (not including */
/* terminating null) in a DBCS mixed string using mblen(). */
#include <locale.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

/* "strptr" points to the beginning of a DBCS MIXED string. */
/* RETURNS: number of multibyte characters                  */
int count1(char *strptr)
{
   int i = 0;      /* number of multibyte characters found */
   int charlen;    /* byte length of current character     */

   /* Inform library that we will be accepting a DBCS string. */
   /* That is, SO and SI are not regular control characters:  */
   /* they indicate a change in shift state.                  */
   setlocale(LC_ALL, "dbcs");

   /* Reset to initial shift state. (A valid mixed string */
   /* must begin in initial shift state).                 */
   mblen(NULL, 0);

   /* One loop iteration per character. Advance "strptr" by */
   /* number of bytes consumed.                             */
   while ((charlen = mblen(strptr, MB_LEN_MAX)) != 0) {
      if (charlen < 0) {
         fputs("Invalid MIXED DBCS string\n", stderr);
         fclose(stderr);   /* close before abort() so the call is reached */
         abort();
      }
      strptr += charlen;
      i++;
   }
   return i;
}
#include <stdlib.h>
size_t mbstowcs(wchar_t *pwcs, const char *s, size_t n);
mbstowcs converts a sequence of multibyte characters (mixed DBCS sequence) pointed to by s into a sequence of corresponding wide characters (pure DBCS sequence) and stores the output sequence in the array pointed to by pwcs. The multibyte character sequence is assumed to begin in the initial shift state. n specifies the maximum number of wide characters to be stored.
mbstowcs returns the number of elements of pwcs that were modified, excluding the terminating 0 code, if any. If the sequence of multibyte characters is invalid, mbstowcs returns -1.
mbtowc
.
If copying takes place between objects that overlap, the behavior of mbstowcs is undefined.
A diagnostic is not issued if mbstowcs encounters invalid data; a return value of -1 is the only indication of an error.
mbstowcs and wcstombs.
#include <locale.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

#define MAX_CHARACTERS 81

/* "old_string" is the input MIXED DBCS string. "new_string" */
/* is the output MIXED DBCS string. "old_wchar" is the       */
/* character to be replaced. "new_wchar" is the character to */
/* replace it with.                                          */
void mbsrepl(char *old_string, char *new_string,
             wchar_t old_wchar, wchar_t new_wchar)
{
   wchar_t work[MAX_CHARACTERS];
   size_t nchars;
   size_t i;

   /* Inform library that we will be accepting a DBCS string. */
   /* That is, SO and SI are not regular control characters:  */
   /* they indicate a change in shift state.                  */
   setlocale(LC_ALL, "dbcs");

   /* mbstowcs reports an error as (size_t)-1. A count equal to */
   /* MAX_CHARACTERS means "work" has no terminating null.      */
   nchars = mbstowcs(work, old_string, MAX_CHARACTERS);
   if (nchars == (size_t)-1 || nchars == MAX_CHARACTERS) {
      fputs("Invalid or overlong DBCS string.\n", stderr);
      fclose(stderr);
      abort();
   }

   /* Perform the actual substitution. */
   for (i = 0; i < nchars; i++)
      if (work[i] == old_wchar)
         work[i] = new_wchar;

   /* Convert back to MIXED format. */
   nchars = wcstombs(new_string, work, MAX_CHARACTERS);

   /* See if the replacement caused the string to overflow. */
   if (nchars == MAX_CHARACTERS) {
      fputs("Replacement string too large.\n", stderr);
      fclose(stderr);
      abort();
   }
}
#include <stdlib.h>
int mbtowc(wchar_t *pwc, const char *s, size_t n);
mbtowc determines how many bytes are needed to represent the multibyte character pointed to by s. If s is not NULL, mbtowc then stores the corresponding wide character in the array pointed to by pwc. n specifies the maximum number of bytes to examine in the array pointed to by s.
If s is not NULL, the return value is as follows:
0    s points to the null character.
the number of bytes in the character    n or fewer bytes constitute a valid multibyte character.
-1    n or fewer bytes do not constitute a valid multibyte character.
If s is NULL, the return value is as follows:
0
A diagnostic is not issued if mbtowc encounters invalid data; a return value of -1 is the only indication of an error.
mbtowc
.
#include <locale.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

/* "begstr" points to the beginning of a DBCS MIXED string. */
/* "mbc_sought" is the character value we're looking for.   */
int mbfind(char *begstr, wchar_t mbc_sought)
{
   int mbclen;     /* length (in bytes) of current character */
   wchar_t mbc;    /* value of current character             */
   char *strptr;   /* pointer to current location in string  */

   strptr = begstr;

   /* Inform library that we will be accepting a DBCS string. */
   /* That is, SO and SI are not regular control characters:  */
   /* they indicate a change in shift state.                  */
   setlocale(LC_ALL, "dbcs");

   /* Reset to initial shift state. (A valid mixed string */
   /* must begin in initial shift state).                 */
   mbtowc((wchar_t *)NULL, NULL, 0);

   /* One loop iteration per character. Advance "strptr" by */
   /* number of bytes consumed.                             */
   while ((mbclen = mbtowc(&mbc, strptr, MB_LEN_MAX)) != 0) {
      if (mbclen < 0) {
         fputs("Invalid MIXED DBCS string\n", stderr);
         abort();
      }
      if (mbc == mbc_sought)
         break;
      strptr += mbclen;
   }

   /* Last character was not '\0' -- must have found it */
   if (mbclen) {
      printf("MBFIND: found at byte offset %ld\n",
             (long)(strptr - begstr));
      return 1;
   }
   else {
      puts("MBFIND: character not found");
      return 0;
   }
}
#include <stdlib.h>
size_t wcstombs(char *s, const wchar_t *pwcs, size_t n);
wcstombs converts a sequence of wide characters (pure DBCS sequence) to a sequence of multibyte characters (mixed DBCS sequence). The wide characters are in the array pointed to by pwcs, and the resulting multibyte characters are stored in the array pointed to by s. The resulting multibyte character sequence begins in the initial shift state. n specifies the maximum number of bytes to be filled with multibyte characters. The conversion stops if a multibyte character would exceed the limit of n bytes or if a null character is stored.
wcstombs returns the number of bytes of s that were modified, excluding the terminating 0 byte, if any. If the sequence of wide characters is invalid, wcstombs returns -1.
If copying takes place between objects that overlap, the behavior of wcstombs is undefined.
A diagnostic is not issued if wcstombs encounters invalid data; a return value of -1 is the only indication of an error.
mbstowcs
.
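Because wcstombs stops without storing a terminating null when the output fills all n bytes, callers usually need to tell three outcomes apart. The following sketch does so in a hypothetical wrapper, pure_to_mixed. The name and the 0, 1, and -1 result codes are assumptions made for this illustration and are not part of the library.

```c
#include <stdlib.h>

/* Hypothetical helper: converts the pure sequence "pure" into the */
/* "size"-byte buffer "mixed".                                     */
/* Returns  0 if the result fits and is null-terminated,           */
/*          1 if the buffer filled up with no room for the null,   */
/*         -1 if a wide character could not be converted.          */
int pure_to_mixed(char *mixed, size_t size, const wchar_t *pure)
{
    size_t n = wcstombs(mixed, pure, size);

    if (n == (size_t)-1)    /* error is reported as (size_t)-1 */
        return -1;
    if (n == size)          /* every byte used: output is not terminated */
        return 1;
    return 0;
}
```

Comparing against (size_t)-1 rather than testing for a negative value avoids depending on how the unsigned result converts to int.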
#include <stdlib.h>
int wctomb(char *s, wchar_t wchar);
wctomb determines how many bytes are needed to represent the multibyte character corresponding to the wide (pure DBCS) character whose value is wchar, including any change in shift state. It stores the multibyte character representation in the array pointed to by s, assuming s is not NULL. If the value of wchar is 0, wctomb is left in the initial shift state.
If s is not NULL, the return value is the number of bytes that make up the multibyte character corresponding to the value of wchar.
If s is NULL, the return value is as follows:
0
MB_CUR_MAX
macro.
A diagnostic is not issued if wctomb encounters invalid data; a return value of -1 is the only indication of an error.
wctomb
.
#include <locale.h>
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

/* "pure_string" is the input PURE DBCS string.        */
/* "mixed_string" is the output MIXED DBCS string.     */
void mbline(wchar_t *pure_string, char *mixed_string)
{
   int i;
   int mbclen;
   wchar_t wc;

   /* Inform library that we will be accepting a DBCS string. */
   /* That is, SO and SI are not regular control characters:  */
   /* they indicate a change in shift state.                  */
   setlocale(LC_ALL, "dbcs");

   wctomb(NULL, 0);   /* Reset to initial shift state. */

   /* One loop iteration per character. Advance "mixed_string" */
   /* by number of bytes in character. The loop ends after the */
   /* newline character has been converted.                    */
   i = 0;
   do {
      wc = pure_string[i++];
      mbclen = wctomb(mixed_string, wc);
      if (mbclen < 0) {
         puts("Invalid PURE DBCS string.");
         fclose(stdout);   /* close before abort() so the call is reached */
         abort();
      }
      mixed_string += mbclen;
   } while (wc != L'\n');
   *mixed_string = '\0';
}
Copyright (c) 1998 SAS Institute Inc. Cary, NC, USA. All rights reserved.