Multibyte Character Functions

Introduction

This chapter introduces fundamental concepts of multibyte character sequences, discusses the SAS/C implementation of multibyte character sequences, and describes five functions designed specifically to work with multibyte character sequences. These functions are as follows:

mblen: determines the length of a multibyte character.
mbstowcs: converts a multibyte character sequence to a wide character sequence.
mbtowc: converts a single multibyte character to a wide character.
wcstombs: converts a wide character sequence to a multibyte character sequence.
wctomb: converts a single wide character to a multibyte character.

SAS/C Implementation of Multibyte Character Sequences

The ISO/ANSI C Standard defines a multibyte character as consisting of 1 or more bytes, but it leaves the implementation of multibyte sequences up to individual vendors. The SAS/C Library supports both single-byte and multibyte characters. Characters consisting of more than 1 byte are supported in the context of the EBCDIC Double-Byte Character Set (DBCS). "Multibyte Character Support" in Chapter 4, "Compiler Processing and Code Generation Conventions," of SAS/C Compiler and Library User's Guide, Fourth Edition discusses the SAS/C implementation of multibyte characters in more detail.

There are two kinds of DBCS sequences, mixed and pure (in Standard terminology, multibyte and wide; this discussion uses DBCS terms). Mixed sequences may contain both single- and double-byte characters, while pure sequences contain only double-byte characters.

Mixed DBCS Sequences

Several methods exist for handling mixed DBCS sequences. For example, an encoding scheme may set aside a subrange of values to signal multibyte sequences. Another popular encoding scheme sets aside a single byte value to indicate a shift out from a normal interpretation of character codes to an alternate interpretation, where groups of bytes represent certain characters. This method is referred to as shift-out/shift-in encoding and is the method the SAS/C Compiler uses to handle multibyte sequences. This encoding scheme uses shift states, which indicate how a byte value or set of byte values will be interpreted. The SAS/C Compiler uses shift-out/shift-in encoding because it is the DBCS encoding defined for the EBCDIC character set.

A mixed DBCS sequence must follow these rules:

DBCS sequences must begin and end in the initial shift state, that is, 1 byte per character.
any subsequence of double-byte characters must be preceded by and followed by a state-dependent encoding, SO/SI (shift-out/shift-in).

SO
indicates a shift out from the normal single-byte interpretation to an alternative interpretation of characters.
SI
indicates a shift in, that is, a return to the usual single-byte interpretation.
The hexadecimal value for SO is \x0E and the value for SI is \x0F. For example, the following is a mixed DBCS string in hex:
```
 \x81\x82\x83\x0E\x41\x52\x0F\x81
```
The \x41\x52 between the \x0E and \x0F is a double-byte character. The other characters are single-byte.
SO/SIs must be paired.
SO/SIs cannot be nested.
an SO/SI pair must surround an even number of bytes.
the Standard requires that a null character terminate a multibyte sequence, even in the double-byte shift state. This is a departure from DBCS sequences in other languages, which always require an explicit shift back into the single-byte shift state before the end of a sequence.

In the single-byte state, each character is represented by 1 byte and has its EBCDIC value. For example, the character constant 'a' is \x81 in hex.

In the double-byte state, each character is represented by 2 bytes. Double-byte characters must conform to the following constraints:

all first bytes must have values between \x41 and \xFE, except for the encoding of the blank space.
all second bytes must have values between \x41 and \xFE, except for the encoding of the blank space.
the blank space is represented by \x40\x40.

The SAS/C implementation of multibyte characters does not allow empty SO/SI pairs (\x0E\x0F). For example, the following sequence (in hex), which might be construed as a single multibyte character, is not valid:

 \x0E\x0F\x0E\x0F\x0E\x0F\x0E\x41\x81\x0F

This restriction is imposed because the number of bytes used to represent a multibyte character would, in theory, be unbounded; but the Standard requires an implementation to define a maximum byte-length for a multibyte character.

On the other hand, consecutive SI/SO pairs (\x0F\x0E) are permitted because they may result from string concatenation. For example, the following sequence (in hex) is valid:

 \x0E\x41\x81\x0F\x0E\x41\x83\x0F
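
The rules above can be checked mechanically. The following is a minimal sketch of such a check, not a routine supplied by the SAS/C Library; the names dbcs_mixed_valid, DBCS_SO, and DBCS_SI are hypothetical. It assumes that a null byte at a character boundary ends the sequence in either shift state, as the Standard requires.

```c
#include <stddef.h>

#define DBCS_SO 0x0E   /* shift out: start of a double-byte run     */
#define DBCS_SI 0x0F   /* shift in: return to the single-byte state */

/* Returns 1 if the null-terminated byte string "s" obeys the      */
/* mixed DBCS rules described in this section, 0 otherwise.        */
int dbcs_mixed_valid(const unsigned char *s)
{
    int in_dbcs = 0;   /* 0 = single-byte state, 1 = double-byte    */
    size_t run = 0;    /* double-byte characters since the last SO  */

    while (*s != 0x00) {
        if (!in_dbcs) {
            if (*s == DBCS_SI)
                return 0;              /* SI without a matching SO  */
            if (*s == DBCS_SO) {
                in_dbcs = 1;
                run = 0;
            }
            s++;
        } else if (*s == DBCS_SI) {
            if (run == 0)
                return 0;              /* empty SO/SI pair          */
            in_dbcs = 0;
            s++;
        } else {
            /* s[0] is not null here, so reading s[1] is safe.      */
            unsigned char b1 = s[0], b2 = s[1];
            int blank = (b1 == 0x40 && b2 == 0x40);
            int in_range = b1 >= 0x41 && b1 <= 0xFE &&
                           b2 >= 0x41 && b2 <= 0xFE;

            if (!blank && !in_range)
                return 0;              /* also rejects a nested SO  */
            run++;
            s += 2;
        }
    }
    return 1;
}
```

Because SO and SI fall outside the range \x41 through \xFE, a nested SO or an odd number of bytes between SO and SI fails the byte-range test without needing a separate check.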

Pure DBCS Sequences

Pure DBCS sequences contain only double-byte characters. Thus, no SO/SI pairs are needed. The Standard supports pure sequences by providing a type capable of holding wide characters. This type, wchar_t, is implementation-defined as an integer type capable of representing all the codes for the largest character set in locales supported by the implementation. wchar_t is implemented by the SAS/C Library in <stddef.h> as follows:

 typedef unsigned short wchar_t;

Converting Sequences

When converting from mixed to pure, all SO/SI pairs are removed from the sequence, and the double-byte characters are moved into corresponding wchar_t elements. When a mixed character sequence contains characters that require only a single byte, these characters are converted to wchar_t, but their values are unchanged. For example, the mixed string ("abc") is represented as follows:

 \x81\x82\x83\x00

When converted to a pure DBCS sequence, the string will become the following:

 \x00\x81\x00\x82\x00\x83\x00\x00

Use the mbtowc function to convert 1 multibyte character to a double-byte character. Use the mbstowcs function to convert a sequence of multibyte characters to a double-byte sequence. Note that this function assumes the sequence is terminated by the null character, \x00. You also can use regular string-handling functions with mixed DBCS sequences. For example, you can use strlen to determine the byte-length of a sequence, as long as the sequence is null-terminated.
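
To make the mixed-to-pure mapping concrete, here is a hand-written sketch of the transformation described above. It is an illustration only and is not how mbstowcs is implemented; the names mixed_to_pure and dbcs_wchar are hypothetical, with dbcs_wchar standing in for the SAS/C wchar_t. The sketch assumes the input is already a valid mixed sequence.

```c
#include <stddef.h>

#define DBCS_SO 0x0E
#define DBCS_SI 0x0F

typedef unsigned short dbcs_wchar;  /* stand-in for the SAS/C wchar_t */

/* Converts the null-terminated mixed sequence "src" to a pure      */
/* sequence in "dst", which has room for "cap" elements including   */
/* the terminating 0. Returns the number of elements stored, not    */
/* counting the terminator, or (size_t)-1 on overflow or if the     */
/* input ends in the middle of a double-byte character.             */
size_t mixed_to_pure(dbcs_wchar *dst, size_t cap,
                     const unsigned char *src)
{
    size_t out = 0;
    int in_dbcs = 0;

    while (*src != 0x00) {
        if (*src == DBCS_SO) { in_dbcs = 1; src++; continue; }
        if (*src == DBCS_SI) { in_dbcs = 0; src++; continue; }
        if (out + 1 >= cap)
            return (size_t)-1;        /* keep room for terminator   */
        if (in_dbcs) {
            if (src[1] == 0x00)
                return (size_t)-1;    /* truncated double-byte char */
            dst[out++] = (dbcs_wchar)((src[0] << 8) | src[1]);
            src += 2;
        } else {
            dst[out++] = *src++;      /* value is unchanged         */
        }
    }
    if (out >= cap)
        return (size_t)-1;
    dst[out] = 0;
    return out;
}
```

Applied to the mixed string \x81\x82\x83, the sketch stores the elements 0x0081, 0x0082, 0x0083, and a terminating 0, matching the pure sequence shown above.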

When converting from pure to mixed, SO/SI pairs are added to the sequence as necessary. Use the wctomb function to convert 1 double-byte character to a multibyte character. Use the wcstombs function to convert a sequence of double-byte characters to a multibyte sequence. Note that this function assumes the sequence is terminated by the null wide character, \x00\x00.
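
The reverse direction can be sketched in the same spirit. The code below is an illustration, not the library's wcstombs; pure_to_mixed and dbcs_wchar are hypothetical names. It rests on two assumptions that the text does not state: a wide value above 0xFF is treated as a double-byte character, and a closing SI is written before the terminating null so that the output ends in the initial shift state.

```c
#include <stddef.h>

#define DBCS_SO 0x0E
#define DBCS_SI 0x0F

typedef unsigned short dbcs_wchar;  /* stand-in for the SAS/C wchar_t */

/* Converts the 0-terminated pure sequence "src" to a mixed         */
/* sequence in "dst", which has room for "cap" bytes including the  */
/* terminating null. Returns the number of bytes stored, not        */
/* counting the null, or (size_t)-1 if "dst" is too small.          */
size_t pure_to_mixed(unsigned char *dst, size_t cap,
                     const dbcs_wchar *src)
{
    size_t out = 0;
    int in_dbcs = 0;          /* output starts in the initial state */

    if (cap == 0)
        return (size_t)-1;
    for (; *src != 0; src++) {
        int wide = (*src > 0xFF);   /* assumption: double-byte      */
        /* bytes for this character plus a shift byte if needed     */
        size_t need = (wide ? 2u : 1u) + (wide != in_dbcs ? 1u : 0u);
        /* bytes still required afterwards: optional SI and null    */
        size_t tail = wide ? 2u : 1u;

        if (out + need + tail > cap)
            return (size_t)-1;
        if (wide != in_dbcs) {
            dst[out++] = wide ? DBCS_SO : DBCS_SI;
            in_dbcs = wide;
        }
        if (wide)
            dst[out++] = (unsigned char)(*src >> 8);
        dst[out++] = (unsigned char)(*src & 0xFF);
    }
    if (in_dbcs)
        dst[out++] = DBCS_SI;       /* end in initial shift state   */
    dst[out] = 0x00;
    return out;
}
```

Note that a shift byte is written only when the state changes, so consecutive double-byte characters share one SO/SI pair.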

DBCS Support with SPE

The multibyte character functions can be used with the SPE framework. Normally this framework does not support locales, and by default DBCS support is not enabled. To enable DBCS support with SPE, turn on the CRABDBCS bit in CRABFLGM in your start-up routine or in L$UMAIN.

Formatted I/O Functions and Multibyte Character Sequences

Mixed DBCS sequences are supported in the format string for the formatted I/O functions such as printf, sprintf, scanf, sscanf, and strftime, as required by the Standard. Recognition of a mixed sequence within a format requires that a double-byte locale such as "DBCS" be in effect. Mixed sequences are treated like any other character sequence in the format string: they are copied unchanged to output or matched on scanf input. The one exception is that invalid sequences may cause premature termination of the function. The conversion specifier % and the specifications associated with it, which are embedded within the format string, are recognized only in single-byte mode, which is the initial shift state at the beginning of the format string.

Locales and Multibyte Character Sequences

The processing of multibyte character sequences is dependent on the current locale. (See Localization for a full discussion of locales.) For example, some locales support DBCS sequences and some do not. The standard locales "S370" and "POSIX" do not support DBCS sequences. The default locale, "", may or may not support DBCS sequences, depending on the values of locale-related environment variables. Of the three locales supplied by the SAS/C Library, "DBCS" and "DBEX" support DBCS sequences, while "SAMP" does not.

The macro MB_CUR_MAX, defined in <stdlib.h>, gives the longest sequence of bytes needed to represent a single multibyte character in the current locale. The macro MB_LEN_MAX, defined in <limits.h>, is not locale-dependent and gives the longest multibyte character permitted across all locales.
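
The difference between the two macros matters when sizing buffers. The sketch below uses hypothetical names (one_char_buffer, mixed_buffer_size) and assumes, as the Standard describes for wctomb, that no single conversion stores more than MB_CUR_MAX bytes, shift bytes included. Because MB_CUR_MAX depends on the current locale, it may not be usable as an array bound; MB_LEN_MAX is a constant and can be.

```c
#include <limits.h>
#include <stdlib.h>

/* A fixed-size buffer large enough for one multibyte character in  */
/* any locale. MB_LEN_MAX is a compile-time constant.               */
typedef char one_char_buffer[MB_LEN_MAX];

/* Upper bound, in bytes, on the mixed form of "nwide" wide         */
/* characters plus the terminating null in the current locale.      */
/* The extra MB_CUR_MAX allows for the terminator and any shift     */
/* back to the initial state that precedes it.                      */
size_t mixed_buffer_size(size_t nwide)
{
    return (nwide + 1) * (size_t)MB_CUR_MAX;
}
```

A caller might allocate mixed_buffer_size(n) bytes with malloc before calling wcstombs, and use one_char_buffer for the single-character output of wctomb.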

Function Descriptions

Descriptions of each multibyte character function follow. Each description includes a synopsis, a description, discussions of return values and portability issues, and an example. Also, errors, cautions, diagnostics, implementation details, and usage notes are included if appropriate. None of the multibyte character functions are supported by traditional UNIX C Compilers.

mblen -- Determine Length of a Multibyte Character

SYNOPSIS

 #include <stdlib.h>

 int mblen(const char *s, size_t n);

DESCRIPTION

mblen determines how many bytes are needed to represent the multibyte character pointed to by s.

n specifies the maximum number of bytes of the multibyte character sequence to examine.

RETURN VALUE

If s is not NULL, the return value is as follows:

0: is returned if s points to the null character.
length of the multibyte character: is returned if the next n or fewer bytes constitute a valid multibyte character.
-1: is returned if the next n or fewer bytes do not constitute a valid multibyte character.

If s is NULL , the return value is as follows:

nonzero value: is returned if the current locale supports state-dependent encodings.
0: is returned if the current locale does not support state-dependent encodings.
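
The two calling modes can be wrapped in small helpers. This is a sketch with hypothetical names (locale_has_shift_states, next_char_length); it uses only the mblen behavior described above.

```c
#include <stdlib.h>

/* Returns 1 if the current locale uses a state-dependent encoding  */
/* (such as SO/SI), 0 otherwise. Calling mblen with a NULL string   */
/* also resets it to the initial shift state.                       */
int locale_has_shift_states(void)
{
    return mblen(NULL, 0) != 0;
}

/* Returns the byte length of the multibyte character at "s",       */
/* 0 if "s" points to the null character, or -1 if the bytes do     */
/* not form a valid character. At most MB_CUR_MAX bytes are         */
/* examined.                                                        */
int next_char_length(const char *s)
{
    return mblen(s, MB_CUR_MAX);
}
```

In a locale without DBCS support, locale_has_shift_states returns 0 and next_char_length returns 1 for every nonnull byte.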

CAUTIONS

A diagnostic is not issued if mblen encounters invalid data; a return value of -1 is the only indication of an error.

EXAMPLE

    /* This example counts multibyte characters (not including   */ 
    /* terminating null) in a DBCS mixed string using mblen().   */ 

 #include <locale.h>
 #include <limits.h>
 #include <stdlib.h>
 #include <stdio.h>

    /* "strptr" points to the beginning of a DBCS MIXED string.  */ 
    /* RETURNS: number of multibyte characters                   */ 
 int count1(char *strptr)
 {
    int i = 0;          /* number of multibyte characters found  */ 
    int charlen;        /* byte length of current character      */ 

    /* Inform library that we will be accepting a DBCS string.   */ 
    /* That is, SO and SI are not regular control characters:    */ 
    /* they indicate a change in shift state.                    */ 
    setlocale(LC_ALL, "dbcs");
       /* Reset to initial shift state. (A valid mixed string    */ 
       /* must begin in initial shift state).                    */ 
    mblen(NULL, 0);

      /* One loop iteration per character. Advance "strptr" by   */ 
      /* number of bytes consumed.                               */ 
    while ((charlen = mblen(strptr, MB_LEN_MAX)) != 0) {
       if (charlen < 0) {
          fputs("Invalid MIXED DBCS string\n", stderr);
          fclose(stderr);
          abort();
       }
       strptr += charlen;
       i++;
    }
    return i;
 }

mbstowcs -- Convert a Multibyte Character Sequence to a Wide Sequence

SYNOPSIS

 #include <stdlib.h>

 size_t mbstowcs(wchar_t *pwcs, const char *s, size_t n);

DESCRIPTION

mbstowcs converts a sequence of multibyte characters (mixed DBCS sequence) pointed to by s into a sequence of corresponding wide characters (pure DBCS sequence) and stores the output sequence in the array pointed to by pwcs. The multibyte character sequence is assumed to begin in the initial shift state.

n specifies the maximum number of wide characters to be stored.

RETURN VALUE

If the multibyte character sequence is valid, mbstowcs returns the number of elements of pwcs that were modified, excluding the terminating 0 code, if any. If the sequence of multibyte characters is invalid, mbstowcs returns -1.

CAUTIONS

No multibyte characters that follow a null character are examined or converted. If the sequence you want to convert contains such a value in the middle, you should use a loop that calls mbtowc.

If copying takes place between objects that overlap, the behavior of mbstowcs is undefined.

A diagnostic is not issued if mbstowcs encounters invalid data; a return value of -1 is the only indication of an error.
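
The mbtowc loop suggested in the first caution might look like the following sketch. The function name mb_buffer_to_wide is hypothetical, and the sketch assumes the caller knows the byte length of the input, since a null character no longer marks the end. It also assumes that a null character occupies 1 byte.

```c
#include <stdlib.h>

/* Converts "len" bytes starting at "src", including any embedded   */
/* null characters, into at most "cap" wide characters in "dst".    */
/* Returns the number of wide characters stored, or (size_t)-1 if   */
/* an invalid multibyte character is found.                         */
size_t mb_buffer_to_wide(wchar_t *dst, size_t cap,
                         const char *src, size_t len)
{
    size_t in = 0;    /* bytes consumed from src                    */
    size_t out = 0;   /* wide characters stored in dst              */

    mbtowc(NULL, NULL, 0);        /* reset to initial shift state   */
    while (in < len && out < cap) {
        int k = mbtowc(&dst[out], src + in, len - in);

        if (k < 0)
            return (size_t)-1;
        /* mbtowc returns 0 for a null character; step over it.     */
        in += (k == 0) ? 1 : (size_t)k;
        out++;
    }
    return out;
}
```

Unlike mbstowcs, this loop does not stop at the first null character; it stops when the input bytes or the output elements are exhausted.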

EXAMPLE

This example replaces all occurrences of a given character in a mixed DBCS string. The string is assumed to have a maximum length of 81 characters. This example uses mbstowcs and wcstombs.

 #include <locale.h>
 #include <limits.h>
 #include <stdlib.h>
 #include <stdio.h>

 #define MAX_CHARACTERS 81
    /* "old_string" is the input MIXED DBCS string. "new_string" */ 
    /* is the output MIXED DBCS string. "old_wchar" is the wide  */ 
    /* character to be replaced. "new_wchar" is the wide         */ 
    /* character to replace it with.                             */ 
 void mbsrepl(char *old_string, char *new_string,
              wchar_t old_wchar, wchar_t new_wchar)
 {
    wchar_t work[MAX_CHARACTERS];
    int nchars;
    int i;

       /* Inform library that we will be accepting a DBCS string.*/ 
       /* That is, SO and SI are not regular control characters: */ 
       /* they indicate a change in shift state.                 */ 
    setlocale(LC_ALL, "dbcs");

    nchars = mbstowcs(work, old_string, MAX_CHARACTERS);
    if (nchars < 0) {
       fputs("Invalid DBCS string.\n", stderr);
       fclose(stderr);
       abort();
    }

       /* Perform the actual substitution.                       */ 
    for (i = 0; i < nchars; i++)
       if (work[i] == old_wchar)
          work[i] = new_wchar;

       /* Convert back to MIXED format.                          */ 
    nchars = wcstombs(new_string, work, MAX_CHARACTERS);

       /* See if the replacement caused the string to overflow.  */ 
    if (nchars == MAX_CHARACTERS) {
       fputs("Replacement string too large.\n", stderr);
       fclose(stderr);
       abort();
    }
 }

mbtowc -- Convert a Multibyte Character to a Wide Character

SYNOPSIS

 #include <stdlib.h>

 int mbtowc(wchar_t *pwc, const char *s, size_t n);

DESCRIPTION

mbtowc determines how many bytes are needed to represent the multibyte character pointed to by s. If s is not NULL, the character is valid, and pwc is not NULL, mbtowc then stores the corresponding wide character in the object pointed to by pwc.

n specifies the maximum number of bytes to examine in the array pointed to by s.

RETURN VALUE

If s is not NULL, the return value is as follows:

0: is returned if s points to the null character.
length of the multibyte character: is returned if the next n or fewer bytes constitute a valid multibyte character.
-1: is returned if the next n or fewer bytes do not constitute a valid multibyte character.

If s is NULL, the return value is as follows:

nonzero value: is returned if the current locale supports state-dependent encodings.
0: is returned if the current locale does not support state-dependent encodings.

CAUTIONS

A diagnostic is not issued if mbtowc encounters invalid data; a return value of -1 is the only indication of an error.

EXAMPLE

This example finds a multibyte character in a mixed DBCS string using mbtowc.

 #include <locale.h>
 #include <limits.h>
 #include <stdlib.h>
 #include <stdio.h>

    /* "begstr" points to the beginning of a DBCS MIXED string.  */ 
    /*  "mbc_sought" is the character value we're looking for.   */ 
 int mbfind(char *begstr, wchar_t mbc_sought)
 {
    int mbclen;        /* length (in bytes) of current character */ 
    wchar_t mbc;       /* value of current character             */ 
    char *strptr;      /* pointer to current location in string  */ 
    strptr = begstr;

       /* Inform library that we will be accepting a DBCS string.*/ 
       /* That is, SO and SI are not regular control characters: */ 
       /* they indicate a change in shift state.                 */ 
    setlocale(LC_ALL, "dbcs");

       /* Reset to initial shift state. (A valid mixed string    */ 
       /* must begin in initial shift state).                    */ 
    mbtowc((wchar_t *)NULL, NULL, 0);

       /* One loop iteration per character. Advance "strptr" by  */ 
       /* number of bytes consumed.                              */ 
    while ((mbclen = mbtowc(&mbc, strptr, MB_LEN_MAX)) != 0) {
       if (mbclen < 0) {
          fputs("Invalid MIXED DBCS string\n", stderr);
          abort();
       }
       if (mbc == mbc_sought)
          break;
       strptr += mbclen;
    }

       /* Last character was not '\0' -- must have found it      */ 
    if (mbclen) {
       printf("MBFIND: found at byte offset %d\n",
              (int)(strptr - begstr));
       return 1;
    }
    else {
       puts("MBFIND: character not found");
       return 0;
    }
 }

wcstombs -- Convert a Wide Character Sequence to a Multibyte Sequence

SYNOPSIS

 #include <stdlib.h>

 size_t wcstombs(char *s, const wchar_t *pwcs, size_t n);

DESCRIPTION

wcstombs converts a sequence of wide characters (pure DBCS sequence) to a sequence of multibyte characters (mixed DBCS sequence). The wide characters are in the array pointed to by pwcs, and the resulting multibyte characters are stored in the array pointed to by s. The resulting multibyte character sequence begins in the initial shift state.

n specifies the maximum number of bytes to be filled with multibyte characters. The conversion stops if a multibyte character would exceed the limit of n bytes or if a null character is stored.

RETURN VALUE

If every wide character in the sequence corresponds to a valid multibyte character, wcstombs returns the number of bytes of s that were modified, excluding the terminating 0 byte, if any. If a wide character that does not correspond to a valid multibyte character is encountered, wcstombs returns -1.

CAUTIONS

If copying takes place between objects that overlap, the behavior of wcstombs is undefined.

A diagnostic is not issued if wcstombs encounters invalid data; a return value of -1 is the only indication of an error.
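
Because the terminating null is stored only if it fits within n bytes, a return value equal to n means the output was filled without being terminated. A caller can treat that case as an error, as in this sketch; the name wcs_to_mbs_checked is hypothetical.

```c
#include <stdlib.h>

/* Converts the wide sequence "src" into "dst", which holds "cap"   */
/* bytes. Returns 0 if the whole sequence, including the            */
/* terminating null, was stored; returns -1 if a wide character     */
/* was invalid or the result did not fit.                           */
int wcs_to_mbs_checked(char *dst, size_t cap, const wchar_t *src)
{
    size_t n = wcstombs(dst, src, cap);

    if (n == (size_t)-1)
        return -1;       /* invalid wide character                  */
    if (n == cap)
        return -1;       /* buffer full: no terminating null stored */
    return 0;
}
```

Comparing the result against (size_t)-1 rather than testing for a negative value avoids relying on how the unsigned return value converts to int.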

EXAMPLE

See the example for mbstowcs.

wctomb -- Convert a Wide Character to a Multibyte Character

SYNOPSIS

 #include <stdlib.h>

 int wctomb(char *s, wchar_t wchar);

DESCRIPTION

wctomb determines how many bytes are needed to represent the multibyte character corresponding to the wide (pure DBCS) character whose value is wchar, including any change in shift state. It stores the multibyte character representation in the array pointed to by s, assuming s is not NULL. If the value of wchar is 0, wctomb is left in the initial shift state.

RETURN VALUE

If s is not NULL, the return value is the number of bytes that make up the multibyte character corresponding to the value of wchar, or -1 if the value of wchar does not correspond to a valid multibyte character.

If s is NULL, the return value is as follows:

nonzero value: is returned if the current locale supports state-dependent encodings.
0: is returned if the current locale does not support state-dependent encodings.

The return value is never greater than the value of the MB_CUR_MAX macro.

CAUTIONS

A diagnostic is not issued if wctomb encounters invalid data; a return value of -1 is the only indication of an error.
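
Since wctomb writes without knowing the size of the destination, a caller that fills a fixed buffer one character at a time may want to convert into a scratch area first. The following sketch shows one way to do that; append_wide is a hypothetical helper, not a library function.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Appends the multibyte form of "wc" to "dst" at offset "*used".   */
/* "cap" is the total size of "dst" in bytes. Returns 0 and         */
/* advances "*used" on success; returns -1 and leaves "dst" and     */
/* "*used" unchanged if "wc" is invalid or does not fit.            */
int append_wide(char *dst, size_t cap, size_t *used, wchar_t wc)
{
    char tmp[MB_LEN_MAX];          /* large enough in any locale    */
    int k = wctomb(tmp, wc);

    if (k < 0)
        return -1;                 /* invalid wide character        */
    if (*used + (size_t)k > cap)
        return -1;                 /* would overflow dst            */
    memcpy(dst + *used, tmp, (size_t)k);
    *used += (size_t)k;
    return 0;
}
```

One caveat: in a state-dependent encoding, wctomb has already updated its shift state by the time the overflow is detected, so the caller should reset the state with wctomb(NULL, 0) before starting a new sequence.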

EXAMPLE

This example converts a PURE DBCS string to MIXED, stopping at the first new-line character. This example uses wctomb.

 #include <locale.h>
 #include <limits.h>
 #include <stdlib.h>
 #include <stdio.h>

 #define MAX_CHARACTERS 81

    /* "pure_string" is the input PURE DBCS string.               */ 
    /* "mixed_string" is the output MIXED DBCS string.            */ 
 void mbline(wchar_t *pure_string, char *mixed_string)
 {
    int i;
    int mbclen;
    wchar_t wc;

       /* Inform library that we will be accepting a DBCS string. */ 
       /* That is, SO and SI are not regular control characters:  */ 
       /* they indicate a change in shift state.                  */ 
    setlocale(LC_ALL, "dbcs");

    wctomb(NULL, 0);             /* Reset to initial shift state. */ 

       /* One loop iteration per character. Advance "mixed_string"*/ 
       /* by number of bytes in character.                        */ 
    i = 0;
    do {
       wc = pure_string[i++];
       mbclen = wctomb(mixed_string, wc);
       if (mbclen < 0) {
          puts("Invalid PURE DBCS string.");
          fclose(stdout);
          abort();
       }
       mixed_string += mbclen;
    } while (wc != L'\n');

    *mixed_string = '\0';
 }