
Encoding for NLS

Common Encoding Methods

The encoding methods result from standards developed by various computer hardware manufacturers and standards organizations. For more information, see Standards Organizations for NLS Encodings. The common encoding methods are listed here:

ASCII (American Standard Code for Information Interchange)

is a 7-bit encoding for the United States that provides 128 character combinations. The encoding contains characters for uppercase and lowercase English, American English punctuation, base 10 numbers, and a few control characters. This set of 128 characters is common to most other encodings. ASCII is used by personal computers.
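The 7-bit limit can be checked directly. The following Python sketch is only an illustration (the sample strings are arbitrary and are not taken from this document): it encodes a string as ASCII, confirms that every byte value is below 128, and shows that a character outside the 128-character set is rejected.

```python
# Every ASCII character fits in 7 bits, so each encoded byte is below 128.
text = "Hello, NLS 123!"
encoded = text.encode("ascii")

print(list(encoded)[:5])               # byte values of the first five characters
print(all(b < 128 for b in encoded))   # True when the 7-bit limit holds

# A character outside the 128-character set cannot be encoded as ASCII.
try:
    "café".encode("ascii")
    rejected = None
except UnicodeEncodeError as err:
    rejected = err.object[err.start]   # the character that could not be encoded
print("not ASCII:", rejected)
```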

EBCDIC (Extended Binary Coded Decimal Interchange Code) family

is an 8-bit encoding that provides 256 character combinations. There are multiple EBCDIC-based encodings. EBCDIC is used on IBM mainframes and most IBM mid-range computers. EBCDIC follows ISO 646 conventions to facilitate translations between EBCDIC encodings and 7-bit (and 8-bit) ASCII-based encodings. The 95 EBCDIC graphical characters consist of 82 invariant characters (including a blank space), which occupy the same code positions across most EBCDIC single-byte code pages, and 13 variant graphic characters, whose code positions differ from one EBCDIC single-byte code page to another. For details about variant characters, see Code Point Discrepancies among EBCDIC Encodings.
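The difference between invariant and variant characters can be seen by comparing two EBCDIC code pages. The sketch below uses Python's built-in cp037 and cp500 codecs as stand-ins. The choice of these two code pages and of the sample characters is an assumption made for illustration, not something this document prescribes.

```python
# cp037 and cp500 are two EBCDIC single-byte code pages shipped with Python.
CODE_PAGES = ("cp037", "cp500")

def code_points(ch):
    """Return the byte value of ch in each EBCDIC code page."""
    return {cp: ch.encode(cp)[0] for cp in CODE_PAGES}

invariant = code_points("A")   # a letter: same position in both code pages
variant = code_points("[")     # a variant character: position differs

print("A:", {cp: hex(v) for cp, v in invariant.items()})
print("[:", {cp: hex(v) for cp, v in variant.items()})

# Decoding with the wrong code page silently yields a different character.
raw = "[".encode("cp037")
print(repr(raw.decode("cp500")))
```

Because the mismatch raises no error, data moved between EBCDIC systems generally needs to carry its code page along with it.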

ISO (International Organization for Standardization) 646 family

is a 7-bit encoding that is an international standard and provides 128 character combinations. The ISO 646 family of encodings is similar to ASCII except that it has 12 code points for national variants. The 12 national variants represent specific characters that are needed for a particular language.

ISO 8859 family and Windows family

is an 8-bit extension of ASCII that supports all of the ASCII code points and adds 128 more, providing 256 character combinations. Latin1, which is officially named ISO-8859-1, is the most frequently used member of the ISO 8859 family of encodings. In addition to the ASCII characters, Latin1 contains accented characters, other letters needed for languages of Western Europe, and some special characters. The HTTP and HTML protocols were originally based on Latin1.
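As a rough illustration, the following Python sketch encodes a Western European string as Latin1 and checks that each character takes exactly one byte, with the accented letters landing in the upper half of the 8-bit range. It also contrasts Latin1 with Windows-1252 (Python codec cp1252), chosen here as an example member of the Windows family. The sample text is arbitrary.

```python
text = "Señor Müller"
latin1 = text.encode("latin-1")

# One byte per character; accented letters use byte values 128-255.
print(len(latin1) == len(text))
print([hex(b) for b in latin1 if b >= 128])

# ASCII characters keep their ASCII code points in Latin1.
ascii_same = "Senor".encode("latin-1") == "Senor".encode("ascii")
print(ascii_same)

# Windows-1252 defines the euro sign; Latin1 has no code point for it.
euro_cp1252 = "€".encode("cp1252")
try:
    "€".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
print(euro_cp1252, euro_in_latin1)
```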

Unicode

provides a code space of 1,114,112 code points (U+0000 through U+10FFFF), far more than any of the 7-bit or 8-bit encodings described above. Unicode can accommodate essentially all of the world's languages.

There are three Unicode encoding forms:

UTF-8

is a multibyte character set (MBCS) encoding that contains the Latin-script languages, Greek, Cyrillic, Arabic, and Hebrew, and East Asian languages such as Japanese, Chinese, and Korean. The characters in UTF-8 are of varying width, from one to four bytes. UTF-8 maintains ASCII compatibility by preserving the ASCII characters in code positions 0 through 127.

UTF-16

is a 16-bit form that contains all of the most common characters in all modern writing systems. Most of the characters are uniformly represented with two bytes, although there is extended space, called surrogate space, for additional characters that require four bytes.

UTF-32

is a 32-bit form whose characters each occupy 4 bytes.
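The widths described for the three encoding forms can be compared side by side. This Python sketch is a minimal illustration with arbitrarily chosen sample characters. It uses the little-endian codec variants so that no byte order mark is added to the byte counts.

```python
# One ASCII letter, one Latin1-range letter, one CJK character in the
# Basic Multilingual Plane, and one supplementary-plane character.
SAMPLES = ["A", "\u00e9", "\u65e5", "\U0001F600"]

def widths(ch):
    """Bytes needed for one character in UTF-8, UTF-16, and UTF-32."""
    return tuple(
        len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", widths(ch))

# UTF-8 is byte-for-byte identical to ASCII for ASCII text.
print("plain text".encode("utf-8") == "plain text".encode("ascii"))
```

The supplementary-plane character is the one that needs a surrogate pair in UTF-16, which is why it takes four bytes there instead of two.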

Other encodings

The ISO 8859 family has other members that are designed for other languages. The following table lists Latin1 together with the other members of the family that are approved by ISO.

Other Encodings Approved by ISO

ISO Standard   Name of Encoding   Description
ISO 8859-1     Latin 1            US and West European
ISO 8859-2     Latin 2            Central and East European
ISO 8859-3     Latin 3            South European, Maltese and Esperanto
ISO 8859-4     Baltic             North European
ISO 8859-5     Cyrillic           Slavic languages
ISO 8859-6     Arabic             Arabic
ISO 8859-7     Greek              Modern Greek
ISO 8859-8     Hebrew             Hebrew and Yiddish
ISO 8859-9     Turkish            Turkish
ISO 8859-10    Latin 6            Nordic (Inuit, Sámi, Icelandic)
ISO 8859-11    Latin/Thai         Thai
ISO 8859-13    Latin 7            Baltic Rim
ISO 8859-14    Latin 8            Celtic
ISO 8859-15    Latin 9            West European and Albanian

Additionally, a number of encoding standards have been developed for East Asian languages, some of which are listed in the following table.

Some East Asian Language Encodings Approved by ISO

Standard     Name of Encoding                                  Description
GB 2312-80   Simplified Chinese                                People's Republic of China
CNS 11643    Traditional Chinese                               Taiwan
Big-5        Traditional Chinese                               Taiwan
KS C 5601    Korean National Standard                          Korea
JIS          Japanese Industrial Standard                      Japan
Shift-JIS    Japanese Industrial Standard multibyte encoding   Japan
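To give a feel for how the East Asian encodings behave, the sketch below encodes short sample strings with a few of Python's built-in codecs. The codec names and sample strings are assumptions chosen for illustration. In particular, euc_kr is used as a stand-in for KS C 5601, on which the EUC-KR encoding is based.

```python
SAMPLES = {
    "shift_jis": "日本語",   # Japanese
    "gb2312": "中文",        # Simplified Chinese
    "big5": "中文",          # Traditional Chinese (both characters exist in Big-5)
    "euc_kr": "한국어",      # Korean
}

def describe(codec, text):
    """Return (character count, byte count, round-trip succeeded)."""
    raw = text.encode(codec)
    return len(text), len(raw), raw.decode(codec) == text

for codec, text in SAMPLES.items():
    print(codec, describe(codec, text))

# The same character has different byte values in different encodings.
print("中".encode("gb2312") != "中".encode("big5"))
```

Each of these ideographic or Hangul characters occupies two bytes, and the byte values are specific to the encoding, so the encoding must be known before the bytes can be interpreted.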

There are other encodings in the standards for EBCDIC and Windows that support different languages and locales.
