
Options for Commands, Statements, and Procedures for NLS

Collating Sequence Option



Specifies the collating sequence for PROC SORT.
Valid in: PROC SORT statement
PROC SORT statement: Sorts observations in a SAS data set by one or more character or numeric variables

Syntax

PROC SORT collating-sequence-option <other option(s)>;

Options

Task                                  Option
Specify the collating sequence
  Specify ASCII                       ASCII
  Specify EBCDIC                      EBCDIC
  Specify Danish                      DANISH
  Specify Finnish                     FINNISH
  Specify Norwegian                   NORWEGIAN
  Specify Polish                      POLISH
  Specify Swedish                     SWEDISH
  Specify a customized sequence       NATIONAL
  Specify a collating sequence by     SORTSEQ=
  name, translation table, encoding,
  or linguistic keyword

With SORTSEQ=, you can specify any of the collating sequences ASCII, EBCDIC, DANISH, FINNISH, ITALIAN, NORWEGIAN, POLISH, SPANISH, SWEDISH, or NATIONAL; the name of any other system-provided translation table; or the name of a user-created translation table. You can also specify an encoding, or either of the keywords LINGUISTIC or UCA to achieve a locale-appropriate collating sequence.

Options can include one collating-sequence-option and multiple other options. The order of the two types of options does not matter, and you do not need to specify both types in the same PROC SORT step. Only the explanations for the PROC SORT collating-sequence-options follow.
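
As a minimal sketch of the point about option order, the two steps below request the same ASCII sort with the collating-sequence-option placed before and after the other options. The data set WORK.CUSTOMERS, the variable LAST_NAME, and the output data set names are hypothetical and used only for illustration.

```sas
/* Hypothetical input: WORK.CUSTOMERS with a character variable LAST_NAME */

/* Collating-sequence-option first, other options after it */
proc sort ascii data=work.customers out=work.by_name_a;
   by last_name;
run;

/* Same request with the collating-sequence-option last */
proc sort data=work.customers out=work.by_name_b ascii;
   by last_name;
run;
```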

Operating Environment Information:   For information about behavior specific to your operating environment for the DANISH, FINNISH, NORWEGIAN, or SWEDISH collating-sequence-option, see the SAS documentation for your operating environment.

ASCII

sorts character variables using the ASCII collating sequence. You need this option only when you want to achieve an ASCII ordering on a system where EBCDIC is the native collating sequence.

DANISH
NORWEGIAN

sorts characters according to the Danish and Norwegian convention.

The Danish and Norwegian collating sequence is shown in National Collating Sequences of Alphanumeric Characters.

EBCDIC

sorts character variables using the EBCDIC collating sequence. You need this option only when you want to achieve an EBCDIC ordering on a system where ASCII is the native collating sequence.

POLISH

sorts characters according to the Polish convention.

FINNISH
SWEDISH

sorts characters according to the Finnish and Swedish convention. The Finnish and Swedish collating sequence is shown in National Collating Sequences of Alphanumeric Characters.

NATIONAL

sorts character variables using an alternate collating sequence, as defined by your installation, to reflect a country's National Use Differences. To use this option, your site must have a customized national sort sequence defined. Check with the SAS Installation Representative at your site to determine whether a customized national sort sequence is available.

NORWEGIAN

See DANISH.

SWEDISH

See FINNISH.
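
The keyword options described above are written directly in the PROC SORT statement. The following sketch assumes a hypothetical data set WORK.CUSTOMERS with a character variable LAST_NAME; the output data set names are also illustrative.

```sas
/* Sort by the Finnish and Swedish convention */
proc sort data=work.customers out=work.customers_sv swedish;
   by last_name;
run;

/* Request EBCDIC ordering, for example on a host where ASCII is native */
proc sort data=work.customers out=work.customers_ebcdic ebcdic;
   by last_name;
run;
```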

SORTSEQ=collating-sequence

specifies the collating sequence. The collating-sequence can be a collating-sequence-option, a translation table, an encoding, or the keyword LINGUISTIC. Only one collating sequence can be specified. For detailed information, refer to Collating Sequence.

Here are descriptions of the collating sequences:

collating-sequence-option | translation_table

specifies either a translation table, which can be one that SAS provides or any user-defined translation table, or one of the PROC SORT statement Collating-Sequence-Options. For an example of using PROC TRANTAB and PROC SORT with SORTSEQ=, see Using Different Translation Tables for Sorting.

The available translation tables are

ASCII

DANISH

EBCDIC

FINNISH

ITALIAN

NORWEGIAN

POLISH

REVERSE

SPANISH

SWEDISH
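
As an illustration, one of the listed translation tables can be named with SORTSEQ=. This sketch uses the REVERSE table; the data set WORK.CUSTOMERS and the variable LAST_NAME are hypothetical.

```sas
/* Name a SAS-supplied translation table as the collating sequence */
proc sort data=work.customers out=work.customers_rev sortseq=reverse;
   by last_name;
run;
```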

The following figure shows how the alphanumeric characters in each language will sort.

[Figure: National Collating Sequences of Alphanumeric Characters]

Restriction: You can specify only one collating-sequence-option in a PROC SORT step.
Tip: The SORTSEQ= collating sequence options are specified without parentheses and have no arguments associated with them. An example of how to specify a collating sequence follows:

proc sort data=mydata SORTSEQ=ASCII;

encoding-value

specifies an encoding value. The result is the same as a binary collation of the character data represented in the specified encoding. See the supported encoding values in SBCS, DBCS, and Unicode Encoding Values for Transcoding Data.
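
A minimal sketch of an encoding value used as the collating sequence follows. It assumes that utf-8 is among the supported encoding values on your system; because the name contains a hyphen, it is written in quotation marks. The data set WORK.CUSTOMERS and the variable LAST_NAME are hypothetical.

```sas
/* Binary collation of the character data as represented in the named encoding */
/* Assumption: 'utf-8' is a supported encoding value in this installation      */
proc sort data=work.customers out=work.customers_utf8 sortseq='utf-8';
   by last_name;
run;
```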

Restriction: PROC SORT is the only procedure or part of the SAS system that recognizes an encoding specified for the SORTSEQ= option.
Tip: When the encoding value contains a character other than an alphanumeric character or underscore, the value needs to be enclosed in quotation marks.
See: SBCS, DBCS, and Unicode Encoding Values for Transcoding Data for the list of encodings that can be specified.

LINGUISTIC<(collating-rules)>

specifies linguistic collation, which sorts characters according to the rules of the specified language. The rules and default collating sequence options are based on the language specified in the current locale setting. The implementation is provided by the International Components for Unicode (ICU) library and produces results that are largely compatible with the Unicode Collation Algorithm (UCA).

Alias: UCA
Restriction: SORTSEQ=LINGUISTIC is available only with the SORTSEQ= option of PROC SORT. It is not available with the SORTSEQ= system option.
Restriction: Linguistic collation is not supported on VMS on Itanium (VMI) or 64-bit Windows on Itanium (W64).
Tip: LINGUISTIC sorting requires more memory on z/OS. You might need to set your REGION to 50M or higher. This action must be done in JCL if you are running in batch mode, or in the VERIFY screen if you are running interactively. This action allows the ICU libraries to load properly and does not affect the memory that is used for sorting.
Tip: The collating-rules must be enclosed in parentheses. More than one collating rule can be specified.
Tip: When BY processing is performed on data sets that are sorted with linguistic collation, the NOBYSORTED system option might need to be specified in order for the data set to be treated properly. BY processing is performed differently than collating sequence processing.
See: The ICU License agreement in the Base SAS Procedures Guide.
See: Collating Sequence for detailed information on linguistic collation.
See Also: The http://www.unicode.org Web site for the Unicode Collation Algorithm (UCA) specification.
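
The sketch below shows linguistic collation with the locale defaults, followed by BY processing of the sorted data. Setting NOBYSORTED here follows the tip above and might not be needed in every case. The data sets and the variable LAST_NAME are hypothetical.

```sas
/* Treat BY groups as grouped rather than strictly ordered, per the tip above */
options nobysorted;

/* Linguistic collation using the rules of the current locale */
proc sort data=work.customers out=work.customers_ling sortseq=linguistic;
   by last_name;
run;

/* BY processing on the linguistically sorted data: keep the first row per name */
data work.first_per_name;
   set work.customers_ling;
   by last_name;
   if first.last_name;
run;
```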

The following are the collating-rules that can be specified for the LINGUISTIC option. These rules modify the linguistic collating sequence:

ALTERNATE_HANDLING=SHIFTED

controls the handling of variable characters such as spaces, punctuation, and symbols. When this option is not specified (that is, when the default value NON_IGNORABLE is used), differences among these variable characters are of the same importance as differences among letters. If ALTERNATE_HANDLING=SHIFTED is specified, these variable characters are of minor importance.
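
As a sketch, SHIFTED can be combined with a quaternary strength so that spaces, punctuation, and symbols matter only when the strings are otherwise identical. The example assumes that multiple collating rules are separated by blanks inside the parentheses; the data set WORK.TITLES and the variable TITLE are hypothetical.

```sas
/* Give spaces, punctuation, and symbols minor importance in the comparison */
proc sort data=work.titles out=work.titles_sorted
          sortseq=linguistic(alternate_handling=shifted strength=quaternary);
   by title;
run;
```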

Default: NON_IGNORABLE
Tip: The SHIFTED value is often used in combination with STRENGTH=QUATERNARY. In such a case, whitespace, punctuation, and symbols are considered when comparing strings, but only if all other aspects of the strings (base letters, accents, and case) are identical.

CASE_FIRST=

specifies the order of uppercase and lowercase letters. This argument is valid only for the TERTIARY, QUATERNARY, or IDENTICAL levels. The following table provides the values and information for the CASE_FIRST argument:

Value    Description
UPPER    Sorts uppercase letters first, then the lowercase letters.
LOWER    Sorts lowercase letters first, then the uppercase letters.
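
A short sketch of CASE_FIRST= follows. STRENGTH=TERTIARY is set explicitly because the argument is valid only at the tertiary level or above and the default strength depends on the locale. The data set WORK.CUSTOMERS and the variable LAST_NAME are hypothetical.

```sas
/* Place uppercase letters ahead of their lowercase counterparts */
proc sort data=work.customers out=work.customers_upper
          sortseq=linguistic(strength=tertiary case_first=upper);
   by last_name;
run;
```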

COLLATION=

If you do not select a collation value, then the user's locale-default collation is selected. The following table lists the available COLLATION= values:

Value          Description
BIG5HAN        specifies Pinyin ordering for Latin and specifies big5 charset ordering for Chinese, Japanese, and Korean characters.
DIRECT         specifies a Hindi variant.
GB2312HAN      specifies Pinyin ordering for Latin and specifies gb2312han charset ordering for Chinese, Japanese, and Korean characters.
PHONEBOOK      specifies a telephone-book style for ordering of characters. Select PHONEBOOK only with the German language.
PINYIN         specifies an ordering for Chinese, Japanese, and Korean characters based on character-by-character transliteration into Pinyin. This ordering is typically used with Simplified Chinese.
POSIX          is the Portable Operating System Interface. This option specifies a "C" locale ordering of characters.
STROKE         specifies a nonalphabetic writing style ordering of characters. Select STROKE with Chinese, Japanese, Korean, or Vietnamese languages. This ordering is typically used with Traditional Chinese.
TRADITIONAL    specifies a traditional style for ordering of characters. For example, select TRADITIONAL with the Spanish language.
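
For example, the PHONEBOOK value might be paired with a German locale, as in the sketch below. The locale name de_DE, the data set WORK.CUSTOMERS, and the variable LAST_NAME are assumptions made for illustration.

```sas
/* Telephone-book ordering, intended for German text */
proc sort data=work.customers out=work.customers_phonebook
          sortseq=linguistic(locale=de_DE collation=phonebook);
   by last_name;
run;
```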

LOCALE=locale_name

specifies the locale name in the form of a POSIX name. For example, ja_JP. See the Values for the LOCALE= System Option for a list of locale and POSIX values supported by PROC SORT.

Restriction: The following locales are not supported by PROC SORT:

Afrikaans_SouthAfrica, af_ZA

Cornish_UnitedKingdom, kw_GB

ManxGaelic_UnitedKingdom, gv_GB
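
A minimal sketch of LOCALE= follows, using the POSIX name ja_JP mentioned above to override the session locale for a single sort. The data set WORK.CUSTOMERS and the variable LAST_NAME are hypothetical.

```sas
/* Apply the collating rules of a specific locale instead of the session locale */
proc sort data=work.customers out=work.customers_ja
          sortseq=linguistic(locale=ja_JP);
   by last_name;
run;
```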

NUMERIC_COLLATION=

orders integer values within the text by their numeric value instead of by the characters used to represent the numbers.

Value    Description
ON       Order numbers by the numeric value. For example, "8 Main St." would sort before "45 Main St.".
OFF      Order numbers by the character value. For example, "45 Main St." would sort before "8 Main St.".
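
The behavior in the table can be tried with a small sketch such as the following. The data set WORK.ADDRESSES and the variable ADDRESS are hypothetical sample data. With NUMERIC_COLLATION=ON, the table above indicates that "8 Main St." is placed before "45 Main St.".

```sas
/* Hypothetical sample data: street addresses stored as text */
data work.addresses;
   infile datalines truncover;
   input address $char20.;
   datalines;
45 Main St.
8 Main St.
;
run;

/* Order embedded integers by numeric value rather than character by character */
proc sort data=work.addresses out=work.addresses_sorted
          sortseq=linguistic(numeric_collation=on);
   by address;
run;

proc print data=work.addresses_sorted;
run;
```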

Default: OFF

STRENGTH=

The value of strength is related to the collation level. There are five collation-level values. The following table provides information about the five levels. The default value for strength is related to the locale.

PRIMARY or 1
   Type of collation: PRIMARY specifies differences between base characters (for example, "a" < "b").
   Description: It is the strongest difference. For example, dictionaries are divided into different sections by base character.

SECONDARY or 2
   Type of collation: Accents in the characters are considered secondary differences (for example, "as" < "às" < "at").
   Description: A secondary difference is ignored when there is a primary difference anywhere in the strings. Other differences between letters can also be considered secondary differences, depending on the language.

TERTIARY or 3
   Type of collation: Uppercase and lowercase differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò").
   Description: A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. Another example is the difference between large and small Kana.

QUATERNARY or 4
   Type of collation: When punctuation is ignored at levels 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB").
   Description: The quaternary level should be used if ignoring punctuation is required or when processing Japanese text. This difference is ignored when there is a primary, secondary, or tertiary difference.

IDENTICAL or 5
   Type of collation: When all other levels are equal, the identical level is used as a tiebreaker.
   Description: The Unicode code point values of the Normalization Form D (NFD) form of each string are compared at this level, in case there is no difference at levels 1-4. Use this level sparingly, because strings that differ only in code point values are extremely rare. For example, only Hebrew cantillation marks are distinguished at this level.
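
As a sketch of how the level affects a sort, the step below restricts the comparison to the primary level, so only differences between base characters are considered and accent or case differences are not. The data set WORK.CUSTOMERS and the variable LAST_NAME are hypothetical.

```sas
/* Compare base characters only; accent and case differences are ignored */
proc sort data=work.customers out=work.customers_primary
          sortseq=linguistic(strength=primary);
   by last_name;
run;
```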

Alias: LEVEL=

CAUTION:
If you use a host sort utility to sort your data, then specifying a collating sequence that is based on a translation table with the SORTSEQ= option might corrupt the character BY variables.

For more information, see the PROC SORT documentation for your operating environment.


See Also

Collating Sequence

Procedures:

The SORT Procedure in Base SAS Procedures Guide.

System Options:

SORTSEQ= System Option: UNIX, Windows, and z/OS

TRANTAB= System Option
