Collating Sequence Option

Specifies the collating sequence for PROC SORT.

Valid in:	PROC SORT statement
Note:	The PROC SORT statement sorts observations in a SAS data set by one or more character or numeric variables.

Syntax

PROC SORT collating-sequence-option <other option(s)> ;

Options

Options can include one collating-sequence-option and multiple other options. The order of the two types of options does not matter, and you do not need both types in the same PROC SORT step. Only the PROC SORT collating-sequence-options are explained here.

Operating Environment Information: For information about behavior specific to your operating environment for the DANISH, FINNISH, NORWEGIAN, or SWEDISH collating-sequence-option, see the SAS documentation for your operating environment.

ASCII: sorts character variables using the ASCII collating sequence. You need this option only when you want to achieve an ASCII ordering on a system where EBCDIC is the native collating sequence.

DANISH | NORWEGIAN: sorts characters according to the Danish and Norwegian convention. The Danish and Norwegian collating sequence is shown in Alphanumeric Characters Sorted for Each Language.

EBCDIC: sorts character variables using the EBCDIC collating sequence. You need this option only when you want to achieve an EBCDIC ordering on a system where ASCII is the native collating sequence.

POLISH: sorts characters according to the Polish convention.

FINNISH | SWEDISH: sorts characters according to the Finnish and Swedish convention. The Finnish and Swedish collating sequence is shown in Alphanumeric Characters Sorted for Each Language.

NATIONAL: sorts character variables using an alternate collating sequence, as defined by your installation, to reflect a country's National Use Differences. To use this option, your site must have a customized national sort sequence defined. Check with the SAS Installation Representative at your site to determine whether a customized national sort sequence is available.

NORWEGIAN: See DANISH

SWEDISH: See FINNISH
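As a minimal sketch of how one of these options appears in a step, the following example requests ASCII ordering explicitly. The data set WORK.NAMES, the output data set name, and the variable NAME are hypothetical and used only for illustration.

```sas
/* Hypothetical input data; not part of the original documentation. */
data work.names;
   input name $;
   datalines;
apple
Banana
cherry
;
run;

/* Sort by NAME using the ASCII collating sequence,
   regardless of the native sequence of the host. */
proc sort data=work.names out=work.names_ascii ascii;
   by name;
run;
```

In the ASCII sequence, uppercase letters precede lowercase letters, so Banana would be expected to sort ahead of apple.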

SORTSEQ=collating-sequence

specifies the collating sequence. The collating-sequence can be a collating-sequence-option, a translation table, an encoding, or the keyword LINGUISTIC. Only one collating sequence can be specified. For detailed information, refer to Collating Sequence.

Here are descriptions of the collating sequences:

collating-sequence-option | translation_table

specifies either a translation table, which can be one that SAS provides or any user-defined translation table, or one of the PROC SORT statement Collating-Sequence-Options. For an example of using PROC TRANTAB and PROC SORT with SORTSEQ=, see Using Different Translation Tables for Sorting.

The available translation tables are

ASCII
DANISH
EBCDIC
FINNISH
ITALIAN
NORWEGIAN
POLISH
REVERSE
SPANISH
SWEDISH

The following figure shows how the alphanumeric characters in each language will sort:

[Figure: Alphanumeric Characters Sorted for Each Language (national collation sequences of alphanumeric characters)]

Restriction: You can specify only one collating-sequence-option in a PROC SORT step.

Tip: The SORTSEQ= collating sequence options are specified without parentheses and have no arguments associated with them. An example of how to specify a collating sequence follows: proc sort data=mydata SORTSEQ=ASCII;

encoding-value: specifies an encoding value. The result is the same as a binary collation of the character data represented in the specified encoding. See the supported encoding values in SBCS, DBCS, and Unicode Encoding Values for Transcoding Data.

Restriction: PROC SORT is the only procedure or part of the SAS System that recognizes an encoding specified for the SORTSEQ= option.

Tip: When the encoding value contains a character other than an alphanumeric character or underscore, the value must be enclosed in quotation marks.

See: The list of encodings that can be specified in SBCS, DBCS, and Unicode Encoding Values for Transcoding Data.
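The quoting rule in the tip above can be sketched as follows. This example assumes that utf-8 is among the supported encoding values at your site; because the name contains a hyphen, it is quoted. The data set and variable names are hypothetical.

```sas
/* Hypothetical input data for illustration only. */
data work.names;
   input name $;
   datalines;
zebra
Apple
mango
;
run;

/* The encoding value contains a hyphen, so it is enclosed in quotation marks. */
proc sort data=work.names out=work.names_utf8 sortseq='utf-8';
   by name;
run;
```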

LINGUISTIC<(collating-rules)>: specifies linguistic collation, which sorts characters according to the rules of the specified language. The rules and default collating sequence options are based on the language specified in the current locale setting. The implementation is provided by the International Components for Unicode (ICU) library and produces results that are largely compatible with the Unicode Collation Algorithm (UCA).

Alias: UCA

Restriction: The SORTSEQ=LINGUISTIC option is available only on the PROC SORT SORTSEQ= option and is not available for the SAS System SORTSEQ= option.

Tips: LINGUISTIC sorting requires more memory on the z/OS mainframe. You might need to set your REGION to 50M or higher. This must be done in JCL if you are running in batch mode, or in the VERIFY screen if you are running interactively. This setting allows the ICU libraries to load properly and does not affect the memory that is used for sorting.

The collating-rules must be enclosed in parentheses. More than one collating rule can be specified.

When BY processing is performed on data sets that are sorted with linguistic collation, the NOBYSORTED system option might need to be specified in order for the data set to be treated properly. BY processing is performed differently than collating sequence processing.

See: The ICU License - ICU 1.8.1 and later in Base SAS Procedures Guide.

Collating Sequence for detailed information about linguistic collation.

The http://www.unicode.org website for the Unicode Collation Algorithm (UCA) specification.

The following collating-rules can be specified for the LINGUISTIC option. These rules modify the linguistic collating sequence:

ALTERNATE_HANDLING=SHIFTED: controls the handling of variable characters such as spaces, punctuation, and symbols. When this option is not specified (that is, when the default value NON_IGNORABLE is in effect), differences among these variable characters are of the same importance as differences among letters. If ALTERNATE_HANDLING=SHIFTED is specified, these variable characters are of minor importance.

Default: NON_IGNORABLE

Tip: The SHIFTED value is often used in combination with STRENGTH= set to QUATERNARY. In such a case, whitespace characters, punctuation, and symbols are considered when comparing strings, but only if all other aspects of the strings (base letters, accents, and case) are identical.

CASE_FIRST=

specifies the order of uppercase and lowercase letters. This argument is valid only for the TERTIARY, QUATERNARY, or IDENTICAL strength levels. The following table provides the values and information for the CASE_FIRST argument:

Value	Description
UPPER	Sorts uppercase letters first, then the lowercase letters.
LOWER	Sorts lowercase letters first, then the uppercase letters.

COLLATION=

The following table lists the available COLLATION= values. If you do not select a collation value, then the user's locale-default collation is selected.

Value	Description
BIG5HAN	specifies Pinyin ordering for Latin and specifies big5 charset ordering for Chinese, Japanese, and Korean characters.
DIRECT	specifies a Hindi variant.
GB2312HAN	specifies Pinyin ordering for Latin and specifies gb2312han charset ordering for Chinese, Japanese, and Korean characters.
PHONEBOOK	specifies a telephone-book style for ordering of characters. Select PHONEBOOK only with the German language.
PINYIN	specifies an ordering for Chinese, Japanese, and Korean characters based on character-by-character transliteration into Pinyin. This ordering is typically used with simplified Chinese.
POSIX	is the Portable Operating System Interface. This option specifies a "C" locale ordering of characters.
STROKE	specifies a nonalphabetic writing style ordering of characters. Select STROKE with Chinese, Japanese, Korean, or Vietnamese languages. This ordering is typically used with Traditional Chinese.
TRADITIONAL	specifies a traditional style for ordering of characters. For example, select TRADITIONAL with the Spanish language.

LOCALE=locale_name

specifies the locale name in the form of a POSIX name (for example, ja_JP). See LOCALE= Values and Default Settings for ENCODING, PAPERSIZE, DFLANG, and DATESTYLE Options for a list of locale and POSIX values supported by PROC SORT.

Restriction: The following locales are not supported by PROC SORT:

Afrikaans_SouthAfrica, af_ZA
Cornish_UnitedKingdom, kw_GB
ManxGaelic_UnitedKingdom, gv_GB
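As a hedged sketch of combining LOCALE= with COLLATION=, the following step requests German telephone-book ordering. It assumes that multiple collating rules inside the parentheses are separated by blanks, as in other SAS option lists; the data set and variable names are hypothetical.

```sas
/* Hypothetical input data for illustration only. */
data work.names;
   input name $;
   datalines;
Mueller
Muller
Meyer
;
run;

/* Request the German locale with PHONEBOOK collation.
   Both rules go inside one set of parentheses. */
proc sort data=work.names out=work.names_de
          sortseq=linguistic(locale=de_DE collation=phonebook);
   by name;
run;
```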

NUMERIC_COLLATION=

orders integer values within the text by their numeric value instead of by the characters used to represent the numbers.

Value	Description
ON	Order numbers by the numeric value. For example, "8 Main St." would sort before "45 Main St.".
OFF	Order numbers by the character value. For example, "45 Main St." would sort before "8 Main St.".

Default: OFF
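The address example in the table above can be sketched as a complete step. The data set WORK.ADDRESSES and the variable ADDRESS are hypothetical names chosen for illustration.

```sas
/* Hypothetical input data using the addresses from the table above. */
data work.addresses;
   input address $char20.;
   datalines;
45 Main St.
8 Main St.
;
run;

/* With NUMERIC_COLLATION=ON, embedded integers are compared by value,
   so "8 Main St." is ordered before "45 Main St." */
proc sort data=work.addresses out=work.addresses_num
          sortseq=linguistic(numeric_collation=on);
   by address;
run;
```

Omitting the rule (or specifying NUMERIC_COLLATION=OFF) leaves the digits compared as characters, which places "45 Main St." first.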

STRENGTH=

The value of strength is related to the collation level. There are five collation-level values. The following table provides information about the five levels. The default value for strength is related to the locale.

Value	Type of Collation	Description
PRIMARY or 1	PRIMARY specifies differences between base characters (for example, "a" < "b").	It is the strongest difference. For example, dictionaries are divided into different sections by base character.
SECONDARY or 2	Accents in the characters are considered secondary differences (for example, "as" < "às" < "at").	A secondary difference is ignored when there is a primary difference anywhere in the strings. Other differences between letters can also be considered secondary differences, depending on the language.
TERTIARY or 3	Uppercase and lowercase differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò").	A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. Another example is the difference between large and small Kana.
QUATERNARY or 4	When punctuation is ignored at levels 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB").	The quaternary level should be used if ignoring punctuation is required or when processing Japanese text. This difference is ignored when there is a primary, secondary, or tertiary difference.
IDENTICAL or 5	When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the Normalization Form D (NFD) form of each string are compared at this level, in case there is no difference at levels 1-4.	This level should be used sparingly, because strings that differ only in code point values are extremely rare. For example, only Hebrew cantillation marks are distinguished at this level.

Alias: LEVEL=
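A minimal sketch of the STRENGTH= rule follows. At the PRIMARY level only base-character differences are compared, so values that differ only by accent or case are treated as equal for ordering purposes. The data set and variable names are hypothetical.

```sas
/* Hypothetical input data: values differ only in case. */
data work.names;
   input name $;
   datalines;
apple
Apple
APPLE
banana
;
run;

/* STRENGTH=PRIMARY (or STRENGTH=1) ignores secondary (accent)
   and tertiary (case) differences when ordering. */
proc sort data=work.names out=work.names_primary
          sortseq=linguistic(strength=primary);
   by name;
run;
```

Here the three spellings of "apple" compare as equal at the primary level, and all of them precede banana.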

CAUTION:

If you use a host sort utility to sort your data, then specifying a translation-table-based collating sequence with the SORTSEQ= option might corrupt the character BY variables. For more information, see the PROC SORT documentation for your operating environment.

Details

The collating sequence option in the PROC SORT statement sorts observations in a SAS data set by one or more character or numeric variables.

Options

Task		Option
Specify the collating sequence
	Specify ASCII	ASCII
	Specify EBCDIC	EBCDIC
	Specify Danish	DANISH
	Specify Finnish	FINNISH
	Specify Norwegian	NORWEGIAN
	Specify Polish	POLISH
	Specify Swedish	SWEDISH
	Specify a customized sequence	NATIONAL
	Specify any of the collating sequences listed above (ASCII, EBCDIC, DANISH, FINNISH, ITALIAN, NORWEGIAN, POLISH, SPANISH, SWEDISH, or NATIONAL), the name of any other system-provided translation table (POLISH, SPANISH), or the name of a user-created translation table. You can also specify an encoding, or either of the keywords LINGUISTIC or UCA to achieve a locale-appropriate collating sequence.	SORTSEQ=
