Collating Sequence Option

Specifies the collating sequence for PROC SORT.
Valid in: PROC SORT statement
Note: The PROC SORT statement sorts observations in a SAS data set by one or more character or numeric variables.

Syntax

PROC SORT collating-sequence-option <other option(s)> ;

Options

Options can include one collating-sequence-option and multiple other options. The order of the two types of options does not matter, and you do not need both types in the same PROC SORT step. Only the PROC SORT collating-sequence-options are explained here.
Operating Environment Information: For information about behavior specific to your operating environment for the DANISH, FINNISH, NORWEGIAN, or SWEDISH collating-sequence-option, see the SAS documentation for your operating environment.
ASCII
sorts character variables using the ASCII collating sequence. You need this option only when you want to achieve an ASCII ordering on a system where EBCDIC is the native collating sequence.
DANISH NORWEGIAN
sorts characters according to the Danish and Norwegian convention. The Danish and Norwegian collating sequence is shown in Alphanumeric Characters Sorted for Each Language.
EBCDIC
sorts character variables using the EBCDIC collating sequence. You need this option only when you want to achieve an EBCDIC ordering on a system where ASCII is the native collating sequence.
POLISH
sorts characters according to the Polish convention.
FINNISH SWEDISH
sorts characters according to the Finnish and Swedish convention. The Finnish and Swedish collating sequence is shown in Alphanumeric Characters Sorted for Each Language.
NATIONAL
sorts character variables using an alternate collating sequence, as defined by your installation, to reflect a country's National Use Differences. To use this option, your site must have a customized national sort sequence defined. Check with the SAS Installation Representative at your site to determine whether a customized national sort sequence is available.
NORWEGIAN
See DANISH
SWEDISH
See FINNISH
SORTSEQ=collating-sequence
specifies the collating sequence. The collating-sequence can be a collating-sequence-option, a translation table, an encoding, or the keyword LINGUISTIC. Only one collating sequence can be specified. For detailed information, refer to Collating Sequence.
Here are descriptions of the collating sequences:
collating-sequence-option | translation-table
specifies either a translation table, which can be one that SAS provides or any user-defined translation table, or one of the PROC SORT statement Collating-Sequence-Options. For an example of using PROC TRANTAB and PROC SORT with SORTSEQ=, see Using Different Translation Tables for Sorting.
The available translation tables are
  • ASCII
  • DANISH
  • EBCDIC
  • FINNISH
  • ITALIAN
  • NORWEGIAN
  • POLISH
  • REVERSE
  • SPANISH
  • SWEDISH
The following figure shows how the alphanumeric characters in each language sort:
[Figure: Alphanumeric Characters Sorted for Each Language (national collation sequences of alphanumeric characters)]
Restriction: You can specify only one collating-sequence-option in a PROC SORT step.
Tip: The SORTSEQ= collating sequence options are specified without parentheses and have no arguments associated with them. An example of how to specify a collating sequence follows: proc sort data=mydata SORTSEQ=ASCII;
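As a fuller illustration of the tip above, the following sketch shows a complete PROC SORT step in both the SORTSEQ= form and the keyword form. The data set WORK.MYDATA and the variable NAME are hypothetical and are created here only so that the step can run on its own.

```sas
/* Hypothetical sample data, for illustration only */
data work.mydata;
   input name $;
   datalines;
banana
Apple
cherry
;
run;

/* SORTSEQ= form: request the ASCII collating sequence */
proc sort data=work.mydata out=work.sorted_a sortseq=ascii;
   by name;
run;

/* Keyword form: ASCII is given directly as the collating-sequence-option */
proc sort data=work.mydata out=work.sorted_b ascii;
   by name;
run;
```

Only one of the two forms can appear in a single PROC SORT step, because only one collating sequence can be specified.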
encoding-value
specifies an encoding value. The result is the same as a binary collation of the character data represented in the specified encoding. See the supported encoding values in SBCS, DBCS, and Unicode Encoding Values for Transcoding Data.
Restriction: PROC SORT is the only procedure or part of the SAS system that recognizes an encoding specified for the SORTSEQ= option.
Tip: When the encoding value contains a character other than an alphanumeric character or underscore, the value needs to be enclosed in quotation marks.
See: The list of the encodings that can be specified in SBCS, DBCS, and Unicode Encoding Values for Transcoding Data.
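The quoting rule in the tip above can be sketched as follows. The data set WORK.CITIES and the variable CITY are hypothetical, and the encoding names wlatin1 and utf-8 are assumed to be available at your site; check the list of supported encoding values before relying on them.

```sas
/* Hypothetical sample data, for illustration only */
data work.cities;
   input city $;
   datalines;
Oslo
Turku
Gdansk
;
run;

/* Encoding name made up of alphanumeric characters: no quotation marks needed */
proc sort data=work.cities out=work.cities_w1 sortseq=wlatin1;
   by city;
run;

/* Encoding name that contains a hyphen: enclose it in quotation marks */
proc sort data=work.cities out=work.cities_u8 sortseq='utf-8';
   by city;
run;
```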
LINGUISTIC<(collating-rules)>
specifies linguistic collation, which sorts characters according to the rules of the specified language. The rules and default collating sequence options are based on the language specified in the current locale setting. The implementation is provided by the International Components for Unicode (ICU) library and produces results that are largely compatible with the Unicode Collation Algorithm (UCA).
Alias: UCA
Restriction: The SORTSEQ=LINGUISTIC option is available only on the PROC SORT SORTSEQ= option and is not available for the SAS System SORTSEQ= option.
Tips: LINGUISTIC sorting requires more memory on z/OS. You might need to set your REGION to 50M or higher. Do this in JCL if you are running in batch mode, or in the VERIFY screen if you are running interactively. This setting allows the ICU libraries to load properly and does not affect the memory that is used for sorting.

The collating-rules must be enclosed in parentheses. More than one collating rule can be specified.

When BY processing is performed on data sets that are sorted with linguistic collation, the NOBYSORTED system option might need to be specified in order for the data set to be treated properly. BY processing is performed differently than collating sequence processing.
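A minimal sketch of the BY-processing tip follows. The data set WORK.WORDS and the variable WORD are hypothetical. Whether NOBYSORTED is actually required depends on your data and environment, so treat this as one possible arrangement rather than a required pattern.

```sas
/* Hypothetical sample data, for illustration only */
data work.words;
   input word $;
   datalines;
apple
Apple
banana
;
run;

/* Sort with linguistic collation, using the locale defaults */
proc sort data=work.words out=work.words_ling sortseq=linguistic;
   by word;
run;

/* Treat BY groups as grouped rather than strictly sorted */
options nobysorted;

data work.first_words;
   set work.words_ling;
   by word;
   if first.word;   /* keep the first observation of each BY group */
run;

/* Restore the default setting */
options bysorted;
```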

See: The ICU License - ICU 1.8.1 and later in Base SAS Procedures Guide

Collating Sequence for detailed information about linguistic collation.

The http://www.unicode.org website for the Unicode Collation Algorithm (UCA) specification.

The following are the collating-rules that can be specified for the LINGUISTIC option. These rules modify the linguistic collating sequence:
ALTERNATE_HANDLING=SHIFTED
controls the handling of variable characters such as spaces, punctuation, and symbols. When this option is not specified (the default value, NON_IGNORABLE, is used), differences among these variable characters are of the same importance as differences among letters. When ALTERNATE_HANDLING=SHIFTED is specified, these variable characters are of minor importance.
Default: NON_IGNORABLE
Tip: The SHIFTED value is often used in combination with STRENGTH= set to QUATERNARY. In such a case, whitespace characters, punctuation, and symbols are considered when comparing strings, but only if all other aspects of the strings (base letters, accents, and case) are identical.
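The combination described in the tip above might be written as in the following sketch. The data set WORK.TERMS and the variable TERM are hypothetical, and the sketch assumes that multiple collating rules are separated by blanks inside the parentheses.

```sas
/* Hypothetical sample data: the same letters with and without punctuation */
data work.terms;
   input term $char10.;
   datalines;
co-op
coop
co op
;
run;

/* Spaces and punctuation become a minor difference, considered only
   when base letters, accents, and case are identical */
proc sort data=work.terms out=work.terms_sorted
          sortseq=linguistic(alternate_handling=shifted strength=quaternary);
   by term;
run;
```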
CASE_FIRST=
specifies the order of uppercase and lowercase letters. This argument is valid only for the TERTIARY, QUATERNARY, or IDENTICAL levels. The following table lists the values of the CASE_FIRST argument:
Value | Description
UPPER | Sorts uppercase letters first, then the lowercase letters.
LOWER | Sorts lowercase letters first, then the uppercase letters.
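As a brief illustration of CASE_FIRST=, the following sketch asks for uppercase letters to sort ahead of their lowercase counterparts. The data set WORK.FRUIT and the variable NAME are hypothetical, and STRENGTH=TERTIARY is given explicitly only to make the required level visible.

```sas
/* Hypothetical sample data: the same word in two cases */
data work.fruit;
   input name $;
   datalines;
apple
Apple
banana
;
run;

/* Uppercase letters sort before the matching lowercase letters */
proc sort data=work.fruit out=work.fruit_sorted
          sortseq=linguistic(strength=tertiary case_first=upper);
   by name;
run;
```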
COLLATION=
The following table lists the available COLLATION= values. If you do not select a collation value, then the user's locale-default collation is selected.
Value | Description
BIG5HAN | specifies Pinyin ordering for Latin and specifies big5 charset ordering for Chinese, Japanese, and Korean characters.
DIRECT | specifies a Hindi variant.
GB2312HAN | specifies Pinyin ordering for Latin and specifies gb2312han charset ordering for Chinese, Japanese, and Korean characters.
PHONEBOOK | specifies a telephone-book style for ordering of characters. Select PHONEBOOK only with the German language.
PINYIN | specifies an ordering for Chinese, Japanese, and Korean characters based on character-by-character transliteration into Pinyin. This ordering is typically used with simplified Chinese.
POSIX | is the Portable Operating System Interface. This option specifies a "C" locale ordering of characters.
STROKE | specifies a nonalphabetic writing style ordering of characters. Select STROKE with Chinese, Japanese, Korean, or Vietnamese languages. This ordering is typically used with Traditional Chinese.
TRADITIONAL | specifies a traditional style for ordering of characters. For example, select TRADITIONAL with the Spanish language.
LOCALE=locale_name
specifies the locale name in the form of a POSIX name (for example, ja_JP). See LOCALE= Values and Default Settings for ENCODING, PAPERSIZE, DFLANG, and DATESTYLE Options for a list of locale and POSIX values supported by PROC SORT.
Restriction: The following locales are not supported by PROC SORT:
  • Afrikaans_SouthAfrica, af_ZA
  • Cornish_UnitedKingdom, kw_GB
  • ManxGaelic_UnitedKingdom, gv_GB
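LOCALE= and COLLATION= can be combined, for example to request German telephone-book ordering. The sketch below is illustrative only: the data set WORK.NAMES and the variable LAST_NAME are hypothetical, the session encoding is assumed to be able to represent the umlaut in the sample data, and multiple rules are assumed to be separated by blanks.

```sas
/* Hypothetical sample data; assumes the session encoding supports "ü" */
data work.names;
   input last_name $char12.;
   datalines;
Mutter
Müller
Mueller
;
run;

/* German locale with telephone-book style collation */
proc sort data=work.names out=work.names_sorted
          sortseq=linguistic(locale=de_DE collation=phonebook);
   by last_name;
run;
```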
NUMERIC_COLLATION=
orders integer values within the text by their numeric value instead of by the characters used to represent the numbers.
Value | Description
ON | Orders numbers by the numeric value. For example, "8 Main St." would sort before "45 Main St.".
OFF | Orders numbers by the character value. For example, "45 Main St." would sort before "8 Main St.".
Default: OFF
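The address example from the table can be tried with a sketch like the following. The data set WORK.ADDRESSES and the variable ADDRESS are hypothetical.

```sas
/* Hypothetical sample data taken from the example in the table */
data work.addresses;
   input address $char20.;
   datalines;
45 Main St.
8 Main St.
;
run;

/* Compare embedded integers by numeric value rather than character by character */
proc sort data=work.addresses out=work.addr_sorted
          sortseq=linguistic(numeric_collation=on);
   by address;
run;
```

Changing the rule to NUMERIC_COLLATION=OFF, or omitting it, requests ordering by character value instead.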
STRENGTH=
The STRENGTH= value sets the collation level. There are five collation-level values, which are described in the following table. The default value for strength depends on the locale.
Value | Type of Collation | Description
PRIMARY or 1 | PRIMARY specifies differences between base characters (for example, "a" < "b"). | It is the strongest difference. For example, dictionaries are divided into different sections by base character.
SECONDARY or 2 | Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). | A secondary difference is ignored when there is a primary difference anywhere in the strings. Other differences between letters can also be considered secondary differences, depending on the language.
TERTIARY or 3 | Uppercase and lowercase differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). | A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. Another example is the difference between large and small Kana.
QUATERNARY or 4 | When punctuation is ignored at levels 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB"). | The quaternary level should be used if ignoring punctuation is required or when processing Japanese text. This difference is ignored when there is a primary, secondary, or tertiary difference.
IDENTICAL or 5 | When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the Normalization Form D (NFD) form of each string are compared at this level, in case there is no difference at levels 1-4. | This level should be used sparingly, because strings that differ only in their code point values are extremely rare. For example, only Hebrew cantillation marks are distinguished at this level.
Alias: LEVEL=
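To illustrate the effect of the collation level, the following sketch sorts at the PRIMARY level, where only base-character differences are compared and case differences are therefore not distinguished. The data set WORK.WORDS and the variable WORD are hypothetical.

```sas
/* Hypothetical sample data: values that differ only in case */
data work.words;
   input word $;
   datalines;
banana
apple
Apple
APPLE
;
run;

/* PRIMARY (or 1): compare base characters only */
proc sort data=work.words out=work.words_primary
          sortseq=linguistic(strength=primary);
   by word;
run;
```

Raising the level to TERTIARY would bring uppercase and lowercase differences back into the comparison.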
CAUTION:
If you use a host sort utility to sort your data, then specifying a collating sequence that is based on a translation table with the SORTSEQ= option might corrupt the character BY variables. For more information, see the PROC SORT documentation for your operating environment.

Details

The collating sequence option in the PROC SORT statement determines the order in which observations in a SAS data set are sorted by one or more character or numeric variables.
Options
Task | Option
Specify the collating sequence:
Specify ASCII | ASCII
Specify EBCDIC | EBCDIC
Specify Danish | DANISH
Specify Finnish | FINNISH
Specify Norwegian | NORWEGIAN
Specify Polish | POLISH
Specify Swedish | SWEDISH
Specify a customized sequence | NATIONAL
Specify any of the collating sequences listed above (ASCII, EBCDIC, DANISH, FINNISH, ITALIAN, NORWEGIAN, POLISH, SPANISH, SWEDISH, or NATIONAL), the name of any other system-provided translation table (POLISH, SPANISH), or the name of a user-created translation table. You can specify an encoding. You can also specify either the keyword LINGUISTIC or UCA to achieve a locale-appropriate collating sequence. | SORTSEQ=