Collating Sequence Option
Specifies the collating sequence for PROC SORT.
Valid in: PROC SORT statement
Note: The PROC SORT statement sorts observations in a SAS data
set by one or more character or numeric variables.
Syntax
PROC SORT collating-sequence-option <other option(s)> ;
Options
A PROC SORT statement can include one collating-sequence-option and
multiple other options. The order of the two types of options does not
matter, and you do not need both types in the same PROC SORT step. Only
the PROC SORT collating-sequence-options are explained here.
Operating Environment Information: For information about behavior specific to your operating environment
for the DANISH, FINNISH, NORWEGIAN, or SWEDISH
collating-sequence-option, see the SAS documentation
for your operating environment.
- ASCII
-
sorts character variables
using the ASCII collating sequence. You need this option only when
you want to achieve an ASCII ordering on a system where EBCDIC is
the native collating sequence.
- DANISH | NORWEGIAN
-
sorts characters according
to the Danish and Norwegian convention.
- EBCDIC
-
sorts character variables
using the EBCDIC collating sequence. You need this option only when
you want to achieve an EBCDIC ordering on a system where ASCII is
the native collating sequence.
- POLISH
-
sorts characters according
to the Polish convention.
- FINNISH | SWEDISH
-
sorts characters according
to the Finnish and Swedish convention.
- NATIONAL
-
sorts character variables
using an alternate collating sequence, as defined by your installation,
to reflect a country's National Use Differences. To use this option,
your site must have a customized national sort sequence defined.
Check with the SAS Installation Representative at your site to determine
whether a customized national sort sequence is available.
- NORWEGIAN
-
See DANISH
- SWEDISH
-
See FINNISH
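As a minimal sketch of the keyword form, the step below requests EBCDIC ordering on a system whose native collating sequence is assumed to be ASCII. The data set WORK.CODES and the variable CODE are hypothetical names used only for illustration.

```sas
/* Hypothetical sample data; WORK.CODES and CODE are illustrative names. */
data work.codes;
  input code $;
  datalines;
a1
A1
11
;
run;

/* The collating-sequence-option is written as a bare keyword
   in the PROC SORT statement. */
proc sort data=work.codes out=work.codes_ebcdic ebcdic;
  by code;
run;
```

In EBCDIC, lowercase letters collate before uppercase letters and digits collate last, so the result differs from native ASCII order, in which digits come first.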
- SORTSEQ=collating-sequence
-
specifies the collating
sequence. The
collating-sequence can be a collating-sequence-option, a translation table, an encoding,
or the keyword LINGUISTIC. Only one collating sequence can be specified.
For detailed information, refer to
Collating Sequence.
Here are descriptions
of the collating sequences:
- collating-sequence-option | translation_table
-
specifies either a
translation table, which can be one that SAS provides or any user-defined
translation table, or one of the PROC SORT statement Collating-Sequence-Options.
For an example of using PROC TRANTAB and PROC SORT with SORTSEQ=,
see
Using Different Translation Tables for Sorting.
The available translation
tables include ASCII, DANISH, EBCDIC, FINNISH, ITALIAN, NORWEGIAN,
POLISH, SPANISH, and SWEDISH.
[Figure: how the alphanumeric characters in each language sort]
Restriction: You can specify only one collating-sequence-option in
a PROC SORT step.
Tip: The SORTSEQ= collating-sequence-options are specified
without parentheses and have no arguments associated with them. For
example: proc sort data=mydata SORTSEQ=ASCII;
- encoding-value
-
Restriction: PROC SORT is the only procedure or part of SAS
that recognizes an encoding specified for the SORTSEQ= option.
Tip: When the encoding value contains a character other than
an alphanumeric character or underscore, enclose the value
in quotation marks.
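The quoting rule in the tip can be sketched as follows. The encoding names wlatin1 and open_ed-1047 are used here on the assumption that both are accepted as SORTSEQ= encoding values at your site; WORK.CODES and CODE are hypothetical names.

```sas
/* Hypothetical sample data; WORK.CODES and CODE are illustrative names. */
data work.codes;
  input code $;
  datalines;
a1
A1
11
;
run;

/* An encoding name made up only of alphanumeric characters and
   underscores can be written without quotation marks. */
proc sort data=work.codes out=work.codes_wlatin1 sortseq=wlatin1;
  by code;
run;

/* A name that contains a hyphen is enclosed in quotation marks. */
proc sort data=work.codes out=work.codes_1047 sortseq='open_ed-1047';
  by code;
run;
```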
- LINGUISTIC<(collating-rules)>
-
specifies linguistic
collation, which sorts characters according to rules of the specified
language. The rules and default collating sequence options are based
on the language specified in the current locale setting. The implementation
is provided by the International Components for Unicode (ICU) library
and produces results that are largely compatible with the Unicode
Collation Algorithm (UCA).
Alias: UCA
Restriction: The SORTSEQ=LINGUISTIC option is available only on the
PROC SORT SORTSEQ= option and is not available for the SAS system
option SORTSEQ=.
Tips: LINGUISTIC sorting requires more memory on z/OS. You might
need to set REGION to 50M or higher. Do this in the JCL if you are
running in batch mode, or on the VERIFY screen if you are running
interactively. The larger region allows the ICU libraries to load
properly and does not affect the memory that is used for sorting.
The collating-rules must be enclosed in parentheses. More
than one collating rule can be specified.
When BY processing is performed on data sets that are
sorted with linguistic collation, the NOBYSORTED system
option might need to be specified in order for the data set to be
treated properly. BY processing is performed differently than collating
sequence processing.
See: Collating Sequence for detailed information about linguistic collation.
Refer to the http://www.unicode.org website for the Unicode Collation Algorithm (UCA) specification.
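The following sketch shows one way the BY-processing tip might be applied: sort linguistically with the locale defaults, then set NOBYSORTED before a DATA step that uses BY. The data set and variable names are hypothetical, and whether NOBYSORTED is actually needed depends on your data and release.

```sas
/* Hypothetical sample data; WORK.NAMES and NAME are illustrative names. */
data work.names;
  length name $ 12;
  input name $;
  datalines;
Zoe
apple
Apple
;
run;

/* Linguistic collation with the rules of the current locale. */
proc sort data=work.names out=work.names_ling sortseq=linguistic;
  by name;
run;

/* BY processing does not use the linguistic rules, so a linguistically
   sorted data set can appear out of order to it. NOBYSORTED relaxes
   the sorted-order requirement. */
options nobysorted;

data work.first_names;
  set work.names_ling;
  by name;
  if first.name;
run;

/* Restore the default so that later steps are not affected. */
options bysorted;
```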
The following are the
collating-rules that can be specified for the LINGUISTIC option. These
rules modify the linguistic collating sequence:
- ALTERNATE_HANDLING=SHIFTED
-
controls the handling
of variable characters like spaces, punctuation, and symbols. When
this option is not specified (the default value, NON_IGNORABLE, is in
effect), differences among these variable characters are of the same
importance as differences among letters. If ALTERNATE_HANDLING=SHIFTED
is specified, these variable characters are of minor importance.
Default: NON_IGNORABLE
Tip: The SHIFTED value is often used in combination with STRENGTH=
set to QUATERNARY. In such a case, whitespace characters, punctuation,
and symbols are considered when comparing strings, but only if all
other aspects of the strings (base letters, accents, and case) are
identical.
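The combination described in the tip could look like the sketch below. It assumes that multiple collating rules are separated by blanks inside the parentheses; WORK.TERMS and TERM are hypothetical names.

```sas
/* Hypothetical sample data: the values differ only in punctuation
   and spacing. */
data work.terms;
  length term $ 10;
  input term $ 1-10;
  datalines;
co-op
coop
co op
;
run;

/* SHIFTED makes spaces, punctuation, and symbols of minor importance;
   QUATERNARY lets them break ties when base letters, accents, and
   case are identical. */
proc sort data=work.terms out=work.terms_shifted
          sortseq=linguistic(alternate_handling=shifted strength=quaternary);
  by term;
run;
```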
- CASE_FIRST=
-
specifies the order of uppercase
and lowercase letters. This argument is valid only for the TERTIARY,
QUATERNARY, or IDENTICAL levels. The following table provides the
values and information for the CASE_FIRST argument:
Value | Description
UPPER | Sorts uppercase letters first, then the lowercase letters.
LOWER | Sorts lowercase letters first, then the uppercase letters.
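A short sketch of CASE_FIRST= follows. It uses UPPER, the value that sorts uppercase letters first, and sets STRENGTH=TERTIARY explicitly because the argument applies only at the tertiary level or above. The blank-separated rule list and the names WORK.WORDS and WORD are assumptions made for illustration.

```sas
/* Hypothetical sample data: the values differ only in case. */
data work.words;
  length word $ 8;
  input word $;
  datalines;
apple
Apple
APPLE
;
run;

/* Request that uppercase letters sort before lowercase letters. */
proc sort data=work.words out=work.words_upper
          sortseq=linguistic(strength=tertiary case_first=upper);
  by word;
run;
```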
- COLLATION=
-
The following table
lists the available COLLATION= values. If you do not select a collation
value, then the user's locale-default collation is selected.
Value | Description
BIG5HAN | specifies Pinyin ordering for Latin and specifies big5 charset ordering for Chinese, Japanese, and Korean characters.
DIRECT | specifies a Hindi variant.
GB2312HAN | specifies Pinyin ordering for Latin and specifies gb2312han charset ordering for Chinese, Japanese, and Korean characters.
PHONEBOOK | specifies a telephone-book style for ordering of characters. Select PHONEBOOK only with the German language.
PINYIN | specifies an ordering for Chinese, Japanese, and Korean characters based on character-by-character transliteration into Pinyin. This ordering is typically used with simplified Chinese.
POSIX | is the Portable Operating System Interface. This option specifies a "C" locale ordering of characters.
STROKE | specifies a nonalphabetic writing style ordering of characters. Select STROKE with Chinese, Japanese, Korean, or Vietnamese languages. This ordering is typically used with Traditional Chinese.
TRADITIONAL | specifies a traditional style for ordering of characters. For example, select TRADITIONAL with the Spanish language.
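Because PHONEBOOK applies only to German, a sketch of its use pairs it with a German locale. The locale name de_DE, the blank-separated rule list, and the names WORK.SURNAMES and SURNAME are assumptions; the umlaut in the sample data also assumes a session encoding that can represent it.

```sas
/* Hypothetical sample data with German surnames. */
data work.surnames;
  length surname $ 12;
  input surname $;
  datalines;
Muller
Müller
Mueller
;
run;

/* Telephone-book ordering under German collation rules. */
proc sort data=work.surnames out=work.surnames_phone
          sortseq=linguistic(locale=de_DE collation=phonebook);
  by surname;
run;
```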
- LOCALE=locale_name
-
Restriction: The following locales are not supported by PROC SORT:
  - Afrikaans_SouthAfrica, af_ZA
  - Cornish_UnitedKingdom, kw_GB
  - ManxGaelic_UnitedKingdom, gv_GB
- NUMERIC_COLLATION=
-
orders integer values
within the text by their numeric value instead of by the characters
used to represent the numbers.
Value | Description
ON | Orders numbers by the numeric value. For example, "8 Main St." would sort before "45 Main St.".
OFF | Orders numbers by the character value. For example, "45 Main St." would sort before "8 Main St.".
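The street-address case can be sketched as below, using ON as the value that requests numeric ordering. WORK.ADDRESSES and STREET are hypothetical names.

```sas
/* Hypothetical sample data: addresses with leading house numbers. */
data work.addresses;
  length street $ 20;
  input street $ 1-20;
  datalines;
45 Main St.
8 Main St.
;
run;

/* Compare embedded integers by numeric value, so that
   "8 Main St." is placed before "45 Main St.". */
proc sort data=work.addresses out=work.addresses_num
          sortseq=linguistic(numeric_collation=on);
  by street;
run;
```

Without the rule, the comparison is made character by character, and "45 Main St." is placed first because "4" precedes "8".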
- STRENGTH=
-
The STRENGTH= value sets the collation level. There are five
collation levels, which are described in the following table.
The default strength depends on the locale.
Value | Type of collation | Description
PRIMARY (level 1) | Specifies differences between base characters (for example, "a" < "b"). | It is the strongest difference. For example, dictionaries are divided into different sections by base character.
SECONDARY (level 2) | Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). | A secondary difference is ignored when there is a primary difference anywhere in the strings. Other differences between letters can also be considered secondary differences, depending on the language.
TERTIARY (level 3) | Uppercase and lowercase differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). | A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. Another example is the difference between large and small Kana.
QUATERNARY (level 4) | When punctuation is ignored at levels 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB"). | The quaternary level should be used if ignoring punctuation is required or when processing Japanese text. This difference is ignored when there is a primary, secondary, or tertiary difference.
IDENTICAL (level 5) | When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the Normalization Form D (NFD) form of each string are compared at this level, in case there is no difference at levels 1-4. | This level should be used sparingly, because strings that differ only in their code point values are extremely rare. For example, only Hebrew cantillation marks are distinguished at this level.
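A minimal sketch of a STRENGTH= setting follows. At the PRIMARY level only base characters are compared, so values that differ only in accents or case are treated as equal keys. WORK.DOCS and TITLE are hypothetical names, and the accented sample value assumes a session encoding that can represent it.

```sas
/* Hypothetical sample data: the values differ only in case and accents. */
data work.docs;
  length title $ 10;
  input title $;
  datalines;
Resume
résumé
resume
;
run;

/* Compare base characters only; accent and case differences
   are ignored at this level. */
proc sort data=work.docs out=work.docs_primary
          sortseq=linguistic(strength=primary);
  by title;
run;
```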
CAUTION:
If you use a host sort utility to sort your data, then specifying a
translation-table-based collating sequence with the SORTSEQ= option
might corrupt the character BY variables. For more information, see
the PROC SORT documentation for your operating environment.
Details
The collating sequence
option in the PROC SORT statement determines the order in which
observations in a SAS data set are sorted by one or more character
or numeric variables.
Options
Task | Option
Specify the collating sequence | ASCII, DANISH, EBCDIC, FINNISH, NATIONAL, NORWEGIAN, POLISH, SWEDISH
Specify a customized sequence | SORTSEQ=
With SORTSEQ=, specify any of the collating
sequences listed above (ASCII, EBCDIC, DANISH, FINNISH, ITALIAN, NORWEGIAN,
POLISH, SPANISH, SWEDISH, or NATIONAL), the name of any other
system-provided translation table (POLISH, SPANISH), or the name of a
user-created translation table. You can specify an encoding. You can
also specify either the keyword LINGUISTIC or UCA to achieve a
locale-appropriate collating sequence.