The collating sequence is the order in which characters
are sorted. For example, when the SORT procedure is executed, the
collating sequence determines the sort order (higher, lower, or equal
to) of a particular character in relation to other characters.
The default collating
sequence is binary collation, which sorts characters according to
each character's location in the code page of the session encoding.
(The session encoding is the default encoding for a SAS session. The
default encoding can be specified by using various SAS language elements.)
The sort order corresponds directly to the arrangement of the code
points within the code page. The two single-byte character encoding
methods that data processing uses most widely are ASCII and EBCDIC.
The OpenVMS, UNIX, and Windows operating environments use ASCII encodings;
IBM mainframe computers use EBCDIC encodings.
Binary collation is
the fastest type of collation because it compares code point values
directly, with no additional lookup or weighting. However, locating
characters within a binary-collated report might be difficult if you
are not familiar with this method.
For example, a binary-collated report lists words beginning with uppercase
characters separately from words beginning with lowercase characters,
and words beginning with accented characters after words beginning
with unaccented characters. Therefore, for ASCII-based encodings,
the capital letter Z precedes the lowercase letter a. Similarly, for
EBCDIC-based encodings, the lowercase letter z precedes the capital
letter A.
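The effect of the code page on binary collation can be sketched outside SAS. The following Python illustration is not SAS code; it assumes that the latin-1 codec stands in for an ASCII-based encoding and that the cp500 codec stands in for an EBCDIC-based encoding. The word list and the binary_sort helper are hypothetical names used only for this example.

```python
# Binary collation sketch: sort by the byte values that each encoding assigns.
# latin-1 stands in for an ASCII-based encoding and cp500 for an EBCDIC-based
# one; both choices are assumptions made for this illustration.
words = ["aardvark", "Zeus", "zebra", "Aaron"]

def binary_sort(items, encoding):
    # Encode each word and compare the resulting byte strings directly.
    return sorted(items, key=lambda w: w.encode(encoding))

ascii_order = binary_sort(words, "latin-1")
ebcdic_order = binary_sort(words, "cp500")

print(ascii_order)   # uppercase words are grouped before lowercase words
print(ebcdic_order)  # lowercase words are grouped before uppercase words
```

The same four words come out in a different order under each encoding because only the code points differ, not the comparison rule.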
You can request an alternate
collating sequence that overrides the binary collation. To request
an alternate collating sequence, specify one of the following sequences:
Results of Different Collating Sequences
The first column shows
the results of binary collation on characters that are represented
in an ASCII-based encoding. The alphabetization is not consistent
because of the separate grouping of words that begin with uppercase
and lowercase characters. For example, the word Zeus
appears before aardvark because
of the code points that are assigned to the characters within the
ASCII-based encoding.
The second column shows
the results of specifying a translation table that alternates the
ordering of lowercase and uppercase characters. If you use the translation
table, the word aardvark appears
before Zeus. However, the word
azimuth appears before Aaron
because the translation table assigns a weight value to the lowercase
character a that is less than the weight value of the uppercase
character A.
In addition, accents are sorted from left to right. For example, coté
comes before côte.
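A translation table of this kind can be approximated with a simple weight lookup. The following Python sketch is an illustration only and does not reproduce any translation table that SAS supplies. The ORDER string, the WEIGHT mapping, and the table_key function are hypothetical, and the sketch assumes that characters missing from the table (such as accented letters) sort after every table entry.

```python
import string

# Hypothetical weight table: a A b B c C ... z Z
# (lowercase immediately before uppercase for each letter).
ORDER = "".join(lo + up for lo, up in
                zip(string.ascii_lowercase, string.ascii_uppercase))
WEIGHT = {ch: i for i, ch in enumerate(ORDER)}

def table_key(word):
    # Compare words character by character, left to right, using the table
    # weights. Characters that are not in the table (an assumption of this
    # sketch) sort after all table entries, ordered by code point.
    return [WEIGHT.get(ch, len(ORDER) + ord(ch)) for ch in word]

words = ["Zeus", "Aaron", "azimuth", "aardvark", "côte", "coté"]
table_sorted = sorted(words, key=table_key)
print(table_sorted)
```

Because the comparison stops at the first differing character, azimuth sorts before Aaron (a weighs less than A), and coté sorts before côte (the unaccented o is reached before the accented é).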
The third column shows
the results of specifying the ASCII-based, single-byte latin1 encoding.
The last column shows
the results of linguistic collation for the session locale fr_FR (French_France),
which uses a collation algorithm to alphabetize words. The algorithm
specifies that words beginning with lowercase characters appear before
words beginning with uppercase characters. In addition, this linguistic
collation sorts accents from right to left because of the French locale
specification.
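The difference between left-to-right and right-to-left accent ordering can be sketched with a two-level sort key. The following Python illustration is a simplification and is not the ICU or UCA algorithm: it compares unaccented base letters first and accents second, and it ignores case differences. The accent_key function and its right_to_left parameter are hypothetical names used only for this example.

```python
import unicodedata

def accent_key(word, right_to_left):
    # Split each character into its base letter and its combining accents.
    decomposed = unicodedata.normalize("NFD", word)
    base = []      # unaccented letters, compared first
    accents = []   # accents attached to each letter, compared second
    for ch in decomposed:
        if unicodedata.combining(ch) and accents:
            accents[-1] += ch
        else:
            base.append(ch.lower())
            accents.append("")
    if right_to_left:
        # French-style ordering: the last accent difference decides.
        accents.reverse()
    return ("".join(base), accents)

words = ["côté", "coté", "cote", "côte"]
left_to_right = sorted(words, key=lambda w: accent_key(w, False))
french_style = sorted(words, key=lambda w: accent_key(w, True))
print(left_to_right)
print(french_style)
```

With left-to-right accents, coté precedes côte; with right-to-left accents, côte precedes coté, because the comparison starts from the final letter.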
SAS has adopted the
International Components for Unicode (ICU) to implement linguistic
collation. The ICU and its implementation of the Unicode Collation
Algorithm (UCA) have become a standard. By default, the linguistic
collating sequence is the one that the ICU provides for the specified
locale.