The SORT Procedure

Concepts: SORT Procedure


Multi-threaded Sorting

The SAS system option THREADS permits multi-threaded sorting, which is new with SAS System 9. Multi-threaded sorting achieves a degree of parallelism in the sorting operations. This parallelism is intended to reduce the real (elapsed) time to completion for a given operation, possibly at the cost of additional CPU resources. For more information, see the section on "Support for Parallel Processing" in SAS Language Reference: Concepts.

Note: The TAGSORT option does not support multi-threaded sorting.

The performance of the multi-threaded sort is affected by the value of the SAS system option CPUCOUNT=, which suggests how many system CPUs are available for use by multi-threaded procedures.

For more information about THREADS and CPUCOUNT=, see the chapter on SAS system options in SAS Language Reference: Dictionary.
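
As a minimal sketch, the two system options described above might be set before a sort as follows. The data set WORK.SALES, its variables, and the CPUCOUNT= value of 4 are hypothetical and chosen only for illustration.

```sas
/* Hypothetical input data; any data set would do. */
data work.sales;
   input region $ amount;
   datalines;
West 200
East 150
West 50
East 300
;
run;

/* Allow threaded processing and suggest that 4 CPUs are available. */
/* The value 4 is an arbitrary example, not a recommendation.       */
options threads cpucount=4;

/* Do not add TAGSORT here: it does not support multi-threaded sorting. */
proc sort data=work.sales out=work.sales_sorted;
   by region amount;
run;
```

Whether threading actually shortens the elapsed time depends on the data and the machine, so it is worth comparing the real time and CPU time reported in the log.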


Sorting Orders for Numeric Variables

For numeric variables, the smallest-to-largest comparison sequence is

  1. SAS missing values (shown as a period or special missing value)

  2. negative numeric values

  3. zero

  4. positive numeric values.
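
The sequence above can be checked with a small test data set. This is only a sketch; WORK.NUMS and the variable X are hypothetical names.

```sas
/* Hypothetical data: one missing, one negative, one zero, one positive value. */
data work.nums;
   input x;
   datalines;
3
.
-2
0
;
run;

/* Default ascending sort on a numeric variable. */
proc sort data=work.nums;
   by x;
run;

/* Prints the values in the order: missing (.), -2, 0, 3 */
proc print data=work.nums;
run;
```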


Sorting Orders for Character Variables


Default Collating Sequence

The order in which alphanumeric characters are sorted is known as the collating sequence. This sort order is determined by the session encoding.

By default, PROC SORT uses either the EBCDIC or the ASCII collating sequence when it compares character values, depending on the environment under which the procedure is running.

Refer to the Collating Sequence chapter of the SAS National Language Support (NLS): Reference Guide for detailed information about the various collating sequences and when they are used.

Note: ASCII and EBCDIC represent the family names of the session encodings. The sort order can be determined by referring to the encoding.


EBCDIC Order

The z/OS operating environment uses the EBCDIC collating sequence.

The sorting order of the English-language EBCDIC sequence is consistent with the following sort order example.

blank . < ( + | & ! $ * ) ; ¬ - / , % _ > ? : # @ ' = "
a b c d e f g h i j k l m n o p q r ~ s t u v w x y z
{ A B C D E F G H I } J K L M N O P Q R \ S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9

The main features of the EBCDIC sequence are that lowercase letters are sorted before uppercase letters, and uppercase letters are sorted before digits. Note also that some special characters interrupt the alphabetic sequences. The blank is the smallest character that you can display.


ASCII Order

The operating environments that use the ASCII collating sequence include UNIX and its derivatives, OpenVMS, and Windows.

From the smallest to the largest character that you can display, the English-language ASCII sequence is consistent with the order shown in the following:

blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
a b c d e f g h i j k l m n o p q r s t u v w x y z { } ~

The main features of the ASCII sequence are that digits are sorted before uppercase letters, and uppercase letters are sorted before lowercase letters. The blank is the smallest character that you can display.


Specifying Sorting Orders for Character Variables

The options EBCDIC, ASCII, NATIONAL, DANISH, SWEDISH, and REVERSE specify collating sequences that are stored in the HOST catalog.
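
The following sketch contrasts two of these options on the same input. WORK.CHARS and the variable C are hypothetical names, and the three values are chosen so that each falls into a different class (lowercase letter, uppercase letter, digit).

```sas
/* Hypothetical data: one lowercase letter, one uppercase letter, one digit. */
data work.chars;
   input c $;
   datalines;
a
A
1
;
run;

/* ASCII sequence: digits, then uppercase, then lowercase. */
/* Resulting order of C: 1 A a                             */
proc sort data=work.chars out=work.chars_ascii ascii;
   by c;
run;

/* EBCDIC sequence: lowercase, then uppercase, then digits. */
/* Resulting order of C: a A 1                              */
proc sort data=work.chars out=work.chars_ebcdic ebcdic;
   by c;
run;
```

Writing each result to its own OUT= data set leaves the input untouched, which makes the two orders easy to compare side by side.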

If you want to provide your own collating sequences or change a collating sequence provided for you, then use the TRANTAB procedure to create or modify translation tables. For complete details, see the TRANTAB procedure in SAS National Language Support (NLS): Reference Guide. When you create your own translation tables, they are stored in your PROFILE catalog, and they override any translation tables that have the same name in the HOST catalog.

Linguistic collation, which sorts data according to the rules of a language, is supported in SAS System 9.2. Refer to the Collating Sequence chapter in SAS National Language Support (NLS): Reference Guide for detailed information about linguistic collation.
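
As a hedged sketch, linguistic collation can be requested through the SORTSEQ= option of the PROC SORT statement. WORK.NAMES and its values are hypothetical; the resulting order depends on the session locale, and the NLS reference guide describes the suboptions that refine it.

```sas
/* Hypothetical data with mixed-case names. */
data work.names;
   input name $;
   datalines;
bob
Alice
alice
Bob
;
run;

/* Request linguistic collation instead of the default binary collation. */
/* The exact output order is locale-dependent and is not asserted here.  */
proc sort data=work.names out=work.names_ling sortseq=linguistic;
   by name;
run;
```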

Note: System managers can modify the HOST catalog by copying newly created tables from the PROFILE catalog to the HOST catalog. Then all users can access the new or modified translation table.


Stored Sort Information

PROC SORT records the BY variables, collating sequence, and character set that it uses to sort the data set. This information is stored with the data set to help avoid unnecessary sorts.

Before PROC SORT sorts a data set, it checks the stored sort information. If you try to sort a data set the way that it is currently sorted, then PROC SORT does not perform the sort and writes a message to the log to that effect. To override this behavior, use the FORCE option. If you try to sort a data set the way that it is currently sorted and you specify an OUT= data set, then PROC SORT simply makes a copy of the DATA= data set.
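
The behavior described above can be illustrated with a short sequence of steps. This is a sketch with a hypothetical data set, WORK.SALES; the exact wording of the log messages is not reproduced here.

```sas
/* Hypothetical unsorted input. */
data work.sales;
   input region $ amount;
   datalines;
West 200
East 150
West 50
;
run;

/* First request: the data set is sorted and the sort */
/* information is stored with it.                     */
proc sort data=work.sales;
   by region;
run;

/* Same request again: PROC SORT does not perform the */
/* sort and writes a message to the log.              */
proc sort data=work.sales;
   by region;
run;

/* FORCE overrides that check and sorts anyway. */
proc sort data=work.sales force;
   by region;
run;

/* Already sorted, with OUT=: the input is copied to the output data set. */
proc sort data=work.sales out=work.sales_copy;
   by region;
run;
```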

To override the sort information that PROC SORT stores, use the _NULL_ value with the SORTEDBY= data set option. For more information about SORTEDBY=, see the chapter on SAS data set options in SAS Language Reference: Dictionary.

If you want to change the sort information for an existing data set, then use the SORTEDBY= data set option in the MODIFY statement in the DATASETS procedure. For more information, see MODIFY Statement.

To access the sort information that is stored with a data set, use the CONTENTS statement in PROC DATASETS. For more information, see CONTENTS Statement.
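
The PROC DATASETS statements mentioned above might be combined as in the following sketch. WORK.SALES is a hypothetical data set whose observations are already in order by REGION; the SORTEDBY= value is an assertion made by the user, so it should only be set when the data really is in that order.

```sas
/* Hypothetical data, already in order by region. */
data work.sales;
   input region $ amount;
   datalines;
East 150
West 50
West 200
;
run;

proc datasets library=work nolist;
   /* Record that the data set is sorted by region. */
   modify sales (sortedby=region);
run;

   /* Display the descriptor, including the stored sort information. */
   contents data=sales;
run;

   /* Remove the stored sort information again. */
   modify sales (sortedby=_null_);
run;
quit;
```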


Presorted Input Data Sets

The PRESORTED option was added to the PROC SORT statement in SAS 9.2. Specifying the PRESORTED option prevents SAS from sorting a data set that is already sorted. Use the PRESORTED option when you know or strongly suspect that a data set is already in order according to the key variables specified in the BY statement. Before sorting, SAS checks the sequence of observations within the input data set to determine whether the observations are in order. It does so by reading the data set and comparing the BY variables of each observation to the BY variables of the preceding observation. This process continues until either the entire data set has been read or an out-of-sequence observation is detected.

If the entire data set has been read and no out-of-sequence observations have been found, then one of two actions is taken. If no output data set has been specified, the sort order metadata of the input data set is updated to indicate that the sequence has been verified. This verification notes that the data set is validly sorted according to the specified BY variables. Otherwise, if the observation sequence has been verified and an output data set is specified, the observations from the input data set are copied to the output data set, and the metadata for the output data set indicates that the data is validly sorted according to the BY variables.

If observations within the data set are not in sequence, then the data set will be sorted.
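
The two outcomes can be sketched with a pair of hypothetical data sets: WORK.ORDERED, whose observations are already in sequence, and WORK.UNORDERED, whose observations are not.

```sas
/* Hypothetical data that is already in order by id. */
data work.ordered;
   input id value;
   datalines;
1 10
2 20
3 30
;
run;

/* Hypothetical data that is out of sequence. */
data work.unordered;
   input id value;
   datalines;
3 30
1 10
2 20
;
run;

/* Sequence check succeeds: no sort is performed and the sort */
/* metadata of work.ordered is updated.                       */
proc sort data=work.ordered presorted;
   by id;
run;

/* Sequence check fails at the second observation, so the */
/* data set is sorted as usual.                           */
proc sort data=work.unordered out=work.now_ordered presorted;
   by id;
run;
```

The check costs one sequential read of the input, which is why the option pays off mainly when the data is likely to be in order already.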

If the NODUPKEY option has been specified, then the sequence checking determines whether observations with duplicate keys are present in the data set. Otherwise, if the NODUPRECS option has been specified, then the sequence checking determines whether there are adjacent duplicate observations. The input data set is deemed not to be sorted if the NODUPKEY option is specified and observations with duplicate keys are detected. Likewise, the input data set is deemed not to be sorted if the NODUPRECS option is specified and adjacent duplicate observations are detected.
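
A minimal sketch of the NODUPKEY case follows. WORK.IDS is a hypothetical data set that is in key order but contains a duplicate key, so the sequence check treats it as not sorted.

```sas
/* Hypothetical data: in order by id, but id=2 appears twice. */
data work.ids;
   input id value;
   datalines;
1 10
2 20
2 25
3 30
;
run;

/* The duplicate key means the input is deemed not to be sorted,     */
/* so PROC SORT sorts it and NODUPKEY removes the duplicate key.     */
/* The output data set contains three observations, one per id.      */
proc sort data=work.ids out=work.ids_unique presorted nodupkey;
   by id;
run;
```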

If the metadata of the input data set indicates that the data is already sorted according to the key variables listed in the BY statement and the input data set has been validated, then neither sequence checking nor sorting will be performed.

For more information, see "Sorted Data Sets" in SAS Language Reference: Concepts. For interactions with the SORTVALIDATE system option, see SAS Language Reference: Dictionary.
