SORT Procedure

PROC SORT Statement

Orders SAS data set observations by the values of one or more character or numeric variables.
Sorting by the Values of Multiple Variables

Maintaining the Relative Order of Observations in Each BY Group

Retaining the First Observation of Each BY Group

Syntax

Summary of Optional Arguments

specifies ASCII.
specifies the input data set.
sorts a SAS data set without changing the created and modified dates.
forces redundant sorting.
deletes the input data set before the replacement output data set is populated.
specifies whether the data set is likely already sorted.
reverses the collation order for character variables.
specifies the available memory.
reduces temporary disk usage.
Create output data sets
specifies the output data set to which duplicate observations are written.
specifies the output data set.
specifies the output data set for eliminated observations.
Eliminate duplicate observations
deletes observations with duplicate BY values.
Eliminate unique observations
eliminates observations from the output data set that have a unique sort key.
Override SAS system option THREADS
prevents multi-threaded sorting.
enables or prevents the activation of multi-threaded sorting.
Specify the collating sequence
specifies Danish.
specifies EBCDIC.
specifies Finnish.
specifies a customized sequence.
specifies Norwegian.
specifies Polish.
specifies the collating sequence.
specifies Swedish.
Specify the output order
specifies the relative order within BY groups.
does not maintain relative order within BY groups.

Collating-Sequence-Options

Operating Environment Information: For information about behavior specific to your operating environment for the DANISH, FINNISH, NORWEGIAN, or SWEDISH collating-sequence-option, see the SAS documentation for your operating environment.

You can specify only one collating-sequence-option and multiple other options in a PROC SORT step. The order of the two types of options does not matter and both types are not necessary in the same PROC SORT step.

DANISH
sorts characters according to the Danish and Norwegian convention.
The Danish and Norwegian collating sequence is shown in National Collating Sequences of Alphanumeric Characters.
ASCII
sorts character variables using the ASCII collating sequence. You need this option only when you want to achieve an ASCII ordering on a system where EBCDIC is the native collating sequence.
EBCDIC
sorts character variables using the EBCDIC collating sequence. You need this option only when you want to achieve an EBCDIC ordering on a system where ASCII is the native collating sequence.
POLISH
sorts characters according to the Polish convention.
FINNISH
sorts characters according to the Finnish and Swedish convention. The Finnish and Swedish collating sequence is shown in National Collating Sequences of Alphanumeric Characters.
NATIONAL
sorts character variables using an alternate collating sequence, as defined by your installation, to reflect a country's National Use Differences. To use this option, your site must have a customized national sort sequence defined. Check with the SAS Installation Representative at your site to determine whether a customized national sort sequence is available.
NORWEGIAN
See DANISH.
SWEDISH
See FINNISH.
SORTSEQ= collating-sequence
specifies any of the listed collating sequences (ASCII, EBCDIC, DANISH, FINNISH, ITALIAN, NORWEGIAN, POLISH, SPANISH, SWEDISH, or NATIONAL), the name of any other system provided translation table (POLISH, SPANISH), and the name of a user-created translation table. You can specify an encoding. You can also specify either the keyword LINGUISTIC or UCA to achieve a locale appropriate collating sequence. The collating-sequence can be a collating-sequence-option, a translation table, an encoding, or the keyword LINGUISTIC. Only one collating sequence can be specified. For detailed information, refer to Collating Sequence in SAS National Language Support (NLS): Reference Guide.
Here are descriptions of the collating sequences:
collating-sequence-option | translation_table
specifies either a translation table, which can be one that SAS provides or any user-defined translation table, or one of the PROC SORT statement collating-sequence-options. Refer to the PROC SORT statement Collating-Sequence-Options. For an example of using PROC TRANTAB and PROC SORT with SORTSEQ=, see Using Different Translation Tables for Sorting in SAS National Language Support (NLS): Reference Guide.
These are the available translation tables:
  • ASCII
  • DANISH
  • EBCDIC
  • FINNISH
  • ITALIAN
  • NORWEGIAN
  • POLISH
  • REVERSE
  • SPANISH
  • SWEDISH
The following figure shows how the alphanumeric characters in each language will sort.
National Collating Sequences of Alphanumeric Characters
National Collating Sequences of Alphanumeric Characters
Restriction:You can specify only one collating-sequence-option in a PROC SORT step.
Interaction:In-database processing will not occur when the SORTSEQ= option is specified.
Tips:The SORTSEQ= collating-sequence options are specified without parenthesis and have no arguments associated with them. An example of how to specify a collating sequence follows:

proc sort data=mydata SORTSEQ=ASCII;

encoding-value
specifies an encoding value. The result is the same as a binary collation of the character data represented in the specified encoding. See the supported encoding value in the SAS National Language Support (NLS): Reference Guide.
Restriction:PROC SORT is the only procedure or part of the SAS system that recognizes an encoding specified for the SORTSEQ= option.
Tip:When the encoding value contains a character other than an alphanumeric character or underscore, the value needs to be enclosed in quotation marks.
See:The list of the encodings that can be specified in the SAS National Language Support (NLS): Reference Guide.
LINGUISTIC<(collating-rules )>
specifies linguistic collation, which sorts characters according to rules of the specified language. The rules and default collating-sequence options are based on the language specified in the current locale setting. The implementation is provided by the International Components for Unicode (ICU) library. It produces results that are largely compatible with the Unicode Collation Algorithms (UCA).
The following are the collation-rules that can be specified for the LINGUISTIC option. These rules modify the linguistic collating sequence:
ALTERNATE_HANDLING=SHIFTED
controls the handling of variable characters like spaces, punctuation, and symbols. When this option is not specified (using the default value Non-Ignorable), differences among these variable characters are of the same importance as differences among letters. If the ALTERNATE_HANDLING option is specified, these variable characters are of minor importance.
Default:NON_IGNORABLE
Tip:The SHIFTED value is often used in combination with STRENGTH= set to Quaternary. In such a case, spaces, punctuation, and symbols are considered when comparing strings, but only if all other aspects of the strings (base letters, accents, and case) are identical.
CASE_FIRST=
specifies the order of uppercase and lowercase letters. This argument is valid for only TERTIARY, QUATERNARY, or IDENTICAL levels. The following table provides the values and information for the CASE_FIRST argument:
Arguments for CASE_FIRST=
Value
Description
UPPER
Sorts uppercase letters first, then the lowercase letters.
LOWER
Sorts lowercase letters first, then the uppercase letters.
COLLATION=
specifies character ordering. The following table lists the available COLLATION= values.
Note: If you do not select a collation value, then the user's locale-default collation is selected.
Values for COLLATION=
Value
Description
BIG5HAN
specifies Pinyin ordering for Latin and specifies bug5 charset ordering for Chinese, Japanese, and Korean characters..
DIRECT
specifies a Hindi variant.
GB21312HAN
specifies Pinyin ordering for Latin and specifies gb2312han charset ordering for Chinese, Japanese, and Korean characters.
PHONEBOOK
specifies a telephone-book style for ordering of characters. Select PHONEBOOK only with the German language.
PINYIN
specifies an ordering for Chinese, Japanese, and Korean characters based on character-by-character transliteration into Pinyin. This ordering is typically used with simplified Chinese.
POSIX
is the Portable Operating System Interface. This option specifies a “C” locale ordering of characters.
STROKE
specifies a nonalphabetic writing style ordering of characters. Select STROKE with Chinese, Japanese, Korean, or Vietnamese languages. This ordering is typically used with Traditional Chinese.
TRADITIONAL
specifies a traditional style for ordering of characters. For example, select TRADITIONAL with the Spanish Language.
LOCALE= locale_name
specifies the locale name in the form of a POSIX name (for example, ja_JP). For a list of locale and POSIX values supported by PROC SORT, see LOCALE= Values and Default Settings for ENCODING, PAPERSIZE, DFLANG, and DATESTYLE Options in SAS National Language Support (NLS): Reference Guide.
Restriction:The following Locales are not supported by PROC SORT:
  • Afrikaans_SouthAfrica, af_ZA
  • Cornish_UnitedKingdom, kw_GB
  • ManxGaelic_UnitedKingdom, gv_GB
NUMERIC_COLLATION=
orders integer values within the text by the numeric value instead of characters used to represent the numbers.
Values for NUMERIC_COLLATION
Value
Description
ON
Order numbers by the numeric value. For example, "8 Main St." would sort before "45 Main St.".
OFF
Order numbers by the character value. For example, "45 Main St." would sort before "8 Main St.".
Default:OFF
STRENGTH=
The value of strength is related to the collation level. There are five collation-level values. The following table provides information about the five levels. The default value for strength is related to the locale.
Values for STRENGTH=
Value
Type of Collation
Description
PRIMARY or 1
PRIMARY specifies differences between base characters (for example, "a" < "b").
It is the strongest difference. For example, dictionaries are divided into different sections by base character.
SECONDARY or 2
Accents in the characters are considered secondary differences (for example, "as" < "às" < "at").
A secondary difference is ignored when there is a primary difference anywhere in the strings. Other differences between letters can also be considered secondary differences, depending on the language.
TERTIARY or 3
Upper and lowercase differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò").
A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. Another example is the difference between large and small Kana.
QUATERNARY or 4
When punctuation is ignored at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB").
The quaternary level should be used if ignoring punctuation is required or when processing Japanese text. This difference is ignored when there is a primary, secondary, or tertiary difference.
IDENTICAL or 5
When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the Normalization Form D (NFD) form of each string are compared at this level, just in case there is no difference at levels 1-4.
This level should be used sparingly, because code-point value differences between two strings rarely occur. For example, only Hebrew cantillation marks are distinguished at this level.
Alias:LEVEL=
Alias:UCA
Restrictions:The SORTSEQ=LINGUISTIC option is available only on the PROC SORT SORTSEQ= option and is not available for the SAS System SORTSEQ= option.

Linguistic collation is not supported on platforms VMS on Itanium (VMI) or 64-bit Windows on Itanium (W64).

Tips:The collating-rules must be enclosed in parentheses. More than one collating rule can be specified.

When BY processing is performed on data sets that are sorted with linguistic collation, the NOBYSORTED system option might need to be specified in order for the data set to be treated properly. BY processing is performed differently than collating sequence processing.

The section on Specifying Linguistic Collation in SAS National Language Support (NLS): Reference Guide for detailed information.

The http://www.unicode.org Web site for the Unicode Collation Algorithm (UCA) specification.

CAUTION:
If you use a host sort utility to sort your data, then specifying a translation-table-based collating sequence with the SORTSEQ= option might corrupt the character BY variables.
For more information, see the PROC SORT documentation for your operating environment.

Other Options

Options can include one collating-sequence-option and multiple other options. The order of the two types of options does not matter and both types are not necessary in the same PROC SORT step.

DATA= SAS-data-set
identifies the input SAS data set.
Restriction:For in-database processing to occur, it is necessary that the data set refer to a table residing on the DBMS.
DATECOPY
copies the SAS internal date and time when the SAS data set was created and the date and time when it was last modified before the sort to the resulting sorted data set. Note that the operating environment date and time are not preserved.
Restriction:DATECOPY can be used only when the resulting data set uses the V8 or V9 engine.
Tip:You can alter the file creation date and time with the DTC= option in the MODIFY statement in PROC DATASETS. For more information, see MODIFY Statement.
DUPOUT= SAS-data-set
specifies the output data set to which duplicate observations are written.
Interactions:In-database processing does not occur when the DUPOUT= option is specified.

The DUPOUT= and UNIQUEOUT= options are not compatible and cannot be specified simultaneously.

Tips:The DUPOUT= option can be used only with the NODUPKEY option. It cannot be combined with the NOUNIQUEKEY option.

If the DUPOUT= data set name that is specified is the same as the INPUT data set name, SAS will not sort or overwrite the INPUT data set. Instead, SAS will generate an error message. The FORCE option must be specified in order to overwrite the INPUT data set with the DUPOUT= data set of the same name.

EQUALS | NOEQUALS
specifies the order of the observations in the output data set. For observations with identical BY-variable values, EQUALS maintains the relative order of the observations within the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set.
Default:EQUALS
Interactions:When you use NODUPKEY to remove observations in the output data set, the choice of EQUALS or NOEQUALS can affect which observations are removed.

The EQUALS | NOEQUALS procedure option overrides the default sort stability behavior that is established with the SORTEQUALS | NOSORTEQUALS system option.

The EQUALS option is supported by the multi-threaded sort. However, I/O performance might be reduced when using the EQUALS option with the multi-threaded sort because partitioned data sets are processed as if they consist of a single partition.

The NOEQUALS option is supported by the multi-threaded sort. The order of observations within BY groups that are returned by the multi-threaded sort might not be consistent between runs.

Tip:Using NOEQUALS can save CPU time and memory.
FORCE
sorts and replaces an indexed data set when the OUT= option is not specified. Without the FORCE option, PROC SORT does not sort and replace an indexed data set because sorting destroys user-created indexes for the data set. When you specify FORCE, PROC SORT sorts and replaces the data set and destroys all user-created indexes for the data set. Indexes that were created or required by integrity constraints are preserved.
Restriction:If you use PROC SORT with the FORCE option on data sets that were created with the Version 5 compatibility engine or with a sequential engine such as a tape format engine, you must also specify the OUT= option.
Tip:PROC SORT checks for the sort indicator before it sorts a data set so that data is not sorted again unnecessarily. By default, PROC SORT does not sort a data set if the sort information matches the requested sort. You can use FORCE to override this behavior. You might need to use FORCE if SAS cannot verify the sort specification in the data set option SORTEDBY=. For more information about SORTEDBY= , see the chapter on SAS data set options in SAS Data Set Options: Reference.
NODUPKEY
checks for and eliminates observations with duplicate BY values. If you specify this option, then PROC SORT compares all BY values for each observation to the ones for the previous observation that is written to the output data set. If an exact match is found, then the observation is not written to the output data set.
Operating Environment Information: If you use the VMS operating environment and are using the VMS host sort, the observation that is written to the output data set is not always the first observation of the BY group.
Interactions:When you are removing observations with duplicate BY values with NODUPKEY, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.

In-database sorting occurs when the NODUPKEY option is specified and the system option SQLGENERATION= is assigned a DBMS and the system option SORTPGM=BEST.

Options NODUPKEY and NOUNIQUEKEY are not compatible. If these options are specified together, an error will be printed to the SAS log.

Tips:Use the EQUALS option with the NODUPKEY option for consistent results in your output data sets.

The DUPOUT= option can be used with the NODUPKEY option. It cannot be combined with the NOUNIQUEKEY option.

NOEQUALS
See EQUALS | NOEQUALS.
NOTHREADS
See THREADS | NOTHREADS.
NOUNIQUEKEY
checks for and eliminates observations from the output data set that have a unique sort key. A sort key is unique when the observation containing the key is the only observation within a BY group. An observation has a unique sort key when it is the only observation within a BY group.
Note: Unlike NODUPKEY, which writes one observation of a BY group to the output data set and discards all other observations from the BY group, the NOUNIQUEKEY maintains BY group integrity. Either all observations of a BY group are written to the output data set when the BY group consists of two or more observations, or all observations of the BY group are discarded when the BY group consists of a single observation.
Alias:NOUNIKEY | NOUNIKEYS | NOUNIQUEKEYS
Interaction:Options NODUPKEY and NOUNIQUEKEY are not compatible. If NODUPKEY and NOUNIQUEKEY are specified together, an error will be printed to the SAS log.
Tip:The UNIQUEOUT= option can be used with the NOUNIQUEKEY option. It cannot be combined with the NODUPKEY option.
See:UNIQUEOUT= to direct the observations that have been eliminated to an output data set.
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION:
Use care when you use PROC SORT without OUT=.
Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default:Without OUT=, PROC SORT overwrites the original data set.
Tips:With in-database sorts, the output data set cannot refer to the input table on the DBMS.

You can use data set options with OUT=.

OVERWRITE
enables the input data set to be deleted before the replacement output data set of the same name is populated with observations.
CAUTION:
Use the OVERWRITE option only with a data set that is backed up or with a data set that you can reconstruct.
Because the input data set is deleted, data will be lost if a failure occurs while the output data set is being written.
Restrictions:If the OVERWRITE and the OUT= options are specified and the OUT= data set name is not the same as the INPUT data set name, SAS will not overwrite the INPUT data set.

The OVERWRITE option has no effect if you also specify the TAGSORT option. You cannot overwrite the input data set because TAGSORT must reread the input data set while populating the output data set.

The OVERWRITE option is supported by the SAS sort and SAS multi-threaded sort only. The option has no effect if you are using a host sort.

Tip:Using the OVERWRITE option can reduce disk space requirements.
PRESORTED
before sorting, checks within the input data set to determine whether the sequence of observations are in order. Use the PRESORTED option when you know or strongly suspect that a data set is already in order according to the key variables that are specified in the BY statement. By specifying this option, you avoid the cost of sorting the data set.
Interaction:Sequence checking is not performed when the FORCE option is specified.
Tips:You can use the DATA step to import data, from external text files, in a sequence compatible with SAS processing and according to the sort order specified by the combination of SORT options and key variables listed in the BY statement. You can then specify the PRESORTED option if you know or highly suspect that the data is sorted accordingly.

Using the PRESORTED option with ACCESS engines and DBMS data is not recommended. These external databases are not guaranteed to return observations in sorted order unless an ORDER BY clause is specified in a query. Generally, physical ordering is not a concept that external databases use. Therefore, these databases are not guaranteed to return observations in the same order when executing a query multiple times. Physical order can be important for producing consistent, repeatable results when processing data. Without a repeatable data retrieval order, PROC SORT does not guarantee the return of observations in the same order from one PROC SORT execution to another, even when the EQUALS | NOEQUALS option is used to request sort stability. Without a repeatable retrieval order, the detection and elimination of adjacent duplicate records by PROC SORT can also vary from one PROC SORT execution to another.

See:System option SORTVALIDATE System Option in SAS System Options: Reference.
REVERSE
sorts character variables using a collating sequence that is reversed from the normal collating sequence.
Operating Environment Information: For information about the normal collating sequence for your operating environment, see EBCDIC Order, ASCII Order, and the SAS documentation for your operating environment.
Restriction:The REVERSE option cannot be used with a collating-sequence-option. You can specify either a collating-sequence-option or the REVERSE option in PROC SORT, but you cannot specify both.
Interaction:Using REVERSE with the DESCENDING option in the BY statement restores the sequence to the normal order.
See:The DESCENDING option in the BY statement. The difference is that the DESCENDING option can be used with both character and numeric variables.
SORTSIZE=memory-specification
specifies the maximum amount of memory that is available to PROC SORT. Valid values for memory-specification are as follows:
MAX
specifies that all available memory can be used.
n
specifies the amount of memory in bytes, where n is a real number.
nK
specifies the amount of memory in kilobytes, where n is a real number.
nM
specifies the amount of memory in megabytes, where n is a real number.
nG
specifies the amount of memory in gigabytes, where n is a real number.
Specifying the SORTSIZE= option in the PROC SORT statement temporarily overrides the SAS system option. For more information, see SORTSIZE= System Option in SAS System Options: Reference.
Operating Environment Information: Some system sort utilities might treat this option differently. Refer to the SAS documentation for your operating environment.
Alias:SIZE=
Default:the value of the SAS system option SORTSIZE=
Tips:Setting the SORTSIZE= option in the PROC SORT statement to MAX or 0, or not setting the SORTSIZE= option, limits the PROC SORT to the available physical memory based on the settings of the SAS system options REALMEMSIZE and MEMSIZE.

For information about the SAS system options REALMEMSIZE and MEMSIZE, see the SAS documentation for your operating environment.

TAGSORT
stores only the BY variables and the observation numbers in temporary files. The BY variables and the observation numbers are called tags. At the completion of the sorting process, PROC SORT uses the tags to retrieve records from the input data set in sorted order.
Note: The utility file created is much smaller than it would be if the TAGSORT option were not specified.
Restriction:The TAGSORT option is not compatible with the OVERWRITE option.
Interaction:The TAGSORT option is not supported by the multi-threaded sort.
Tip:When the total length of BY variables is small compared with the record length, TAGSORT reduces temporary disk usage considerably. However, processing time might be much higher.
THREADS | NOTHREADS
enables or prevents the activation of multi-threaded sorting.
Default:the value of the THREADS | NOTHREADS SAS system option. Note that the default can be overridden using the procedure THREADS | NOTHREADS option.
Restrictions:Your site administrator can create a restricted options table. A restricted options table specifies SAS system option values that are established at start-up and cannot be overridden. If the THREADS | NOTHREADS system option is listed in the restricted options table, any attempt to set these system options is ignored and a warning message is written to the SAS log.

If a failure occurs when adding the THREADS | NOTHREADS procedure option using the SPD engine, PROC SORT stops processing and writes a message to the SAS log.

Interactions:The PROC SORT THREADS | NOTHREADS options override the SAS system THREADS | NOTHREADS options unless the system option is restricted. (See Restriction.) For more information, see THREADS System Option in SAS System Options: Reference.

The THREADS system option is honored if PROC SORT determines that multi-threaded processing is deemed to be beneficial. If the value of the SAS system option CPUCOUNT=1, then multi-threaded processing is not beneficial. However, you can specify the PROC SORT THREADS option to force multi-threaded processing when the system option is set to NOTHREADS or when the system option is THREADS and the procedure option is NOTHREADS. This option combination prevents multi-threaded processing and overrides the actions taken that are based on the system options. Note that when multi-threaded sorting is in effect and NOEQUALS is specified, observations within BY groups might be returned in an unpredictable order.

If multi-threaded SAS sort is being used, the UTILLOC= system option will affect the placement of utility files. Thread-enabled SAS applications are able to create temporary files that can be accessed in parallel by separate threads. For more information, see UTILLOC= .

The TAGSORT option is not supported by the multi-threaded sort. Specifying the TAGSORT option will prevent multi-threaded processing.

See:Multi-threaded Sorting and Support for Parallel Processing in SAS Language Reference: Concepts.
UNIQUEOUT= SAS-data-set
specifies the output data set for observations eliminated by the NOUNIQUEKEY option.
Alias:UNIOUT=
Interaction:The DUPOUT= and UNIOUT= options are not compatible and cannot be specified simultaneously.
Tip:The UNIQUEOUT= option can be used with the NOUNIQUEKEY option. It cannot be combined with the NODUPKEY option.