Previous Page | Next Page

The SORT Procedure

PROC SORT Statement


PROC SORT <collating-sequence-option> <other option(s)>;

Task Option
Specify the collating sequence

Specify ASCII ASCII

Specify EBCDIC EBCDIC

Specify Danish DANISH

Specify Finnish FINNISH

Specify Norwegian NORWEGIAN

Specify Polish POLISH

Specify Swedish SWEDISH

Specify a customized sequence NATIONAL

Specify any of the collating sequences listed above (ASCII, EBCDIC, DANISH, FINNISH, ITALIAN, NORWEGIAN, POLISH, SPANISH, SWEDISH, or NATIONAL), the name of any other system provided translation table (POLISH, SPANISH), and the name of a user-created translation table. You can specify an encoding. You can also specify either the keyword LINGUISTIC or UCA to achieve a locale appropriate collating sequence. SORTSEQ=
Specify the input data set DATA=
Sort a SAS data set without changing the created and modified dates DATECOPY
Create output data sets

Specifies the output data set OUT=

Specifies the output data set to which duplicate observations are written DUPOUT=
Specify the output order

Maintain relative order within BY groups EQUALS

Do not maintain relative order within BY groups NOEQUALS
Eliminate duplicate observations

Delete observations with duplicate BY values NODUPKEY

Delete duplicate observations NODUPRECS
Delete the input data set before the replacement output data set is populated. OVERWRITE
Specify whether or not the data set is likely already sorted. PRESORTED

Reverse the collation order for character variables REVERSE
Specify the available memory SORTSIZE=
Force redundant sorting FORCE
Reduce temporary disk usage TAGSORT
Override SAS system option THREADS

Enable multi-threaded sorting THREADS

Prevent multi-threaded sorting NOTHREADS


Options

Options can include one collating-sequence-option and multiple other options. The order of the two types of options does not matter and both types are not necessary in the same PROC SORT step.

Collating-Sequence-Options

Operating Environment Information:   For information about behavior specific to your operating environment for the DANISH, FINNISH, NORWEGIAN, or SWEDISH collating-sequence-option, see the SAS documentation for your operating environment.  [cautionend]

Restriction: You can specify only one collating-sequence-option in a PROC SORT step.
ASCII

sorts character variables using the ASCII collating sequence. You need this option only when you want to achieve an ASCII ordering on a system where EBCDIC is the native collating sequence.

See also: Sorting Orders for Character Variables
DANISH
NORWEGIAN

sorts characters according to the Danish and Norwegian convention.

The Danish and Norwegian collating sequence is shown in National Collating Sequences of Alphanumeric Characters.

EBCDIC

sorts character variables using the EBCDIC collating sequence. You need this option only when you want to achieve an EBCDIC ordering on a system where ASCII is the native collating sequence.

See also: Sorting Orders for Character Variables
POLISH

sorts characters according to the Polish convention.

FINNISH
SWEDISH

sorts characters according to the Finnish and Swedish convention. The Finnish and Swedish collating sequence is shown in National Collating Sequences of Alphanumeric Characters.

NATIONAL

sorts character variables using an alternate collating sequence, as defined by your installation, to reflect a country's National Use Differences. To use this option, your site must have a customized national sort sequence defined. Check with the SAS Installation Representative at your site to determine whether a customized national sort sequence is available.

NORWEGIAN

See DANISH.

SWEDISH

See FINNISH.

SORTSEQ= collating-sequence

specifies the collating sequence. The collating-sequence can be a collating-sequence-option, a translation table, an encoding, or the keyword LINGUISTIC. Only one collating sequence can be specified. For detailed information, refer to the Collating Sequence section in the SAS National Language Support (NLS): Reference Guide.

Here are descriptions of the collating sequences:

collating--sequence--option | translation_table

specifies either a translation table, which can be one that SAS provides or any user-defined translation table, or one of the PROC SORT statement Collating-Sequence-Options. For an example of using PROC TRANTAB and PROC SORT with SORTSEQ=, see Using Different Translation Tables for Sorting in SAS National Language Support (NLS): Reference Guide.

The available translation tables are

ASCII

DANISH

EBCDIC

FINNISH

ITALIAN

NORWEGIAN

POLISH

REVERSE

SPANISH

SWEDISH

The following figure shows how the alphanumeric characters in each language will sort.

National Collating Sequences of Alphanumeric Characters

[National Collating Sequences of Alphanumeric Characters]

Restriction: You can specify only one collating-sequence-option in a PROC SORT step.
Interaction: In-database processing will not occur when the SORTSEQ= option is specified.
Tip: The SORTSEQ= collating-sequence options are specified without parenthesis and have no arguments associated with them. An example of how to specify a collating sequence follows:

proc sort data=mydata SORTSEQ=ASCII;
encoding-value

specifies an encoding value. The result is the same as a binary collation of the character data represented in the specified encoding. See the supported encoding values in the SAS National Language Support (NLS): Reference Guide.

Restriction: PROC SORT is the only procedure or part of the SAS system that recognizes an encoding specified for the SORTSEQ= option.
Tip: When the encoding value contains a character other than an alphanumeric character or underscore, the value needs to be enclosed in quotation marks.
See: The list of the encodings that can be specified in the SAS National Language Support (NLS): Reference Guide.
LINGUISTIC<(collating-rules )>

specifies linguistic collation, which sorts characters according to rules of the specified language. The rules and default collating-sequence options are based on the language specified in the current locale setting. The implementation is provided by the International Components for Unicode (ICU) library. It produces results that are largely compatible with the Unicode Collation Algorithms (UCA).

Alias: UCA
Restriction: The SORTSEQ=LINGUISTIC option is available only on the PROC SORT SORTSEQ= option and is not available for the SAS System SORTSEQ= option.
Restriction Note that linguistic collation is not supported on platforms VMS on Itanium (VMI) or 64-bit Windows on Itanium (W64).
Tip: The collating-rules must be enclosed in parentheses. More than one collating rule can be specified.
Tip: When BY processing is performed on data sets that are sorted with linguistic collation, the NOBYSORTED system option might need to be specified in order for the data set to be treated properly. BY processing is performed differently than collating sequence processing.
See: The ICU License agreement.
See: The section on Linguistic Collation in the SAS National Language Support (NLS): Reference Guide for detailed information.
See Also: Refer to http://www.unicode.org Web site for the Unicode Collation Algorithm (UCA) specification.

The following are the collation-rules that can be specified for the LINGUISTIC option. These rules modify the linguistic collating sequence:

ALTERNATE_HANDLING=SHIFTED

controls the handling of variable characters like spaces, punctuation, and symbols. When this option is not specified (using the default value Non-Ignorable), differences among these variable characters are of the same importance as differences among letters. If the ALTERNATE_HANDLING option is specified, these variable characters are of minor importance.

Default: NON_IGNORABLE
Tip: The SHIFTED value is often used in combination with STRENGTH= set to Quaternary. In such a case, spaces, punctuation, and symbols are considered when comparing strings, but only if all other aspects of the strings (base letters, accents, and case) are identical.
CASE_FIRST=

specifies the order of uppercase and lowercase letters. This argument is valid for only TERTIARY, QUATERNARY, or IDENTICAL levels. The following table provides the values and information for the CASE_FIRST argument:

Value Description
UPPER Sorts uppercase letters first, then the lowercase letters.
LOWER Sorts lowercase letters first, then the uppercase letters.

COLLATION=

The following table lists the available COLLATION= values: If you do not select a collation value, then the user's locale-default collation is selected.

Value Description
UPPER Sorts uppercase letters first, then the lowercase letters.
LOWER Sorts lowercase letters first, then the uppercase letters.

LOCALE= locale_name

specifies the locale name in the form of a POSIX name (for example, ja_JP). See the Values for the LOCALE= System Option in SAS National Language Support (NLS): Reference Guide for a list of locale and POSIX values supported by PROC SORT.

Restriction: The following locales are not supported by PROC SORT:

Afrikaans_SouthAfrica, af_ZA

Cornish_UnitedKingdom, kw_GB

ManxGaelic_UnitedKingdom, gv_GB

NUMERIC_COLLATION=

orders integer values within the text by the numeric value instead of characters used to represent the numbers.

Value Description
ON Order numbers by the numeric value. For example, "8 Main St." would sort before "45 Main St.".
OFF Order numbers by the character value. For example, "45 Main St." would sort before "8 Main St.".

Default: OFF
STRENGTH=

The value of strength is related to the collation level. There are five collation-level values. The following table provides information about the five levels. The default value for strength is related to the locale.

Value Type of Collation Description
PRIMARY or 1 PRIMARY specifies differences between base characters (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are divided into different sections by base character.
SECONDARY or 2 Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). A secondary difference is ignored when there is a primary difference anywhere in the strings. Other differences between letters can also be considered secondary differences, depending on the language.
TERTIARY or 3 Upper and lowercase differences in characters are distinguished at the tertiary level (for example, "ao" < "Ao" < "aò"). A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings. Another example is the difference between large and small Kana.
QUATERNARY or 4 When punctuation is ignored at level 1-3, an additional level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB"). The quaternary level should be used if ignoring punctuation is required or when processing Japanese text. This difference is ignored when there is a primary, secondary, or tertiary difference.
IDENTICAL or 5 When all other levels are equal, the identical level is used as a tiebreaker. The Unicode code point values of the Normalization Form D (NFD) form of each string are compared at this level, just in case there is no difference at levels 1-4. This level should be used sparingly, because code-point value differences between two strings rarely occur. For example, only Hebrew cantillation marks are distinguished at this level.

Alias: LEVEL=
CAUTION:
If you use a host sort utility to sort your data, then specifying a translation-table-based collating sequence with the SORTSEQ= option might corrupt the character BY variables.

For more information, see the PROC SORT documentation for your operating environment.  [cautionend]

Other Options

DATA= SAS-data-set

identifies the input SAS data set.

Restriction: For in-database processing to occur, it is necessary that the data set refer to a table residing on the DBMS.
Main discussion: Input Data Sets
DATECOPY

copies the SAS internal date and time when the SAS data set was created and the date and time when it was last modified before the sort to the resulting sorted data set. Note that the operating environment date and time are not preserved.

Restriction: DATECOPY can be used only when the resulting data set uses the V8 or V9 engine.
Tip: You can alter the file creation date and time with the DTC= option in the MODIFY statement in PROC DATASETS. For more information, see MODIFY Statement.
DUPOUT= SAS-data-set

specifies the output data set to which duplicate observations are written.

Interaction: In-database processing does not occur when the DUPOUT= option is specified.
Tip: If the DUPOUT= data set name that is specified is the same as the INPUT data set name, SAS will not sort or overwrite the INPUT data set. Instead, SAS will generate an error message. The FORCE option must be specified in order to overwrite the INPUT data set with the DUPOUT= data set of the same name.
EQUALS | NOEQUALS

specifies the order of the observations in the output data set. For observations with identical BY-variable values, EQUALS maintains the relative order of the observations within the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set.

Default: EQUALS
Interaction: When you use NODUPRECS or NODUPKEY to remove observations in the output data set, the choice of EQUALS or NOEQUALS can affect which observations are removed.
Interaction: The EQUALS | NOEQUALS procedure option overrides the default sort stability behavior that is established with the SORTEQUALS | NOSORTEQUALS system option.
Interaction: The EQUALS option is supported by the multi-threaded sort. However, I/O performance might be reduced when using the EQUALS option with the multi-threaded sort because partitioned data sets will be processed as if they consist of a single partition.
Interaction: The NOEQUALS option is supported by the multi-threaded sort. The order of observations within BY groups that are returned by the multi-threaded sort might not be consistent between runs.
Tip: Using NOEQUALS can save CPU time and memory.
FORCE

sorts and replaces an indexed data set when the OUT= option is not specified. Without the FORCE option, PROC SORT does not sort and replace an indexed data set because sorting destroys user-created indexes for the data set. When you specify FORCE, PROC SORT sorts and replaces the data set and destroys all user-created indexes for the data set. Indexes that were created or required by integrity constraints are preserved.

Restriction: If you use PROC SORT with the FORCE option on data sets that were created with the Version 5 compatibility engine or with a sequential engine such as a tape format engine, you must also specify the OUT= option.
Tip: PROC SORT checks for the sort indicator before it sorts a data set so that data is not sorted again unnecessarily. By default, PROC SORT does not sort a data set if the sort information matches the requested sort. You can use FORCE to override this behavior. You might need to use FORCE if SAS cannot verify the sort specification in the data set option SORTEDBY=. For more information about SORTEDBY=, see the chapter on SAS data set options in SAS Language Reference: Dictionary.
NODUPKEY

checks for and eliminates observations with duplicate BY values. If you specify this option, then PROC SORT compares all BY values for each observation to the ones for the previous observation that is written to the output data set. If an exact match is found, then the observation is not written to the output data set.

Operating Environment Information:   If you use the VMS operating environment and are using the VMS host sort, the observation that is written to the output data set is not always the first observation of the BY group.  [cautionend]

Note:   See NODUPRECS for information about eliminating duplicate observations.  [cautionend]

Interaction: When you are removing observations with duplicate BY values with NODUPKEY, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.
Interaction: In-database sorting occurs when the NODUPKEY option is specified and the system option SQLGENERATION= is assigned a DBMS and the system option SORTPGM=BEST.
Tip: Use the EQUALS option with the NODUPKEY option for consistent results in your output data sets.
Featured in: Retaining the First Observation of Each BY Group
NODUPRECS

checks for and eliminates duplicate observations. If you specify this option, then PROC SORT compares all variable values for each observation to the ones for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

Note:   See NODUPKEY for information about eliminating observations with duplicate BY values.  [cautionend]

Alias : NODUP
Interaction: When you are removing consecutive duplicate observations in the output data set with NODUPRECS, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.
Interaction: The action of NODUPRECS is directly related to the setting of the SORTDUP= system option. When SORTDUP= is set to LOGICAL, NODUPRECS removes duplicate observations based on the examination of the variables that remain after a DROP or KEEP operation on the input data set. Setting SORTDUP=LOGICAL increases the number of duplicate observations that are removed, because it eliminates variables before observation comparisons take place. Also, setting SORTDUP=LOGICAL can improve performance, because dropping variables before sorting reduces the amount of memory required to perform the sort. When SORTDUP= is set to PHYSICAL, NODUPRECS examines all variables in the data set, regardless of whether they have been kept or dropped. For more information about SORTDUP=, see the chapter on SAS system options in SAS Language Reference: Dictionary.
Interaction: In-database processing does not occur when the NODUPRECS option is specified. However, if the NODUPRECS and NODUPKEY options are specified, system option SQLGENERATION= set for in-database processing, and system option SORTPGM=BEST, the NODUPRECS option is ignored and in-database processing does occur.
Tip: Use the EQUALS option with the NODUPRECS option for consistent results in your output data sets.
Tip: Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables.
NOEQUALS

See EQUALS | NOEQUALS.

NOTHREADS

See THREADS|NOTHREADS.

OUT= SAS-data-set

names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.

CAUTION:
Use care when you use PROC SORT without OUT=.

Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.   [cautionend]

Default: Without OUT=, PROC SORT overwrites the original data set.
Tip: With in-database sorts, the output data set cannot refer to the input table on the DBMS.
Tip: You can use data set options with OUT=.
Featured in: Sorting by the Values of Multiple Variables
OVERWRITE

enables the input data set to be deleted before the replacement output data set is populated with observations.

Restriction: The OVERWIRTE option has no effect when an OUT= data set is specified.
Restriction: The OVERWRITE option has no effect if you also specify the TAGSORT option. You cannot overwrite the input data set because TAGSORT must reread the input data set while populating the output data set.
Restriction: The OVERWRITE option is supported by the SAS sort and SAS multi-threaded sort only. The option has no effect if you are using a host sort.
Tip: Using the OVERWRITE option can reduce disk space requirements.
CAUTION:
Use the OVERWRITE option only with a data set that is backed up or with a data set that you can reconstruct.

Because the input data set is deleted, data will be lost if a failure occurs while the output data set is being written.   [cautionend]

PRESORTED

before sorting, checks within the input data set to determine whether the sequence of observations are in order. Use the PRESORTED option when you know or strongly suspect that a data set is already in order according to the key variables that are specified in the BY statement. By specifying this option, you avoid the cost of sorting the data set.

Note:   See the NODUPRECS option for information about eliminating duplicate observations.  [cautionend]

Interaction: Sequence checking is not performed when the FORCE option is specified.
Interaction: When the NODUPRECS option has been specified and one or more variables have been dropped from the input data set, then the SORTDUP system option setting will affect the detection of adjacent duplicate observations.
Tip: You can use the DATA step to import data, from external text files, in a sequence compatible with SAS processing and according to the sort order specified by the combination of SORT options and key variables listed in the BY statement. You can then specify the PRESORTED option if you know or highly suspect that the data is sorted accordingly.
Tip: Using the PRESORTED option with ACCESS engines and DBMS data is not recommended. These external databases are not guaranteed to return observations in sorted order unless an ORDER BY clause is specified in a query. Generally, physical ordering is not a concept that external databases use. Therefore, these databases are not guaranteed to return observations in the same order when executing a query multiple times. Physical order can be important for producing consistent, repeatable results when processing data. Without a repeatable data retrieval order, PROC SORT does not guarantee the return of observations in the same order from one PROC SORT execution to another, even when the EQUALS option is used to request sort stability. Without a repeatable retrieval order, the detection and elimination of adjacent duplicate records (requested by specifying the NODUPRECS option) by PROC SORT can also vary from one PROC SORT execution to another.
See also: System option SORTVALIDATE in SAS Language Reference: Dictionary
REVERSE

sorts character variables using a collating sequence that is reversed from the normal collating sequence.

Operating Environment Information:   For information about the normal collating sequence for your operating environment, see EBCDIC Order, ASCII Order, and the SAS documentation for your operating environment.  [cautionend]

Restriction: The REVERSE option cannot be used with a collating-sequence-option. You can specify either a collating-sequence-option or the REVERSE option in a PROC SORT, but you cannot specify both.
Interaction: Using REVERSE with the DESCENDING option in the BY statement restores the sequence to the normal order.
See also: The DESCENDING option in the BY statement. The difference is that the DESCENDING option can be used with both character and numeric variables.
SORTSIZE=memory-specification

specifies the maximum amount of memory that is available to PROC SORT. Valid values for memory-specification are as follows:

MAX

specifies that all available memory can be used.

n

specifies the amount of memory in bytes, where n is a real number.

nK

specifies the amount of memory in kilobytes, where n is a real number.

nM

specifies the amount of memory in megabytes, where n is a real number.

nG

specifies the amount of memory in gigabytes, where n is a real number.

Specifying the SORTSIZE= option in the PROC SORT statement temporarily overrides the SAS system option SORTSIZE=. For more information about SORTSIZE=, see the chapter on SAS system options in SAS Language Reference: Dictionary.

Operating Environment Information:   Some system sort utilities might treat this option differently. Refer to the SAS documentation for your operating environment.  [cautionend]

Alias: SIZE=
Default: the value of the SAS system option SORTSIZE=
Tip: Setting the SORTSIZE= option in the PROC SORT statement to MAX or 0, or not setting the SORTSIZE= option, limits the PROC SORT to the available physical memory based on the settings of the SAS system options REALMEMSIZE and MEMSIZE. .

Operating Environment Information:   For information about the SAS system options REALMEMSIZE and MEMSIZE, see the SAS documentation for your operating environment.  [cautionend]

TAGSORT

stores only the BY variables and the observation numbers in temporary files. The BY variables and the observation numbers are called tags. At the completion of the sorting process, PROC SORT uses the tags to retrieve records from the input data set in sorted order.

Note:    The utility file created is much smaller than it would be if the TAGSORT option were not specified.   [cautionend]

Restriction: The TAGSORT option is not compatible with the OVERWRITE option.
Interaction: The TAGSORT option is not supported by the multi-threaded sort.
Tip: When the total length of BY variables is small compared with the record length, TAGSORT reduces temporary disk usage considerably. However, processing time might be much higher.
THREADS | NOTHREADS

enables or prevents the activation of multi-threaded sorting.

Default: the value of the THREADS | NOTHREADS SAS system option.

Note:   The default can be overridden using the procedure THREADS | NOTHREADS option.   [cautionend]

Restriction: Your site administrator can create a restricted options table. A restricted options table specifies SAS system option values that are established at startup and cannot be overridden. If the THREADS | NOTHREADS system option is listed in the restricted options table, any attempt to set these system options is ignored and a warning message is written to the SAS log.
Restriction: If a failure occurs when adding the THREADS | NOTHREADS procedure option using the SPD engine, PROC SORT stops processing and writes a message to the SAS log.
Interaction: The PROC SORT THREADS | NOTHREADS options override the SAS system THREADS | NOTHREADS options unless the system option is restricted. (See Restriction.) For more information about THREADS, see the chapter on SAS system options in SAS Language Reference: Dictionary.
Interaction: The THREADS system option is honored if PROC SORT determines that multi-threaded processing is deemed to be beneficial. If the value of the SAS system option CPUCOUNT=1, then multi-threaded processing is not beneficial. However, you can specify the PROC SORT THREADS option to force multi-threaded processing when the system option is set to NOTHREADS or when the system option is THREADS and the procedure option is NOTHREADS. This option combination prevents multi-threaded processing and overrides the actions taken that are based on the system options.

Note:   When multi-threaded sorting is in effect and NOEQUALS is specified, observations within BY groups might be returned in an unpredictable order.   [cautionend]

Interaction: If multi-threaded SAS sort is being used, the UTILLOC= system option will affect the placement of utility files. Thread-enabled SAS applications are able to create temporary files that can be accessed in parallel by separate threads. For more information about the UTILLOC= system option, see the chapter on SAS system options in SAS Language Reference: Dictionary.
Interaction: The TAGSORT option is not supported by the multi-threaded sort. Specifying the TAGSORT option will prevent multi-threaded processing.
See also: Multi-threaded Sorting and Support for Parallel Processing in SAS Language Reference: Concepts.

Previous Page | Next Page | Top of Page