SORT Procedure: UNIX

Sorts observations in a SAS data set by one or more variables, and then stores the resulting sorted observations in a new SAS data set or replaces the original data set.
UNIX specifics: sort utilities available
See: SORT Procedure in Base SAS Procedures Guide

Syntax

PROC SORT <options> <collating-sequence-option>;

Optional Arguments

option
SORTSIZE=memory-specification
specifies the maximum amount of memory available to the SORT procedure. For more information about the SORTSIZE= option, see SORTSIZE= Option.
TAGSORT
stores only the BY variables and the observation numbers in temporary files. The TAGSORT option has no effect on a UNIX host that uses SyncSort.
For more information about the TAGSORT option, see TAGSORT Option.
DETAILS
specifies that PROC SORT write messages to the SAS log detailing whether the sort was performed in memory. (This option is a statement option.)
If the sort was not performed in memory, then the details that are written to the SAS log include the number of utility files that were used and their sizes.
Tip: Using the DETAILS option can help you determine an ideal SORTSIZE value.

Details

The Basics

Note: The syntax shown here is a simplified version of the SORT procedure syntax. For the complete syntax and its explanation, see SORT Procedure in Base SAS Procedures Guide.
The SORT procedure sorts observations in a SAS data set by one or more character or numeric variables, either replacing the original data set or creating a new, sorted data set. By default under UNIX, the SORT procedure uses the ASCII collating sequence.
The SORT procedure uses the sort utility that is specified by the SORTPGM system option. Sorting can be done by SAS or by the syncsort utility. You can use all of the options that are available to the SAS sort utility, such as the SORTSEQ and NODUPKEY options. In some situations, you can improve your performance by using the NOEQUALS option. If you specify an option that is not supported by the host sort, then the SAS sort will be used instead. For more information about all of the options that are available, see SORT Procedure in Base SAS Procedures Guide.
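As a minimal sketch of the options mentioned above, the following steps sort a small, made-up data set. The data set names (work.staff, work.staff_sorted) and the values are hypothetical and are used only for illustration.

```sas
/* Hypothetical sample data; names and values are for illustration only */
data work.staff;
   input name $ dept $ salary;
   datalines;
Lee   ops   52000
Ortiz sales 48000
Lee   ops   53000
Chen  sales 61000
;
run;

/* NODUPKEY keeps one observation for each value of NAME.            */
/* NOEQUALS lets PROC SORT ignore the original relative order of     */
/* observations that have identical BY values.                       */
proc sort data=work.staff out=work.staff_sorted nodupkey noequals;
   by name;
run;
```

Because NOEQUALS is specified, you should not rely on which of the two Lee observations is kept. Omit NOEQUALS if that order matters.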

SORTSIZE= Option

Limiting the Amount of Memory Available to PROC SORT

You can use the SORTSIZE= option in the PROC SORT statement to limit the amount of memory that is available to the SORT procedure. This option can reduce the amount of swapping that SAS must do to sort the data set.
Note: If you do not specify the SORTSIZE= option, PROC SORT uses the value of the SORTSIZE system option. The SORTSIZE system option can be defined in the command line or in the SAS configuration file.
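For example, a PROC SORT step that limits its own sort memory might look like the following sketch. The value 64M and the output data set name are illustrative assumptions, not recommendations. Sashelp.Class is a sample data set that is supplied with SAS.

```sas
/* SORTSIZE= here applies only to this PROC SORT step */
proc sort data=sashelp.class out=work.class_sorted sortsize=64M;
   by age name;
run;
```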

Syntax of the SORTSIZE= Option

The syntax of the SORTSIZE= option is as follows:
SORTSIZE=memory-specification
memory-specification can be one of the following:
n specifies the amount of memory in bytes.
nK specifies the amount of memory in 1-kilobyte multiples.
nM specifies the amount of memory in 1-megabyte multiples.
nG specifies the amount of memory in 1-gigabyte multiples.

Default Value of the SORTSIZE= Option

The default SAS configuration file sets this option based on the value of the SORTSIZE system option. To view the default value for your operating environment, execute the following code:
proc options option=sortsize;
run;
You can override the default value of the SORTSIZE system option in one of the following ways:
  • by specifying a different SORTSIZE= value in the PROC SORT statement
  • by submitting an OPTIONS statement that sets the SORTSIZE system option to a new value
  • by setting the SORTSIZE system option in the command line during the invocation of SAS
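The three methods in the preceding list can be sketched as follows. The value 128M, the output data set name, and the program name myprogram.sas are hypothetical and are used only for illustration.

```sas
/* 1. Procedure option: applies to this PROC SORT step only */
proc sort data=sashelp.class out=work.by_age sortsize=128M;
   by age;
run;

/* 2. OPTIONS statement: applies to subsequent steps in the session */
options sortsize=128M;

/* 3. At invocation (a shell command, shown here as a comment):   */
/*       sas -sortsize 128M myprogram.sas                          */
```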

Improving Performance with the SORTSIZE= Option

The SORTSIZE system option limits the amount of memory that is available to PROC SORT. In general, you should set the SORTSIZE= option to no larger than the amount of memory that is available to the SAS process through the MEMSIZE option.
When the SORTSIZE= value is large enough to hold the entire data set in memory, you can achieve optimal sort performance, provided that your computer has at least that much physical RAM free. If there is not enough free physical RAM, then your computer starts swapping the extra memory pages to disk, which negates the performance gains of sorting in memory.
If the entire data set to be sorted does not fit in the space that is allocated by SORTSIZE, SAS creates a temporary utility file to store the data. In this case, SAS uses a sort algorithm that is tuned to sort using disk space instead of memory. By default, these temporary utility files are placed in the SAS Work location, but you can use the UTILLOC system option to point them to a different file system so that I/O is not impeded.
A SORTSIZE value of up to 512M is generally the optimal value to use if your data set cannot be sorted in memory. You should never set SORTSIZE to a value that is greater than 512M unless you can fit the file in physical RAM for sorting. SORTSIZE should always be set to a value that is at least 8M smaller than MEMSIZE.
Note: You can also use the SORTSIZE system option, which has the same effect as the SORTSIZE= option in the PROC SORT statement.
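One way to check the relationship between MEMSIZE and SORTSIZE before you change either value is sketched below. The value 256M is an illustrative assumption. It presumes that MEMSIZE in your session is at least 264M, which you should confirm from the log output of the first statement.

```sas
/* Write the current settings to the SAS log */
%put MEMSIZE=%sysfunc(getoption(memsize)) SORTSIZE=%sysfunc(getoption(sortsize));

/* Assumed value: keep SORTSIZE at least 8M below the MEMSIZE value reported above */
options sortsize=256M;
```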

TAGSORT Option

The TAGSORT option in the PROC SORT statement is useful when there might not be enough disk space to sort a large SAS data set. When you specify the TAGSORT option, only the sort keys (that is, the variables specified in the BY statement) and the observation number for each observation are stored in the temporary utility files. The sort keys, together with the observation number, are referred to as tags. At the completion of the sorting process, the tags are used to retrieve the records from the input data set in sorted order. Thus, in cases where the total number of bytes of the sort keys is small compared with the length of the record, temporary disk use is reduced considerably.
You must have enough disk space to hold an additional copy of the data set (the output data set) and the utility file that contains the tags. By default, this utility file is stored in the Work library. If this directory is too small, you can change this directory by using the WORK system option. For more information, see WORK System Option: UNIX.
Note: Although using the TAGSORT option might reduce temporary disk use, the processing time could be higher. However, on systems with limited available disk space, the TAGSORT option might enable data sets to be sorted in situations where sorting would otherwise not be possible.
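The following sketch shows the kind of data set for which TAGSORT is intended: a short sort key and a long record. The data set work.wide and its variables are hypothetical, and the data set is far too small for TAGSORT to matter in practice.

```sas
/* Hypothetical data: a 200-byte character variable and a small numeric key */
data work.wide;
   length comment $ 200;
   do id = 5 to 1 by -1;
      comment = repeat('x', 199);
      output;
   end;
run;

/* Only the BY variable (ID) and the observation number go to the utility file */
proc sort data=work.wide out=work.wide_sorted tagsort;
   by id;
run;
```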

Disk Space Considerations for PROC SORT

You need to consider the following information when determining the amount of disk space needed to run PROC SORT:
input SAS data set
PROC SORT uses the SAS input data set specified by the DATA= option.
output SAS data set
PROC SORT stores the output SAS data set in the location that is specified by the OUT= option. If the OUT= option is not specified, PROC SORT replaces the input data set, so the output is written to the library that contains the input data set.
utility file
UTILLOC affects the storage location of the utility file only when the SAS multi-threaded sort is used. The SAS single-threaded sort still stores its utility file in the Work directory. If the UTILLOC system option is not set, the utility file is stored in the Work directory. The utility file is the size of the uncompressed input SAS data file, and includes additional sortkey data that is appended to the front of each record.
Note: You can use the UTILLOC system option to specify a location in which applications can store utility files.
temporary output SAS data set
During the sort, PROC SORT creates its output in the directory that is specified in the OUT= option (or in the directory of the input SAS data set if the OUT= option is not specified). The temporary data set has the same filename as the original data set, except that it has an extension of .lck. After the sort completes successfully, the original data set is deleted, and the temporary data set is renamed to match the original data set. Therefore, you need to have enough available disk space in the target directory to hold two copies of the data set.
You can reduce the amount of disk space that is needed by specifying the OVERWRITE option in the PROC SORT statement. When OVERWRITE is specified, SORT, if possible, deletes the input data set before it attempts to write the replacement output data set. Deleting the input data set first can free storage space. This option should be used only with a data set that is backed up, or with a data set that you can reconstruct. For more information, see SORT Procedure in Base SAS Procedures Guide.
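A minimal sketch of an in-place sort that uses OVERWRITE follows. The step works on a copy (the hypothetical data set work.class_copy) so that the data can be reconstructed from Sashelp.Class if the sort fails.

```sas
/* Create a data set that can be rebuilt if needed */
data work.class_copy;
   set sashelp.class;
run;

/* No OUT= option: the sorted data replaces work.class_copy.        */
/* OVERWRITE lets PROC SORT delete the input, if possible, before   */
/* it writes the replacement data set.                              */
proc sort data=work.class_copy overwrite;
   by name;
run;
```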

Performance Tuning for PROC SORT

How SAS Determines the Amount of Memory to Use

The MEMSIZE system option limits the amount of memory that is available to the SAS process. The SORTSIZE system option limits the amount of memory that is available to PROC SORT. The REALMEMSIZE system option specifies the amount of real (not virtual) memory that will be made available to SAS.
While memory settings below the default values for MEMSIZE and SORTSIZE might adversely affect sorting and SAS performance, making large amounts of memory available might be of no benefit. The key for determining whether additional memory might improve performance is whether the sort will fit in memory. If the sorted file requires more memory than is allocated, then a SORTSIZE value in the range of 64–512M is generally the optimal value. SORTSIZE should always be set to a value that is at least 8M smaller than MEMSIZE.
For information about setting the REALMEMSIZE system option, see REALMEMSIZE System Option: UNIX.
Note: If you receive an out of memory error, then increase the value of MEMSIZE. For more information, see MEMSIZE System Option: UNIX.

Guidelines for Setting the REALMEMSIZE System Option

You can use the REALMEMSIZE system option with PROC SORT to determine how much memory to use. It is important that the REALMEMSIZE value reflects the amount of memory that is available on your system. For optimal performance, the total of the memory settings for all of your applications (including file cache) should never exceed the amount of physical RAM on your computer. The default value for REALMEMSIZE is 80% of the MEMSIZE setting. If REALMEMSIZE is set too high, then PROC SORT might use more memory than is actually available. Using too much memory causes excessive paging and adversely affects system performance.
In general, REALMEMSIZE should be set to the amount of physical memory (not including swap space) that you expect to be available to SAS at run time. A good starting value is the amount of physical memory installed on the computer less the amount that is being used by running applications and the operating system. You can experiment with the REALMEMSIZE value until you reach optimum performance for your environment. In some cases, optimum performance can be achieved with a very low REALMEMSIZE value. A low value could cause SAS to use less memory and leave more memory for the operating system to perform I/O caching.
For more information, see REALMEMSIZE System Option: UNIX.

Using Other Options That Affect Performance

The THREADS system option controls whether threaded procedures will use threads. It is available as both a system option and as a procedural override in PROC SORT.
The CPUCOUNT option is directly related to the THREADS option and defaults to the number of CPUs on your computer. Depending on your file system and the number of concurrent users, you might benefit from lowering the CPUCOUNT on machines that have many CPUs. When the value of CPUCOUNT equals ACTUAL, SAS returns the number of physical CPUs that are associated with the operating environment where SAS is executing.
The UTILLOC system option allows for the spreading of utility files, and is a good option for balancing I/O.
The DETAILS option, specified in the PROC SORT statement, causes PROC SORT to write messages to the SAS log detailing whether the sort was performed in memory. If the sort was not performed in memory, then the details that are written include the number of utility files and their sizes.
For more information about the THREADS, CPUCOUNT, and UTILLOC system options, see SAS System Options: Reference.
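The options in this section might be combined as in the following sketch. The CPUCOUNT value of 4, the output data set name, the directory /disk1/sasutil, and the program name myprogram.sas are assumptions for illustration. UTILLOC is shown at invocation because it is set when SAS starts.

```sas
/* At invocation (a shell command, shown here as a comment):        */
/*    sas -utilloc "/disk1/sasutil" myprogram.sas                   */

/* Allow threaded processing, limited to an assumed 4 CPUs */
options threads cpucount=4;

/* THREADS overrides the system option for this step; DETAILS       */
/* reports in the SAS log whether the sort was performed in memory  */
proc sort data=sashelp.class out=work.class_sorted threads details;
   by height;
run;
```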

Creating Your Own Collating Sequences

If you want to provide your own collating sequences or change a collating sequence provided for you, use the TRANTAB procedure to create or modify translation tables. For more information, see TRANTAB Procedure in SAS National Language Support (NLS): Reference Guide. When you create your own translation tables, they are stored in your Sasuser.Profile catalog, and they override any translation tables by the same name that are stored in the Host catalog.
Note: System managers can modify the Host catalog by copying newly created tables from the Profile catalog to the Host catalog. Then, all users can access the new or modified translation table.
If you are using the SAS windowing environment and want to see the names of the collating sequences that are stored in the Host catalog, issue the following command from any window:
catalog sashelp.host
If you are not using the SAS windowing environment, then issue the following statements to generate a list of the contents in the Host catalog:
proc catalog catalog=sashelp.host;
contents;
run;
Entries of type TRANTAB are the collating sequences.
To see the contents of a particular translation table, use the following statements:
proc trantab table=table-name;
list;
run;
The contents of collating sequences are displayed in the SAS log.

Specifying the Host Sort Utility

Introduction to Using the Host Sort

SAS supports one host sort utility on UNIX called syncsort. You can use this sorting application as an alternative sorting algorithm to the SAS sort. SAS determines which sort to use by the values that are set for the SORTNAME, SORTPGM, SORTCUT, and SORTCUTP system options.

Setting the Host Sort Utility as the Sort Algorithm

To specify a host sort utility as the sort algorithm, complete the following steps:
  1. Specify the name of the host utility (syncsort) in the SORTNAME system option.
  2. Set the SORTPGM system option to tell SAS when to use the host sort utility.
    • If you specify SORTPGM=HOST, then SAS always prefers to use the host sort utility.
    • If you specify SORTPGM=BEST, then SAS chooses the best sorting method (either the SAS sort or the host sort) for the situation.
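The two steps above can be sketched as a single OPTIONS statement. This sketch assumes that syncsort is installed on your host and that your site permits these options to be set in an OPTIONS statement. Otherwise, set them in the configuration file or at invocation.

```sas
/* Step 1: name the host sort utility; Step 2: always prefer it */
options sortname=syncsort sortpgm=host;

/* Confirm the settings in the SAS log */
%put SORTNAME=%sysfunc(getoption(sortname)) SORTPGM=%sysfunc(getoption(sortpgm));
```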

Sorting Based on Size or Observations

The sort routine that SAS uses can be based on either the number of observations in a data set, or on the size of the data set. When the SORTPGM system option is set to BEST, SAS uses the first available and pertinent sorting algorithm based on the following order of precedence:
  • host sort utility
  • SAS sort utility
The SORTCUT system option is based on the number of observations in a data set. The SORTCUTP system option is based on the size of the data set. SAS looks at the values for the SORTCUT and SORTCUTP system options to determine which sort routine to use. If the number of observations is greater than or equal to the value of SORTCUT, SAS uses the host sort utility. If the number of bytes in a data set is greater than the value of SORTCUTP, SAS uses the host sort utility.
If SORTCUT and SORTCUTP are set to zero, SAS uses the SAS sort utility. If you specify both system options, and either condition is met, SAS uses the host sort utility.
When the following OPTIONS statement is in effect, the host sort utility (syncsort) is used when the number of observations is 500 or greater:
options sortpgm=best sortcut=500;
In this example, the host sort utility is used when the size of the data set is greater than 40M:
options sortpgm=best sortcutp=40M;
For more information about these sort options, see SORTCUT System Option: UNIX, SORTCUTP System Option: UNIX, and SORTPGM System Option: UNIX.
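The two thresholds can also be set together, as in the following sketch, which reuses the values from the examples above. Because either condition is sufficient, the host sort utility is used for a data set that has 500 or more observations or that is larger than 40M.

```sas
/* Either threshold being met selects the host sort utility */
options sortpgm=best sortcut=500 sortcutp=40M;
```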

Changing the Location of Temporary Files Used by the Host Sort Utility

By default, the host sort utilities use the location that is specified in the -WORK option for temporary files. To change the location of these temporary files, specify a location by using the SORTDEV system option. Here is an example:
options sortdev="/tmp/host";
For more information, see SORTDEV System Option: UNIX.

Passing Options to the Host Sort Utility

To specify options for the sort utility, use the SORTANOM system option. For a list of valid options, see SORTANOM System Option: UNIX.

Passing Parameters to the Host Sort Utility

To pass parameters to the sort utility, use the SORTPARM system option. The parameters that you can specify depend on the host sort utility. For more information, see SORTPARM System Option: UNIX.

Specifying the SORTSEQ= Option with a Host Sort Utility

The SORTSEQ= option enables you to specify the collating sequence for your sort. For a list of valid values, see SORT Procedure in Base SAS Procedures Guide.
CAUTION:
If you are using a host sort utility to sort your data, then specifying the SORTSEQ= option might corrupt the character BY variables if the sort sequence translation table and its inverse are not one-to-one mappings.
In other words, for the sort to work, the translation table must map each character to a unique weight, and the inverse table must map each weight to a unique character.
If your translation tables do not map one-to-one, then you can use one of the following methods to perform your sort:
  • Create a translation table that maps one-to-one. Once you create a translation table that maps one-to-one, you can easily create a corresponding inverse table using the TRANTAB procedure. If your translation table is not mapped one-to-one, then you will receive the following note in the SAS log when you try to create an inverse table:
    NOTE:  This table cannot be mapped one to one.
    For more information, see TRANTAB Procedure in SAS National Language Support (NLS): Reference Guide.
  • Use the SAS sort. You can specify the SAS sort using the SORTPGM system option. For more information, see SORTPGM System Option: UNIX.
  • Specify the collation order options of your host sort utility. See the documentation for your host sort utility for more information.
  • Create a view with a single BY variable. For an example, see Creating a View with a Single BY Variable.
Note: After using one of these methods, you might need to perform subsequent BY processing using either the NOTSORTED option or the NOBYSORTED system option. For more information about the NOTSORTED option, see BY Statement in SAS Statements: Reference. For more information about the NOBYSORTED system option, see BYSORTED System Option in SAS System Options: Reference.
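As a sketch of subsequent BY processing with the NOTSORTED option, the following steps process groups that are contiguous but not in ascending order. The data set work.grouped and its values are hypothetical.

```sas
/* Hypothetical data: the REGION groups are contiguous but not in sorted order */
data work.grouped;
   input region $ sales;
   datalines;
west 10
west 20
east 5
east 7
;
run;

/* NOTSORTED tells SAS to process each run of identical BY values as a group */
proc means data=work.grouped sum;
   by region notsorted;
   var sales;
run;
```

Without NOTSORTED, this BY statement would fail with an error, because west precedes east in the data.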

Example: Creating a View with a Single BY Variable

The following example shows how to create a view by using a single BY variable. SAS uses the BEST argument in the SORTPGM system option to sort the data. By using BEST, SAS selects either the host sort or the SAS sort. (Sorting can also be performed by a DBMS when you use a SAS/ACCESS engine.)
options sortpgm=best msglevel=i;

data one;
   input name $ age;
   datalines;
Anne 35
ALBERT 10
JUAN 90
Janet 5
Bridget 23
BRIAN 45
;

data oneview / view=oneview;
   set one;
   name1=upcase(name);
run;

proc sort data=oneview out=final(drop=name1);
   by name1;
run;

proc print data=final;
run;
Log Output
NOTE: SAS threaded sort was used.
Output from Creating a View with a Single BY Variable