Procedures under UNIX

SORT Procedure: UNIX



Sorts observations in a SAS data set by one or more variables, and then stores the resulting sorted observations in a new SAS data set or replaces the original data set.
UNIX specifics: sort utilities available
See: SORT Procedure in Base SAS Procedures Guide

Syntax
Details
SORTSIZE= Option
Limiting the Amount of Memory Available to PROC SORT
Syntax of the SORTSIZE= Option
Default Value of the SORTSIZE= Option
Improving Performance with the SORTSIZE= Option
TAGSORT Option
Disk Space Considerations for PROC SORT
Performance Tuning for PROC SORT
How SAS Determines the Amount of Memory to Use
Guidelines for Setting the REALMEMSIZE System Option
Using Other Options that Affect Performance
Creating Your Own Collating Sequences
Specifying the Host Sort Utility
Introduction to Using the Host Sort
Setting the Host Sort Utility as the Sort Algorithm
Sorting Based on Size or Observations
Changing the Location of Temporary Files Used by the Host Sort Utility
Passing Options to the Host Sort Utility
Passing Parameters to the Host Sort Utility
Specifying the SORTSEQ= Option with a Host Sort Utility
Example: Creating a View with a Dummy BY Variable
See Also

Syntax

PROC SORT <option(s)> <collating-sequence-option>;

Note: This is a simplified version of the SORT procedure syntax. For the complete syntax and its explanation, see the SORT procedure in the Base SAS Procedures Guide.

option
SORTSIZE=memory-specification

specifies the maximum amount of memory available to the SORT procedure. For more information about the SORTSIZE= option, see Details.

TAGSORT

stores only the BY variables and the observation numbers in temporary files. The TAGSORT option has no effect on a UNIX host that uses SyncSort.

For more information about the TAGSORT option, see Details.

DETAILS

specifies that PROC SORT write messages to the SAS log detailing whether the sort was performed in memory. (This option is a statement option.)

If the sort was not performed in memory, then the details that are written to the SAS log include the number of utility files that were used and their sizes.

Tip: Using the DETAILS option can help determine an ideal SORTSIZE value.

Details

The SORT procedure sorts observations in a SAS data set by one or more character or numeric variables, either replacing the original data set or creating a new, sorted data set. By default under UNIX, the SORT procedure uses the ASCII collating sequence.

The SORT procedure uses the sort utility that is specified by the SORTPGM system option. Sorting can be done by SAS or by the syncsort utility. You can use all of the options that are available to the SAS sort utility, such as the SORTSEQ and NODUPKEY options. In some situations, you can improve your performance by using the NOEQUALS option. If you specify an option that is not supported by the host sort, then the SAS sort will be used instead. For more information about all of the options that are available, see the SORT procedure in the Base SAS Procedures Guide.
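
As a minimal sketch of one of the options mentioned above, the following step requests NOEQUALS, which allows PROC SORT to skip preserving the original relative order of observations that have identical BY values. The data set names (work.orders, work.orders_sorted) and the BY variable (customer_id) are hypothetical.

/* Hypothetical data set and variable names, for illustration only */
proc sort data=work.orders out=work.orders_sorted noequals;
   by customer_id;
run;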


SORTSIZE= Option


Limiting the Amount of Memory Available to PROC SORT

You can use the SORTSIZE= option in the PROC SORT statement to limit the amount of memory that is available to the SORT procedure. This option can reduce the amount of swapping SAS must do to sort the data set.

Note: If you do not specify the SORTSIZE= option, PROC SORT uses the value of the SORTSIZE system option. The SORTSIZE system option can be defined in the command line or in the SAS configuration file.


Syntax of the SORTSIZE= Option

The syntax of the SORTSIZE= option is as follows:

SORTSIZE=memory-specification

memory-specification can be one of the following:

n

specifies the amount of memory in bytes.

nK

specifies the amount of memory in 1-kilobyte multiples.

nM

specifies the amount of memory in 1-megabyte multiples.

nG

specifies the amount of memory in 1-gigabyte multiples.
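
For illustration, the following step limits the sort to 64 megabytes and requests the DETAILS messages, so that the SAS log shows whether the sort fit in memory. The data set and variable names are hypothetical, and 64M is only an example value, not a recommendation for your system.

/* Hypothetical names; 64M is an example value */
proc sort data=work.sales out=work.sales_sorted sortsize=64M details;
   by region;
run;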


Default Value of the SORTSIZE= Option

The default SAS configuration file sets this option based on the value of the SORTSIZE system option. To view the default value for your operating environment, execute the following code:

proc options option=sortsize;
run;

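You can override the default value of the SORTSIZE system option by specifying the SORTSIZE= option in the PROC SORT statement, or by setting the SORTSIZE system option in the command line or in the SAS configuration file.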
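
For example, the following sketch changes the value from within a SAS session: first for the remainder of the session, and then for a single step. It assumes that the SORTSIZE system option can be set in an OPTIONS statement in your environment. The values and the data set names are hypothetical.

/* Assumption: SORTSIZE can be changed with an OPTIONS statement */
options sortsize=128M;

/* The SORTSIZE= option in the PROC SORT statement applies to this step only */
proc sort data=work.sales out=work.sales_sorted sortsize=256M;
   by region;
run;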


Improving Performance with the SORTSIZE= Option

The SORTSIZE system option limits the amount of memory that is available to PROC SORT. In general, you should set the SORTSIZE= option no larger than the amount of physical memory that is available to the SAS process.

Setting SORTSIZE memory below the default range might adversely affect sorting and SAS performance in general. Setting SORTSIZE memory to a value that is greater than the default might not necessarily improve performance. The key indicator as to whether additional memory can improve performance is whether the sort will fit in memory.

When the SORTSIZE= value is large enough to sort the entire data set in memory, you can achieve optimal sort performance. If the entire data set to be sorted will not fit in memory, SAS creates a temporary utility file to store the data. In this case, SAS uses a sort algorithm that is tuned to sort using disk space instead of memory.

If the SORTSIZE= value is larger than the amount of available memory, then the operating system will be forced to page excessively. If the SORTSIZE= value is too small, then not all of the sorting can be done in memory, which also results in more disk I/O.

If the data set to be sorted does not fit in memory, then setting a SORTSIZE value in the 64-128M range is generally the optimal value. SORTSIZE should always be set to a value that is at least 8M smaller than MEMSIZE.

Note: You can also use the SORTSIZE system option, which has the same effect as the SORTSIZE= option in the PROC SORT statement.


TAGSORT Option

The TAGSORT option in the PROC SORT statement is useful when there might not be enough disk space to sort a large SAS data set. When you specify the TAGSORT option, only the sort keys (that is, the variables specified in the BY statement) and the observation number for each observation are stored in the temporary utility files. The sort keys, together with the observation number, are referred to as tags. At the completion of the sorting process, the tags are used to retrieve the records from the input data set in sorted order. Thus, in cases where the total number of bytes of the sort keys is small compared with the length of the record, temporary disk use is reduced considerably.

You must have enough disk space to hold an additional copy of the data set (the output data set) and the utility file that contains the tags. By default, this utility file is stored in the Work library. If the Work directory is too small, you can specify a different directory by using the WORK system option. For more information, see WORK System Option: UNIX.

Although the TAGSORT option might reduce temporary disk use, it can increase processing time. However, on systems with limited available disk space, the TAGSORT option might enable data sets to be sorted when sorting would otherwise not be possible.
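
The following sketch shows the TAGSORT option in use. The libref (big), the data set names, and the BY variable are hypothetical. The example assumes a wide data set with a short sort key, which is the case in which the section above says that temporary disk use is reduced the most.

/* Hypothetical libref and names; only the BY variable and the */
/* observation numbers are written to the utility file         */
proc sort data=big.claims out=big.claims_sorted tagsort;
   by claim_id;
run;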


Disk Space Considerations for PROC SORT

You need to consider the following information when determining the amount of disk space needed to run PROC SORT:

input SAS data set

PROC SORT uses the SAS input data set specified by the DATA= option.

output SAS data set

PROC SORT stores the output SAS data set in the location specified by the OUT= option. If the OUT= option is not specified, PROC SORT replaces the input SAS data set with the sorted data set.

utility file

If the UTILLOC system option is not set, the utility file is stored in the Work directory. This utility file is approximately the size of the input SAS data set.

Note: You can use the UTILLOC system option to specify a location in which applications can store utility files.

temporary output SAS data set

During the sort, PROC SORT creates its output in the directory specified in the OUT= option (or in the directory of the input SAS data set if the OUT= option is not specified). The temporary data set has the same filename as the original data set, except that it has an extension of .lck. After the sort completes successfully, the original data set is deleted, and the temporary data set is renamed to match the original data set. Therefore, you need to have enough available disk space in the target directory to hold two copies of the data set.

You can reduce the amount of disk space needed by specifying the OVERWRITE option in the PROC SORT statement. When you specify this option, SAS deletes the input data set before it writes the sorted replacement data set of the same name, so the two copies do not have to exist at the same time. Because an interrupted sort could leave you without the data, use this option only with a data set that is backed up or that you can reconstruct. For more information, see the SORT procedure in the Base SAS Procedures Guide.
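
As a minimal sketch, the following step sorts a data set in place with the OVERWRITE option. The libref (mylib), the data set name, and the BY variable are hypothetical, and the example assumes that a backup copy of the data set exists elsewhere.

/* Hypothetical names; assumes mylib.history is backed up */
proc sort data=mylib.history overwrite;
   by account_id;
run;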


Performance Tuning for PROC SORT


How SAS Determines the Amount of Memory to Use

The MEMSIZE system option sets the upper limit on the amount of memory that is available to the SAS process. The SORTSIZE system option limits the amount of memory that is available to PROC SORT. The REALMEMSIZE system option specifies the amount of real (not virtual) memory that will be made available to SAS.

While memory settings below the default values for MEMSIZE and SORTSIZE might adversely affect sorting and SAS performance, making large amounts of memory available might be of no benefit. The key for determining whether additional memory might improve performance is whether the sort will fit in memory. If the sorted file requires more memory than is allocated, then a SORTSIZE value in the range of 64-128M is generally the optimal value. SORTSIZE should always be set to a value that is at least 8M smaller than MEMSIZE.

For information about setting the REALMEMSIZE system option, see Guidelines for Setting the REALMEMSIZE System Option.

Note: If you receive an out of memory error, then increase the value of MEMSIZE. For more information, see MEMSIZE System Option: UNIX.
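
To compare the current settings before you change anything, you can write the option values to the SAS log. The following sketch uses the GETOPTION function through %SYSFUNC. The OPTIONS statement that follows assumes that the sort does not fit in memory and that MEMSIZE is at least 8M larger than the chosen value. The value 96M is only an example from the 64-128M range discussed above.

/* Write the current values to the SAS log */
%put MEMSIZE=%sysfunc(getoption(memsize));
%put SORTSIZE=%sysfunc(getoption(sortsize));

/* Example value only; keep SORTSIZE at least 8M below MEMSIZE */
options sortsize=96M;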


Guidelines for Setting the REALMEMSIZE System Option

Because PROC SORT might use the REALMEMSIZE system option to determine how much memory to use, it is important that the REALMEMSIZE value reflects the amount of memory that is available on your system. The default value is 80% of the MEMSIZE setting. If REALMEMSIZE is set too high, then PROC SORT might use more memory than is actually available. Using too much memory will cause excessive paging and adversely impact system performance.

In general, REALMEMSIZE should be set to the amount of physical memory (not including swap space) that you expect to be available to SAS at run time. A good starting value is the amount of physical memory installed on the computer less the amount that is being used by running applications and the operating system. You can experiment with the REALMEMSIZE value until you reach optimum performance for your environment. In some cases, optimum performance can be achieved with a very low REALMEMSIZE value. A low value could cause SAS to use less memory and leave more memory for the operating system to perform I/O caching.

For more information, see REALMEMSIZE System Option: UNIX.
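
The following sketch assumes that REALMEMSIZE is set when SAS starts, for example on the command line. The invocation appears only as a comment because it is a shell command, not SAS code. The values and the program name (myprog.sas) are hypothetical. The PROC OPTIONS step then confirms the value that is in effect for the session.

/* Assumed invocation from a UNIX shell (example values only):  */
/*    sas -memsize 4G -realmemsize 2G myprog.sas                */

/* Confirm the value that is in effect */
proc options option=realmemsize;
run;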


Using Other Options that Affect Performance

The THREADS system option controls whether threaded procedures will use threads. It is available both as a system option and as a procedure-level override in the PROC SORT statement.

The CPUCOUNT option is directly related to the THREADS option and defaults to the number of CPUs on your computer. Depending on your file system and the number of concurrent users, you might benefit from lowering the CPUCOUNT on machines that have many CPUs. When the value of CPUCOUNT equals ACTUAL, SAS returns the number of physical CPUs that are associated with the operating environment where SAS is executing.

The UTILLOC system option allows for the spreading of utility files, and is a good option for balancing I/O.

The DETAILS option, specified in the PROC SORT statement, causes PROC SORT to write messages to the SAS log detailing whether the sort was performed in memory. If the sort was not performed in memory, then the details that are written include the number of utility files and their sizes.

For more information about the THREADS, CPUCOUNT, and UTILLOC system options, see the SAS Language Reference: Dictionary.
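
The following sketch combines these options. It enables threading for the session, lowers CPUCOUNT to a hypothetical value of 4, and then shows the procedure-level override with NOTHREADS on one step. The data set and variable names are hypothetical. UTILLOC is not shown because this sketch assumes that it is set when SAS starts.

/* Session-level settings; 4 is an example value */
options threads cpucount=4;

/* Threaded sort; DETAILS reports whether the sort was done in memory */
proc sort data=work.sales out=work.sales_sorted details;
   by region;
run;

/* Procedure-level override: disable threading for this step only */
proc sort data=work.sales out=work.sales_unthreaded nothreads;
   by region;
run;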


Creating Your Own Collating Sequences

If you want to provide your own collating sequences or change a collating sequence provided for you, use the TRANTAB procedure to create or modify translation tables. For more information, see the TRANTAB procedure in the SAS National Language Support (NLS): Reference Guide. When you create your own translation tables, they are stored in your Sasuser.Profile catalog, and they override any translation tables by the same name that are stored in the Host catalog.

Note: System managers can modify the Host catalog by copying newly created tables from the Profile catalog to the Host catalog. Then, all users can access the new or modified translation table.

If you are using the SAS windowing environment and want to see the names of the collating sequences that are stored in the Host catalog, issue the following command from any window:

catalog sashelp.host

If you are not using the SAS windowing environment, then issue the following statements to generate a list of the contents in the Host catalog:

proc catalog catalog=sashelp.host;
contents;
run;

Entries of type TRANTAB are the collating sequences.

To see the contents of a particular translation table, use the following statements:

proc trantab table=table-name;
list;
run;

The contents of collating sequences are displayed in the SAS log.


Specifying the Host Sort Utility


Introduction to Using the Host Sort

SAS supports one host sort utility on UNIX called syncsort. You can use this sorting application as an alternative sorting algorithm to the SAS sort. SAS determines which sort to use by the values that are set for the SORTNAME, SORTPGM, SORTCUT, and SORTCUTP system options.


Setting the Host Sort Utility as the Sort Algorithm

To specify a host sort utility as the sort algorithm, complete the following steps:

  1. Specify the name of the host utility (syncsort) in the SORTNAME system option.

  2. Set the SORTPGM system option to tell SAS when to use the host sort utility.

    • If you specify SORTPGM=HOST, then SAS always prefers to use the host sort utility.

    • If you specify SORTPGM=BEST, then SAS chooses the best sorting method (either the SAS sort or the host sort) for the situation.
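
The two steps above can be sketched as a single OPTIONS statement. This example assumes that syncsort is installed and licensed on your system and that both options can be set in an OPTIONS statement in your environment. Otherwise, set them at invocation or in the configuration file.

/* Assumption: syncsort is installed and available to SAS */
options sortname='syncsort' sortpgm=host;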


Sorting Based on Size or Observations

The sort routine that SAS uses can be based on either the number of observations in a data set or the size of the data set. When the SORTPGM system option is set to BEST, SAS uses the first available and pertinent sorting algorithm, choosing between the host sort utility and the SAS sort. To make this choice, SAS looks at the values of the SORTCUT and SORTCUTP system options.

The SORTCUT option specifies the number of observations above which the host sort utility is used instead of the SAS sort. The SORTCUTP option specifies the number of bytes in the data set above which the host sort utility is used.

If SORTCUT and SORTCUTP are set to zero, SAS uses the SAS sort routine. If you specify both options and either condition is met, SAS uses the host sort utility.

When the following OPTIONS statement is in effect, the host sort utility is used when the number of observations is 501 or greater:

options sortpgm=best sortcut=500;

In this example, the host sort utility is used when the size of the data set is greater than 40M:

options sortpgm=best sortcutp=40M;

For more information about these sort options, see SORTCUT System Option: UNIX, SORTCUTP System Option: UNIX, and SORTPGM System Option: UNIX.
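
As a further sketch, both thresholds can be set together. Based on the rule described above, the host sort utility is used when either condition is met, that is, when the data set has more than 500 observations or is larger than 40M. The threshold values are the same example values used above.

/* Host sort is used if either threshold is exceeded */
options sortpgm=best sortcut=500 sortcutp=40M;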


Changing the Location of Temporary Files Used by the Host Sort Utility

By default, the host sort utility uses the location that is specified in the -WORK option for temporary files. To change the location of these temporary files, specify a location by using the SORTDEV system option. Here is an example:

options sortdev="/tmp/host";

For more information, see SORTDEV System Option: UNIX.


Passing Options to the Host Sort Utility

To specify options for the sort utility, use the SORTANOM system option. For a list of valid options, see SORTANOM System Option: UNIX.


Passing Parameters to the Host Sort Utility

To pass parameters to the sort utility, use the SORTPARM system option. The parameters that you can specify depend on the host sort utility. For more information, see SORTPARM System Option: UNIX.


Specifying the SORTSEQ= Option with a Host Sort Utility

The SORTSEQ= option enables you to specify the collating sequence for your sort. For a list of valid values, see the SORT procedure in Base SAS Procedures Guide.

CAUTION:
If you are using a host sort utility to sort your data, then specifying the SORTSEQ= option might corrupt the character BY variables if the sort sequence translation table and its inverse are not one-to-one mappings.

In other words, for the sort to work, the translation table must map each character to a unique weight, and the inverse table must map each weight to a unique character.

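If your translation tables do not map one to one, then you can use an alternative method to perform your sort. One such method, creating a view with a dummy BY variable, is shown in Example: Creating a View with a Dummy BY Variable.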

Note: After using one of these methods, you might need to perform subsequent BY processing using either the NOTSORTED option or the NOBYSORTED system option. For more information about the NOTSORTED option, see BY Statement in SAS Language Reference: Dictionary. For more information about the NOBYSORTED system option, see BYSORTED System Option in SAS Language Reference: Dictionary.
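
Because the caution above applies only to a host sort utility, one conservative approach is to request the SAS sort explicitly for steps that specify SORTSEQ=. This is a sketch under stated assumptions, not a method taken from this documentation. The data set names and the BY variable are hypothetical, and EBCDIC is used only as an example collating sequence.

/* Assumption: forcing the SAS sort avoids the host sort caution */
options sortpgm=sas;

/* Hypothetical names; EBCDIC is an example collating sequence */
proc sort data=work.names out=work.names_ebcdic sortseq=ebcdic;
   by name;
run;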


Example: Creating a View with a Dummy BY Variable

The following example shows how to create a view using a dummy BY variable. SAS uses the BEST argument in the SORTPGM system option to sort the data. By using BEST, SAS selects either the host sort or the SAS sort.

options nodate pageno=1 nostimer ls=78 ps=60;
options sortpgm=best msglevel=i;

data one;
   input name $ age;
   datalines;
anne 35
ALBERT 10
JUAN 90
janet 5
bridget 23
BRIAN 45
;

data oneview / view=oneview;
   set one;
   name1=upcase(name);
run;

proc sort data=oneview out=final(drop=name1);
   by name1;
run;

proc print data=final;
run;

SAS writes the following note about the sort to the log:

NOTE: SAS threaded sort was used.

SAS creates the following output:

Creating a View with a Dummy BY Variable

                                The SAS System                               1

                            Obs    name       age

                             1     ALBERT      10
                             2     anne        35
                             3     BRIAN       45
                             4     bridget     23
                             5     janet        5
                             6     JUAN        90
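
Because FINAL is ordered by the dropped variable NAME1 rather than by NAME, later BY-group processing on NAME might need the NOTSORTED option, as the note in the preceding section describes. The following step is a minimal sketch of that follow-up. It reuses the FINAL data set created above.

/* NAME is not in ASCII order in FINAL, so NOTSORTED is specified */
proc print data=final;
   by name notsorted;
run;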
