SAS Data Sets |
The Sort Indicator |
After a data set is sorted, a sort indicator is added to the data set descriptor information. The sort indicator is updated without a permanent sort of the data set by using the SORTEDBY= data set option. The Sortedby and Validated sort information is updated when the SORTEDBY= data set option is used.
The sort indicator contains some or all of the following sort information of a SAS data set:
how the data set is sorted by which variable or variables
whether the order for a variable is descending or ascending
the character set used for character variables
the collating sequence used for ordering character data
collation rules if the data set is sorted linguistically
whether there is only one observation for any given BY group (use of NODUPKEY option)
whether there are no adjacent duplicate observations (use of NODUPREC option)
whether the data set is validated
The sort indicator is set when a data set is sorted by a SORT procedure, an SQL procedure with an ORDER BY clause, a DATASETS procedure MODIFY statement, or a SORTEDBY= data set option. If the SORT or SQL procedures were used to sort the data set, which is being sorted by SAS, the CONTENTS procedure output indicates that the Validated sort information is YES. If the SORTEDBY= data set option was used to sort the data set, which is being sorted by the user, the CONTENTS procedure output indicates the Validated sort information is set to NO and the Sortedby sort information is updated with the variable or variables specified by the data set option.
Data sets can be sorted outside of SAS. In that case, you might use the SORTEDBY= data set option or the DATASETS procedure MODIFY statement to add the sort order to the sort indicator. In this case, they are not validated. For more information, see Validating That a Data Set Is Sorted.
To view the sort indicator information, use the CONTENTS procedure or the CONTENTS statement in the DATASETS procedure. The following three examples show the sort indicator information in the CONTENTS procedure output.
The first example is a data set created without any type of sort:
options yearcutoff=1920; data sasuser.sorttest1; input priority 1. +1 indate date7. +1 office $ code $; format indate date7.; datalines; 1 03may07 CH J8U 1 21mar07 LA M91 1 01dec07 FW L6R 1 27feb06 FW Q2A 2 15jan07 FW I9U 2 09jul07 CH P3Q 3 08apr06 CH H5T 3 31jan06 FW D2W ; proc contents data=sasuser.sorttest1; run;
Note that the CONTENTS procedure output indicates there was no sort. SAS did not sort the data set, and the user did not specify that the data is sorted.
Contents of SORTTEST1 Data Set -- No Sorting
The SAS System 12:47 Wednesday, October 3, 2007 1
The CONTENTS Procedure
Data Set Name SASUSER.SORTTEST1 Observations 8
Member Type DATA Variables 4
Engine V9 Indexes 0
Created Wed, Oct 03, 2007 12:55:59 PM Observation Length 32
Last Modified Wed, Oct 03, 2007 12:55:59 PM Deleted Observations 0
Protection Compressed NO
Data Set Type Sorted NO
Label
Data Representation WINDOWS_32
Encoding wlatin1 Western (Windows)
Engine/Host Dependent Information
Data Set Page Size 4096
Number of Data Set Pages 1
First Data Page 1
Max Obs per Page 126
Obs in First Data Page 8
Number of Data Set Repairs 0
Filename C:\Documents and Settings\suholm\sasuser\sorttest1.sas7bdat
Release Created 9.0201B0
Host Created XP_PRO
Alphabetic List of Variables and Attributes
# Variable Type Len Format
4 code Char 8
2 indate Num 8 DATE7.
3 office Char 8
1 priority Num 8
In the second example, the data set is created using the SORTEDBY= data set option in the DATA statement.
options yearcutoff=1920; data sasuser.sorttest1 (sortedby=priority descending indate); input priority 1. +1 indate date7. +1 office $ code $; format indate date7.; datalines; 1 03may01 CH J8U 1 21mar01 LA M91 1 01dec00 FW L6R 1 27feb99 FW Q2A 2 15jan00 FW I9U 2 09jul99 CH P3Q 3 08apr99 CH H5T 3 31jan99 FW D2W ; proc contents data=sasuser.sorttest1; run;
Note that the CONTENTS procedure output shows that the data set is sorted. Therefore, a Sort Information section containing sort indicator information is created. In the Sort Information section, the Sortedby information indicates the data set is sorted by the PRIORITY variable and is in descending order by the INDATE variable. The data set is sorted using the SORTEDBY= data set option, so the Validated information is NO. The Character Set information for the data set is ANSI.
Contents of SORTTEST1 Data Set - Sorted
The SAS System 12:47 Wednesday, October 3, 2007 2 The CONTENTS Procedure Data Set Name SASUSER.SORTTEST1 Observations 8 Member Type DATA Variables 4 Engine V9 Indexes 0 Created Wed, Oct 03, 2007 01:02:55 PM Observation Length 32 Last Modified Wed, Oct 03, 2007 01:02:55 PM Deleted Observations 0 Protection Compressed NO Data Set Type Sorted YES Label Data Representation WINDOWS_32 Encoding wlatin1 Western (Windows) Engine/Host Dependent Information Data Set Page Size 4096 Number of Data Set Pages 1 First Data Page 1 Max Obs per Page 126 Obs in First Data Page 8 Number of Data Set Repairs 0 Filename C:\Documents and Settings\suholm\sasuser\sorttest1.sas7bdat Release Created 9.0201B0 Host Created XP_PRO Alphabetic List of Variables and Attributes # Variable Type Len Format 4 code Char 8 2 indate Num 8 DATE7. 3 office Char 8 1 priority Num 8 Sort Information Sortedby priority DESCENDING indate Validated NO Character Set ANSI
In the third example, the data set is sorted using the SORT procedure.
options yearcutoff=1920; data sasuser.sorttest1; input priority 1. +1 indate date7. +1 office $ code $; format indate date7.; datalines; 1 03may01 CH J8U 1 21mar01 LA M91 1 01dec00 FW L6R 1 27feb99 FW Q2A 2 15jan00 FW I9U 2 09jul99 CH P3Q 3 08apr99 CH H5T 3 31jan99 FW D2W ; proc sort data=sasuser.sorttest1; by priority descending indate;run; proc contents data=sasuser.sorttest1; run;
Note that the CONTENTS procedure output shows that the data set is sorted. Therefore, a Sort Information section containing sort indicator information is created. In the Sort Information section, the Sortedby information indicates the data set is sorted by the PRIORITY variable and is in descending order by the INDATE variable. The data set is sorted using the SORT procedure, so the Validated information is YES. The Character Set information for the data set is ANSI.
Contents of SORTTEST1 Data Set - Validated Sort
The SAS System 12:47 Wednesday, October 3, 2007 4 The CONTENTS Procedure Data Set Name SASUSER.SORTTEST1 Observations 8 Member Type DATA Variables 4 Engine V9 Indexes 0 Created Wed, Oct 03, 2007 01:08:35 PM Observation Length 32 Last Modified Wed, Oct 03, 2007 01:08:35 PM Deleted Observations 0 Protection Compressed NO Data Set Type Sorted YES Label Data Representation WINDOWS_32 Encoding wlatin1 Western (Windows) Engine/Host Dependent Information Data Set Page Size 4096 Number of Data Set Pages 1 First Data Page 1 Max Obs per Page 126 Obs in First Data Page 8 Number of Data Set Repairs 0 Filename C:\Documents and Settings\suholm\sasuser\sorttest1.sas7bdat Release Created 9.0201B0 Host Created XP_PRO Alphabetic List of Variables and Attributes # Variable Type Len Format 4 code Char 8 2 indate Num 8 DATE7. 3 office Char 8 1 priority Num 8 Sort Information Sortedby priority DESCENDING indate Validated YES Character Set ANSI
How SAS Uses the Sort Indicator to Improve Performance |
The sort information provided by the sort indicator is used internally for performance improvements. There are several ways to improve performance using the sort indicator:
SAS uses the sort indicator to validate whether there was a previous sort. If there was a previous sort from a SORT procedure or an SQL procedure with an ORDER BY clause, then SAS does not perform another sort.
The SORT procedure sets the sort indicator when a sort occurs. The SORT procedure checks for the sort indicator before it sorts a data set so that data is not resorted unnecessarily. For more information, see the SORT Procedure in Base SAS Procedures Guide.
The SQL procedure uses the sort indicator to process queries more efficiently and to determine whether an internal sort is necessary before performing a join. For more information, see the SQL Procedure in Base SAS Procedures Guide.
When using the sort indicator during index creation, SAS determines whether the data is already sorted by the key variable or variables in ascending order by checking the sort indicator in the data file. If the values in the sort indicator are in ascending order, SAS does not sort the values for the index file. For more information, see Overview of Creating Indexes.
When processing a WHERE expression without an index, SAS first checks the sort indicator. If the Validated sort information is YES, SAS stops reading the file once there are no more values that satisfy the WHERE expression.
If an index is selected for WHERE expression processing, the sort indicator for that data set is changed to reflect the order that is specified by the index.
For BY-group processing, if the data set is already sorted by the BY variable, SAS does not use the index, even if the data set is indexed on the BY variable.
If the Validated sort information is set to YES, SAS does not need to perform another sort.
Validating That a Data Set Is Sorted |
Any SAS procedure that requires data to be sorted as part of the process checks the sort indicator information. The sort indicator is set when a data set is sorted by a SORT procedure, an SQL procedure with an ORDER BY clause, a DATASETS procedure MODIFY statement, or a SORTEDBY= data set option. If the SORT or SQL procedures were used to sort the data set, the CONTENTS procedure output indicates the Validated sort information is YES. If the SORTEDBY= data set option was used to sort the data set, the CONTENTS procedure output indicates the Validated sort information is NO. For examples of the CONTENTS procedure output, see Example 1: Using No Sorting, Example 2: Using the SORTEDBY= Data Set Option, and Example 3: Using the SORT Procedure.
You can use the SORTVALIDATE system option to specify whether the SORT procedure validates that a data set is sorted correctly when the data set sort indicator shows that a user has specified that the data set is sorted. The user can specify a sort order by using the SORTEDBY= data set option in a DATA statement or by using the SORTEDBY= option in the DATASETS procedure MODIFY statement. When the sort indicator is set by a user, SAS cannot be absolutely certain that a data set is sorted according to the variables in the BY statement.
If the SORTVALIDATE system option is set and the data set sort indicator was set by a user, the SORT procedure performs a sequence check on each observation to ensure that the data set is sorted according to the variables in the BY statement. If the data set is not sorted correctly, SAS sorts the data set.
At the end of a successful sequence check or at the end of a sort, the SORT procedure sets the sort indicator Validated sort information to YES. If a sort is performed, the SORT procedure also updates the sort indicator Sortedby sort information to the variables in the BY statement. If an output data set is specified, the sort indicator Validated sort information in the output data set is set to YES. If no sort is necessary, the data set is copied to the output data set. For more information about validated data sets, see the SORTVALIDATE= System Option in SAS Language Reference: Dictionary.
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.