Usage Note 31369: Sorting Text Without Regard to Case in SAS 9.2
PROC SORT is a frequently used Base SAS procedure with simple syntax. Unfortunately, anomalies in the key variables, especially mixed-case character values, can present special challenges.
By default, uppercase and lowercase letters are sorted by their internal storage representation, not by their position in the alphabet. In the ASCII-based encodings used on Windows and UNIX, capital Z precedes lowercase a; in EBCDIC-based encodings, lowercase z precedes capital A. Sorting mixed-case text can therefore produce unexpected results that differ from one platform to another.
/* Sort the original data set */
proc sort data=maps.names out=territories;
   by Territory Name;
   where Territory contains 'territory of France';
run;

/* Print the result; values are ordered by their internal storage representation */
proc print data=territories;
   var Name Territory;
   title1 'Overseas French Territories';
   title2 'No adjustment for case of text';
run;
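The storage-order behavior described above can be checked directly in any session. The following is a minimal sketch that uses the RANK function to report the collating position of two characters in the session encoding; the variable names rank_upper_z and rank_lower_a are illustrative only.

```sas
/* Report the collating position of 'Z' and 'a' in the session encoding. */
/* rank_upper_z and rank_lower_a are illustrative names, not part of the note. */
data _null_;
   rank_upper_z = rank('Z');
   rank_lower_a = rank('a');
   put rank_upper_z= rank_lower_a=;
   if rank_upper_z < rank_lower_a then
      put 'Uppercase Z sorts before lowercase a (ASCII-style order).';
   else
      put 'Lowercase a sorts before uppercase Z (EBCDIC-style order).';
run;
```

In an ASCII-based session, RANK returns 90 for Z and 97 for a, which is why every uppercase letter sorts ahead of every lowercase letter by default.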
A traditional technique to remove case sensitivity from character comparisons is to apply the UPCASE function. To accomplish this when sorting data, a copy of the key variable(s) must be created, with each value converted to uppercase. The converted values are used as the keys in the BY statement of PROC SORT.
/* Create a copy of the key variable in uppercase */
data territories;
   set maps.names;
   where Territory contains 'territory of France';
   Territory_Upper=upcase(Territory);
run;

/* Sort by the new variable */
proc sort data=territories;
   by Territory_Upper Name;
run;
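The same UPCASE technique can also be expressed without a helper variable by ordering on an expression in PROC SQL. This is an alternative sketch, not part of the original note; the output table name territories_sql is hypothetical.

```sas
/* Alternative sketch: order by an uppercase expression instead of a stored copy. */
/* territories_sql is a hypothetical output table name.                           */
proc sql;
   create table territories_sql as
   select Name, Territory
   from maps.names
   where Territory contains 'territory of France'
   order by upcase(Territory), Name;
quit;
```

The trade-off is that the uppercase values are computed only for the ordering step, so they are not available as a BY variable in later procedures.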
In SAS 9.2, PROC SORT supports the SORTSEQ=LINGUISTIC option. Linguistic collation sorts values according to the language rules of the current locale setting.
A primary linguistic collation rule is to treat alphabetic characters equally regardless of case. This simple rule is adequate for many applications, and the STRENGTH=PRIMARY suboption limits the comparison to the primary rules.
Secondary rules include distinctions based on diacritical (accent) marks and separation of uppercase and lowercase letters within the groups defined by the primary rule.
/* Sort the original data set using SORTSEQ=LINGUISTIC */
proc sort data=maps.names out=territories
          sortseq=linguistic(strength=primary);
   by Territory Name;
   where Territory contains 'territory of France';
run;

/* Print the result; values that differ only in case are treated as equal */
proc print data=territories;
   var Name Territory;
   title1 'Overseas French Territories';
   title2 'Using Linguistic Sorting Rules (Primary only)';
run;
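When values that differ only in case should still appear in a predictable order within each group, a higher strength level can be requested. The sketch below assumes the STRENGTH=TERTIARY and CASE_FIRST=UPPER suboptions as described in the PROC SORT documentation for linguistic collation; confirm their availability for your release. The output data set name territories_tertiary is hypothetical.

```sas
/* Sketch: keep letters grouped alphabetically, but place uppercase before  */
/* lowercase when values are otherwise equal.                               */
/* Assumes STRENGTH=TERTIARY and CASE_FIRST=UPPER are supported as          */
/* documented; territories_tertiary is a hypothetical data set name.        */
proc sort data=maps.names out=territories_tertiary
          sortseq=linguistic(strength=tertiary case_first=upper);
   by Territory Name;
   where Territory contains 'territory of France';
run;
```

Unlike STRENGTH=PRIMARY, this ordering still distinguishes values by case, so it suits reports where a stable, repeatable order matters more than treating such values as identical.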
Operating System and Release Information
Product Family: SAS System | Product: Base SAS
System | Reported Release | Fixed Release* |
z/OS | 9.2 TS1M0 | |
Microsoft® Windows® for 64-Bit Itanium-based Systems | 9.2 TS1M0 | |
Microsoft Windows Server 2003 Datacenter 64-bit Edition | 9.2 TS1M0 | |
Microsoft Windows Server 2003 Enterprise 64-bit Edition | 9.2 TS1M0 | |
Microsoft Windows XP 64-bit Edition | 9.2 TS1M0 | |
Microsoft® Windows® for x64 | 9.2 TS1M0 | |
Microsoft Windows 2000 Advanced Server | 9.2 TS1M0 | |
Microsoft Windows 2000 Datacenter Server | 9.2 TS1M0 | |
Microsoft Windows 2000 Server | 9.2 TS1M0 | |
Microsoft Windows 2000 Professional | 9.2 TS1M0 | |
Microsoft Windows Server 2003 Datacenter Edition | 9.2 TS1M0 | |
Microsoft Windows Server 2003 Enterprise Edition | 9.2 TS1M0 | |
Microsoft Windows Server 2003 Standard Edition | 9.2 TS1M0 | |
Microsoft Windows XP Professional | 9.2 TS1M0 | |
Windows Vista | 9.2 TS1M0 | |
64-bit Enabled AIX | 9.2 TS1M0 | |
64-bit Enabled HP-UX | 9.2 TS1M0 | |
64-bit Enabled Solaris | 9.2 TS1M0 | |
HP-UX IPF | 9.2 TS1M0 | |
Linux | 9.2 TS1M0 | |
Linux for x64 | 9.2 TS1M0 | |
OpenVMS on HP Integrity | 9.2 TS1M0 | |
Solaris for x64 | 9.2 TS1M0 | |
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.
This note reviews a seemingly simple problem that becomes substantially easier to solve with new features of the Base SAS 9.2 SORT procedure. The note is adapted from Don't Be a SAS Dinosaur: Modernize Your SAS Code by Warren Repole, SAS Institute.
Type: | Usage Note |
Priority: | |
Topic: | SAS Reference ==> Procedures ==> SORT |
Date Modified: | 2008-03-25 13:14:15 |
Date Created: | 2008-03-03 14:17:30 |