Usage Note 31369: Sorting Text Without Regard to Case in SAS 9.2

PROC SORT is a frequently used Base SAS procedure with simple syntax. However, irregularities in the key variables, especially mixed-case character values, can make the sorted order hard to predict.

By default, uppercase and lowercase letters are sorted by their internal storage representation, not by their position in the alphabet. In the ASCII-based encodings used on Windows and UNIX, capital Z precedes lowercase a; in EBCDIC-based encodings, lowercase z precedes capital A. As a result, sorting mixed-case text can produce unexpected results that differ from one platform to another.

     /* Sort the original data set */
   proc sort data=maps.names out=territories;
     by Territory  Name;
     where Territory contains 'territory of France';
   run;
   proc print data=territories;
     var Name Territory;
     title1 'Overseas French Territories';
     title2 'No adjustment for case of text';
   run;
   

[Output example 1]
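
The default behavior described above can also be reproduced without the MAPS library. The following is a minimal sketch; the data set names WORK.MIXED_CASE and WORK.DEFAULT_ORDER and the sample values are hypothetical and chosen only for illustration. On an ASCII-based system, the values that begin with an uppercase letter are expected to sort ahead of all values that begin with a lowercase letter.

```sas
  /* Hypothetical sample data with mixed-case values */
data work.mixed_case;
  length Word $ 10;
  input Word $;
  datalines;
apple
Zebra
banana
Apple
zebra
;
run;

  /* Default sort: order follows the internal character codes */
proc sort data=work.mixed_case out=work.default_order;
  by Word;
run;

proc print data=work.default_order noobs;
  title1 'Default collating sequence';
run;
```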

A traditional technique to remove case sensitivity from character comparisons is to apply the UPCASE function. To accomplish this when sorting data, a copy of the key variable(s) must be created, with each value converted to uppercase. The converted values are used as the keys in the BY statement of PROC SORT.

     /* Create a copy of the key variable in uppercase */
   data territories;
     set maps.names;
     where Territory contains 'territory of France';
     Territory_Upper=upcase(Territory);
   run;
     /* Sort by the new variable */
   proc sort data=territories;
     by Territory_Upper  Name;
   run; 
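
The helper variable is needed only for ordering, so it can be dropped as the sorted data set is written. The sketch below assumes the DROP= data set option on the OUT= data set for this purpose; the data set and variable names (WORK.WORDS, WORK.WORDS_SORTED, Word, Word_Upper) are hypothetical.

```sas
  /* Hypothetical sample data plus an uppercase copy of the key */
data work.words;
  length Word $ 10;
  input Word $;
  Word_Upper = upcase(Word);
  datalines;
apple
Zebra
banana
Apple
;
run;

  /* Sort by the uppercase copy and drop it from the output data set */
proc sort data=work.words out=work.words_sorted(drop=Word_Upper);
  by Word_Upper;
run;

proc print data=work.words_sorted noobs;
  title1 'Sorted by uppercase copy of the key';
run;
```

Values that differ only in case, such as apple and Apple, compare as equal on the helper variable, so their relative order in the output follows their order in the input data set when the EQUALS option is in effect.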
   

In SAS 9.2, PROC SORT supports the SORTSEQ=LINGUISTIC option. Linguistic collation sorts values according to the language rules of the current locale setting.

A primary linguistic collation rule is to treat alphabetic characters equally regardless of case. This simple rule is adequate for many applications.

Finer-grained rules, which apply at higher strength settings, distinguish values by diacritical (accent) marks and separate uppercase from lowercase letters within the groups defined by the primary rule.

     /* Sort the original data set using SORTSEQ=LINGUISTIC */
   proc sort data=maps.names  out=territories
             sortseq=linguistic(strength=primary);
     by Territory Name;
     where Territory contains 'territory of France';
   run;
   proc print data=territories;
     var Name Territory;
     title1 'Overseas French Territories';
     title2 'Using Linguistic Sorting Rules (Primary only)';
   run; 
   

[Output example 2]
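
To see the effect of the strength setting, the same data can be sorted at two levels and the results compared. This is a sketch under stated assumptions: the data set names and sample values are hypothetical, and STRENGTH=TERTIARY is used here as a level at which case differences are taken into account.

```sas
  /* Hypothetical sample data with mixed-case values */
data work.mixed_case;
  length Word $ 10;
  input Word $;
  datalines;
apple
Zebra
banana
Apple
zebra
;
run;

  /* Primary strength: letters compare equal regardless of case */
proc sort data=work.mixed_case out=work.primary_order
          sortseq=linguistic(strength=primary);
  by Word;
run;

  /* Tertiary strength: case differences separate otherwise equal values */
proc sort data=work.mixed_case out=work.tertiary_order
          sortseq=linguistic(strength=tertiary);
  by Word;
run;

proc print data=work.primary_order noobs;
  title1 'Linguistic sort, STRENGTH=PRIMARY';
run;

proc print data=work.tertiary_order noobs;
  title1 'Linguistic sort, STRENGTH=TERTIARY';
run;
```

In both outputs the words should appear in alphabetical order; the two listings are expected to differ only in how pairs such as apple and Apple are arranged relative to each other.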



Operating System and Release Information

Product Family: SAS System
Product: Base SAS

System                                                    SAS Release
                                                          Reported    Fixed*
z/OS                                                      9.2 TS1M0
Microsoft® Windows® for 64-Bit Itanium-based Systems      9.2 TS1M0
Microsoft Windows Server 2003 Datacenter 64-bit Edition   9.2 TS1M0
Microsoft Windows Server 2003 Enterprise 64-bit Edition   9.2 TS1M0
Microsoft Windows XP 64-bit Edition                       9.2 TS1M0
Microsoft® Windows® for x64                               9.2 TS1M0
Microsoft Windows 2000 Advanced Server                    9.2 TS1M0
Microsoft Windows 2000 Datacenter Server                  9.2 TS1M0
Microsoft Windows 2000 Server                             9.2 TS1M0
Microsoft Windows 2000 Professional                       9.2 TS1M0
Microsoft Windows Server 2003 Datacenter Edition          9.2 TS1M0
Microsoft Windows Server 2003 Enterprise Edition          9.2 TS1M0
Microsoft Windows Server 2003 Standard Edition            9.2 TS1M0
Microsoft Windows XP Professional                         9.2 TS1M0
Windows Vista                                             9.2 TS1M0
64-bit Enabled AIX                                        9.2 TS1M0
64-bit Enabled HP-UX                                      9.2 TS1M0
64-bit Enabled Solaris                                    9.2 TS1M0
HP-UX IPF                                                 9.2 TS1M0
Linux                                                     9.2 TS1M0
Linux for x64                                             9.2 TS1M0
OpenVMS on HP Integrity                                   9.2 TS1M0
Solaris for x64                                           9.2 TS1M0
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.