Working with Grouped or Sorted Observations |
Understanding Sorted Data |
By default, groups appear in ascending order of the BY values. In some cases you want to emphasize the order in which the observations are sorted, not the fact that they can be grouped. For example, you may want to alphabetize the tours by country.
To sort your data in a particular order, use the SORT procedure just as you do for grouped data. When the sorted order is more important than the grouping, you usually want only one observation with a given BY value in the resulting data set. Therefore, you may need to remove duplicate observations.
Operating Environment Information: The SORT procedure accesses either a sorting utility that is supplied as part of SAS, or a sorting utility supplied by the host operating system. All examples in this documentation use the SAS sorting utility. Some operating system utilities do not accept particular options, including the NODUPRECS option described later in this section. The default sorting utility is set by your site. For more information about the utilities available to you, see the documentation for your operating system.
Sorting Data |
The following example sorts data set MYLIB.ARCH_OR_SCEN by COUNTRY:
options pagesize=60 linesize=80 pageno=1 nodate; proc sort data=mylib.arch_or_scen out=bycountry; by Country; run; proc print data=bycountry; title 'Tours in Alphabetical Order by Country'; run;
The following output displays the results:
Tours in Alphabetical Order by Country 1 Land Obs Country TourType Nights Cost Vendor 1 France architecture 8 575 World 2 Greece scenery 12 698 Express 3 Ireland scenery 7 558 Express 4 Italy architecture 8 468 Express 5 Japan architecture 8 720 Express 6 New Zealand scenery 16 1489 Southsea 7 Spain architecture 10 510 World 8 Switzerland scenery 9 734 World
Deleting Duplicate Observations |
You can eliminate duplicate observations in a SAS data set by using the NODUPRECS option with the SORT procedure. The following programs show you how to create a SAS data set and then remove duplicate observations.
The external file shown below contains a duplicate observation for Switzerland:
Spain architecture 10 510 World Japan architecture 8 720 Express Switzerland scenery 9 734 World France architecture 8 575 World Switzerland scenery 9 734 World Ireland scenery 7 558 Express New Zealand scenery 16 1489 Southsea Italy architecture 8 468 Express Greece scenery 12 698 Express
The following DATA step creates a permanent SAS data set named MYLIB.ARCH_OR_SCEN2.
options pagesize=60 linesize=80 pageno=1 nodate; libname mylib 'SAS-data-library'; data mylib.arch_or_scen2; infile 'input-file'; input Country $ 1-11 TourType $ 13-24 Nights LandCost Vendor $; run; proc print data=mylib.arch_or_scen2; title 'Data Set MYLIB.ARCH_OR_SCEN2'; run;
The following output shows that this data set contains a duplicate observation for Switzerland:
Data Set MYLIB.ARCH_OR_SCEN2 1 Land Obs Country TourType Nights Cost Vendor 1 Spain architecture 10 510 World 2 Japan architecture 8 720 Express 3 Switzerland scenery 9 734 World 4 France architecture 8 575 World 5 Switzerland scenery 9 734 World 6 Ireland scenery 7 558 Express 7 New Zealand scenery 16 1489 Southsea 8 Italy architecture 8 468 Express 9 Greece scenery 12 698 Express
The following program uses the NODUPRECS option in the SORT procedure to delete duplicate observations. The program creates a new data set called FIXED.
options pagesize=60 linesize=80 pageno=1 nodate; proc sort data=mylib.arch_or_scen out=fixed noduprecs; by Country; run; proc print data=fixed; title 'Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed'; run;
The following output displays messages that appear in the SAS log:
Partial SAS Log Indicating Duplicate Observations Deleted
311 options pagesize=60 linesize=80 pageno=1 nodate; 312 proc sort data=mylib.arch_or_scen out=fixed noduprecs; 313 by Country; 314 run; NOTE: 1 duplicate observations were deleted. NOTE: There were 9 observations read from the data set MYLIB.ARCH_OR_SCEN. NOTE: The data set WORK.FIXED has 8 observations and 5 variables. 315 316 proc print data=fixed; 317 title 'Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed'; 318 run; NOTE: There were 8 observations read from the data set WORK.FIXED.
The following output shows the results of the NODUPRECS option:
Data Set FIXED with No Duplicate Observations
Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed 1 Land Obs Country TourType Nights Cost Vendor 1 France architecture 8 575 World 2 Greece scenery 12 698 Express 3 Ireland scenery 7 558 Express 4 Italy architecture 8 468 Express 5 Japan architecture 8 720 Express 6 New Zealand scenery 16 1489 Southsea 7 Spain architecture 10 510 World 8 Switzerland scenery 9 734 World
Understanding Collating Sequences |
Both numeric and character variables can be sorted into ascending or descending order. For numeric variables, ascending or descending order is easy to understand, but what about the order of characters? Character values include uppercase and lowercase letters, special characters, and the digits 0 through 9 when they are treated as characters rather than as numbers. How does SAS sort these characters?
The order in which characters sort is called a collating sequence. By default, SAS sorts characters in one of two sequences: EBCDIC or ASCII, depending on the operating environment under which SAS is running. For reference, both sequences are displayed here.
As long as you work under a single operating system, you seldom need to think about the details of collating sequences. However, when you transfer files from an operating system using EBCDIC to an operating system using ASCII or vice versa, character values that are sorted on one operating system are not necessarily in the correct order for the other operating system. The simplest solution to the problem is to re-sort character data (not numeric data) on the destination operating system. For detailed information about collating sequences, see the documentation for your operating environment.
The following operating systems use the ASCII collating sequence:
From the smallest to the largest displayable character, the English-language ASCII sequence is
The main features of the ASCII sequence are that digits are smaller than uppercase letters and uppercase letters are smaller than lowercase ones. The blank is the smallest displayable character, followed by the other types of characters:
blank < digits < uppercase letters < lowercase letters
The following operating systems use the EBCDIC collating sequence:
From the smallest to largest displayable character, the English-language EBCDIC sequence is
The main features of the EBCDIC sequence are that lowercase letters are smaller than uppercase letters and uppercase letters are smaller than digits. The blank is the smallest displayable character, followed by the other types of characters:
blank < lowercase letters < uppercase letters < digits
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.