Working with Grouped or Sorted Observations

Working with Sorted Data

By default, groups appear in ascending order of the BY values. In some cases you want to emphasize the order in which the observations are sorted, not the fact that they can be grouped. For example, you may want to alphabetize the tours by country.

To sort your data in a particular order, use the SORT procedure just as you do for grouped data. When the sorted order is more important than the grouping, you usually want only one observation with a given BY value in the resulting data set. Therefore, you may need to remove duplicate observations.

Operating Environment Information: The SORT procedure accesses either a sorting utility that is supplied as part of SAS, or a sorting utility supplied by the host operating system. All examples in this documentation use the SAS sorting utility. Some operating system utilities do not accept particular options, including the NODUPRECS option described later in this section. The default sorting utility is set by your site. For more information about the utilities available to you, see the documentation for your operating system. [cautionend]

Sorting Data

The following example sorts data set MYLIB.ARCH_OR_SCEN by COUNTRY:

options pagesize=60 linesize=80 pageno=1 nodate;
proc sort data=mylib.arch_or_scen out=bycountry;
   by Country;
run;

proc print data=bycountry;
   title 'Tours in Alphabetical Order by Country';
run;

The following output displays the results:

Sorting Data

                     Tours in Alphabetical Order by Country                    1

                                                        Land
        Obs    Country        TourType        Nights    Cost    Vendor

         1     France         architecture       8       575    World   
         2     Greece         scenery           12       698    Express 
         3     Ireland        scenery            7       558    Express 
         4     Italy          architecture       8       468    Express 
         5     Japan          architecture       8       720    Express 
         6     New Zealand    scenery           16      1489    Southsea
         7     Spain          architecture      10       510    World   
         8     Switzerland    scenery            9       734    World

Deleting Duplicate Observations

You can eliminate duplicate observations in a SAS data set by using the NODUPRECS option with the SORT procedure. The following programs show you how to create a SAS data set and then remove duplicate observations.

The external file shown below contains a duplicate observation for Switzerland:

Spain       architecture  10  510 World
Japan       architecture   8  720 Express
Switzerland scenery        9  734 World
France      architecture   8  575 World
Switzerland scenery        9  734 World
Ireland     scenery        7  558 Express
New Zealand scenery       16 1489 Southsea
Italy       architecture   8  468 Express
Greece      scenery       12  698 Express

The following DATA step creates a permanent SAS data set named MYLIB.ARCH_OR_SCEN2.

options pagesize=60 linesize=80 pageno=1 nodate;

libname mylib 'SAS-data-library';
data mylib.arch_or_scen2;
   infile 'input-file';
   input Country $ 1-11 TourType $ 13-24 Nights LandCost Vendor $;
run;

proc print data=mylib.arch_or_scen2;
   title 'Data Set MYLIB.ARCH_OR_SCEN2';
run;

The following output shows that this data set contains a duplicate observation for Switzerland:

Data Set MYLIB.ARCH_OR_SCEN2

                          Data Set MYLIB.ARCH_OR_SCEN2                         1

                                                        Land
        Obs    Country        TourType        Nights    Cost    Vendor

         1     Spain          architecture      10       510    World   
         2     Japan          architecture       8       720    Express 
         3     Switzerland    scenery            9       734    World   
         4     France         architecture       8       575    World   
         5     Switzerland    scenery            9       734    World   
         6     Ireland        scenery            7       558    Express 
         7     New Zealand    scenery           16      1489    Southsea
         8     Italy          architecture       8       468    Express 
         9     Greece         scenery           12       698    Express

The following program uses the NODUPRECS option in the SORT procedure to delete duplicate observations. The program creates a new data set called FIXED.

options pagesize=60 linesize=80 pageno=1 nodate;
proc sort data=mylib.arch_or_scen out=fixed noduprecs;
    by Country;
run;

proc print data=fixed;
   title 'Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed';
run;

The following output displays messages that appear in the SAS log:

Partial SAS Log Indicating Duplicate Observations Deleted

311  options pagesize=60 linesize=80 pageno=1 nodate;
312  proc sort data=mylib.arch_or_scen out=fixed noduprecs;
313      by Country;
314  run;
NOTE: 1 duplicate observations were deleted.
NOTE: There were 9 observations read from the data set MYLIB.ARCH_OR_SCEN.
NOTE: The data set WORK.FIXED has 8 observations and 5 variables.
315  
316  proc print data=fixed;
317     title 'Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed';
318  run;
NOTE: There were 8 observations read from the data set WORK.FIXED.

The following output shows the results of the NODUPRECS option:

Data Set FIXED with No Duplicate Observations

          Data Set FIXED: MYLIB.ARCH_OR_SCEN2 With Duplicates Removed          1

                                                        Land
        Obs    Country        TourType        Nights    Cost    Vendor

         1     France         architecture       8       575    World   
         2     Greece         scenery           12       698    Express 
         3     Ireland        scenery            7       558    Express 
         4     Italy          architecture       8       468    Express 
         5     Japan          architecture       8       720    Express 
         6     New Zealand    scenery           16      1489    Southsea
         7     Spain          architecture      10       510    World   
         8     Switzerland    scenery            9       734    World

Understanding Collating Sequences

Both numeric and character variables can be sorted into ascending or descending order. For numeric variables, ascending or descending order is easy to understand, but what about the order of characters? Character values include uppercase and lowercase letters, special characters, and the digits 0 through 9 when they are treated as characters rather than as numbers. How does SAS sort these characters?

The order in which characters sort is called a collating sequence. By default, SAS sorts characters in one of two sequences: EBCDIC or ASCII, depending on the operating environment under which SAS is running. For reference, both sequences are displayed here.

As long as you work under a single operating system, you seldom need to think about the details of collating sequences. However, when you transfer files from an operating system using EBCDIC to an operating system using ASCII or vice versa, character values that are sorted on one operating system are not necessarily in the correct order for the other operating system. The simplest solution to the problem is to re-sort character data (not numeric data) on the destination operating system. For detailed information about collating sequences, see the documentation for your operating environment.

ASCII Collating Sequence

The following operating systems use the ASCII collating sequence:

	Macintosh
	MS-DOS
	OpenVMS
	OS/2
	PC DOS
	UNIX and its derivatives
	Windows

From the smallest to the largest displayable character, the English-language ASCII sequence is

	blank!"#$%&'()*+,- ./0123456789:;<=>?@
	ABCDEFGHIJKLMNOPQRSTUVWXYZ [ \ ] ˆ_
	abcdefghijklmnopqrstuvwxyz{}~

The main features of the ASCII sequence are that digits are smaller than uppercase letters and uppercase letters are smaller than lowercase ones. The blank is the smallest displayable character, followed by the other types of characters:

blank < digits < uppercase letters < lowercase letters

EBCDIC Collating Sequence

The following operating systems use the EBCDIC collating sequence:

	CMS
	z/OS

From the smallest to largest displayable character, the English-language EBCDIC sequence is

	blank.<(+\|&!$*);¬ - /,%_>?:#@'="
	abcdefghijklmnopqr~stuvwxyz
	{ABCDEFGHI}JKLMNOPQR\ STUVWXYZ
	0123456789

The main features of the EBCDIC sequence are that lowercase letters are smaller than uppercase letters and uppercase letters are smaller than digits. The blank is the smallest displayable character, followed by the other types of characters:

blank < lowercase letters < uppercase letters < digits

Top of Page