Usage Note 37581: How can I eliminate duplicate observations from a large data set without sorting?
Using a CLASS statement in PROC SUMMARY does not require the data set to
be sorted in advance. With the NWAY option, the output data set contains
one observation for each unique combination of CLASS variable values, so
duplicate observations are collapsed into a single observation. The
_FREQ_ variable in the output data set shows the number of input
observations that have that combination of CLASS variable values. The
PROC PRINT step at the end of the example displays the resulting data
set.
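/* Example */
/* Create a data set with duplicate observations */
data test;
   input x y z;
   cards;
1 1 1
1 1 1
1 2 1
1 2 2
2 2 2
2 2 2
2 2 2
2 2 1
;
run;

/* Collapse duplicates without sorting. NWAY keeps only the rows */
/* for the full x*y*z combination of CLASS variable values.      */
proc summary data=test nway;
   class x y z;
   /* Keep _FREQ_ (the duplicate count) and drop _TYPE_ */
   output out=test1(drop=_type_);
run;

/* Display the de-duplicated data set */
proc print data=test1;
run;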
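The example above can be varied in two ways that may matter in practice. By default, PROC SUMMARY excludes observations that have a missing value for any CLASS variable; the MISSING option keeps them as their own combination. Dropping _FREQ_ as well as _TYPE_ leaves only the original variables. The following is a minimal sketch that assumes the TEST data set created above; the output name TEST_NODUP is hypothetical.

```sas
/* Sketch: variation on the example above.                      */
/* MISSING keeps observations whose CLASS values are missing.   */
/* TEST_NODUP is a hypothetical output data set name.           */
proc summary data=test nway missing;
   class x y z;
   /* Drop both automatic variables so only x, y, z remain */
   output out=test_nodup(drop=_type_ _freq_);
run;

proc print data=test_nodup;
run;
```

Keep _FREQ_ instead if you need to know how many times each combination occurred in the input.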
Operating System and Release Information
Product: SAS System, Base SAS
Systems:
z/OS
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 7
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows Server 2008
Microsoft Windows XP Professional
Windows Millennium Edition (Me)
Windows Vista
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
Results
Obs    x    y    z    _FREQ_
  1    1    1    1       2
  2    1    2    1       1
  3    1    2    2       1
  4    2    2    1       1
  5    2    2    2       3
PROC SORT with the NODUP or NODUPKEY option provides a way to eliminate duplicate observations from a data set. However, with a very large data set, PROC SORT may not be an efficient use of resources. This sample provides an alternative using PROC SUMMARY.
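For comparison, the sorting-based approach mentioned above could look like the following sketch. It assumes a data set named TEST with variables x, y, and z, as in the example; the output name TEST_SORTED is hypothetical. NODUPKEY keeps the first observation in each BY group, so listing every variable in the BY statement removes exact duplicates.

```sas
/* Sketch: sorting-based de-duplication, shown for comparison.  */
/* TEST_SORTED is a hypothetical output data set name.          */
proc sort data=test out=test_sorted nodupkey;
   by x y z;   /* all variables listed, so only exact duplicates are removed */
run;

proc print data=test_sorted;
run;
```

Unlike the PROC SUMMARY approach, this step sorts the whole data set and does not report how many duplicates each combination had.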
Date Modified: 2009-10-26 14:25:20
Date Created: 2009-10-26 08:30:04