Counting Duplicate Rows in a Table

Problem

You want to count the number of duplicate rows in a table and generate an output column that shows how many times each row occurs.

Background Information

There is one input table, called DUPLICATES, that contains the following data:
data Duplicates;
   input LastName $ FirstName $ City $ State $;
   datalines;
Smith John Richmond Virginia
Johnson Mary Miami Florida
Smith John Richmond Virginia
Reed Sam Portland Oregon
Davis Karen Chicago Illinois
Davis Karen Chicago Illinois
Thompson Jennifer Houston Texas
Smith John Richmond Virginia
Johnson Mary Miami Florida
;
proc print data=Duplicates;
  title 'Sample Data for Counting Duplicates';
run;
Sample Input Table for Counting Duplicates
Sample Data for Counting Duplicates

Solution

Use the following PROC SQL code to count the duplicate rows:
proc sql;
   title 'Duplicate Rows in DUPLICATES Table';
   select *, count(*) as Count
      from Duplicates
      group by LastName, FirstName, City, State
      having count(*) > 1;
     
PROC SQL Output for Counting Duplicates
Duplicate Rows in DUPLICATES Table

How It Works

This solution uses a query that performs the following:
  • selects all columns
  • counts all rows
  • groups all of the rows in the Duplicates table by matching rows
  • excludes the rows that have no duplicates
Note: You must include all of the columns in your table in the GROUP BY clause to find exact duplicates.