October 17, 2010

Find and Delete Duplicate Rows From Tables without Primary Key

Someone recently asked me how to find and delete duplicate rows from sample tables, which do not have a Primary Key. I have written one such sample on MSDN code Find and/or Delete Duplicate Rows which I will share here


-- Suppress data loading messages

-- Create Table
CREATE TABLE #Customers (ID integer, CustName varchar(20), Pincode int)

-- Load Sample Data in Table
INSERT INTO #Customers VALUES (1, 'Jack',45454 )
INSERT INTO #Customers VALUES (2, 'Jill', 43453)
INSERT INTO #Customers VALUES (3, 'Tom', 43453)
INSERT INTO #Customers VALUES (4, 'Kathy', 22343)
INSERT INTO #Customers VALUES (5, 'David', 65443)
INSERT INTO #Customers VALUES (6, 'Kathy', 22343)
INSERT INTO #Customers VALUES (7, 'Kim', 65443)
INSERT INTO #Customers VALUES (8, 'Hoggart', 33443)
INSERT INTO #Customers VALUES (9, 'Kate', 61143)
INSERT INTO #Customers VALUES (10, 'Kim', 65443)

To indentify duplicate rows, the trick is to Group the rows by CustName, Pincode. Rows having similar CustName and Pincode will have more than one rows in the grouping. So locate them using HAVING COUNT(*) > 1. If duplicate values are encountered, return the maximum ID for each duplicate row. Using the outer query, delete any ID returned by subquery.

-- Find Duplicate Rows
SELECT MAX(ID) as ID, CustName, Pincode FROM #Customers
GROUP BY CustName, Pincode

-- Delete Duplicate Rows
DELETE FROM #Customers
( SELECT MAX(ID) FROM #Customers
GROUP BY CustName, Pincode
SELECT * FROM #Customers



About The Author

Suprotim Agarwal, ASP.NET Architecture MVP works as an Architect Consultant and provides consultancy on how to design and develop Web applications.

Suprotim is also the founder and primary contributor to DevCurry, DotNetCurry and SQLServerCurry. He has also written an EBook 51 Recipes using jQuery with ASP.NET Controls.

Follow him on twitter @suprotimagarwal


Robert Seso said...

I'd change ...WHERE ID IN (...) to ...WHERE ID NOT IN (...), that way you'd get rid of all the duplicates, regardless how many there are.

Federico Emi said...

Robert adding NOT IN will delete the rows that are 'not duplicate' and that's exactly the opposite of what the author describes.

If you are suggesting an alternate query, please elaborate since I get the desired results using the query shown in this article

Anonymous said...

This won't work if there is more than one duplicate as it only deletes the duplicate with the largest ID field.

darlove said...

To satisfy the requirement that all dups are deleted, no matter how many there are, use the row_number() window function. Number the rows within the same group according to your id field in an ascending order, and then delete all rows that have the generated row_number() > 1. This way you'll be left with no dups and it's all in one go.