Explore the different methods for filtering out and permanently removing duplicate rows using SQL. Learn the practical applications of how to remove duplicates in SQL Server, MySQL, and PostgreSQL.
Duplicate records are a common issue that can compromise data integrity and database performance. Removing these duplicates is essential for maintaining data accuracy, optimizing storage, and improving query performance. In this article, we will explore several techniques for removing duplicate rows in SQL, tailored to different use cases and database management systems.
Understanding duplicate rows in SQL
Duplicate rows in SQL refer to records within a table that contain identical values across all or selected columns. The common causes of duplicate rows in SQL include the following:
- Missing primary keys: When tables lack a defined primary key or unique constraint, there is no mechanism to prevent the insertion of duplicate data. This can happen when a table is not normalized and/or there are transitive dependency issues.
- Data integration issues: When merging datasets from different sources, improper joins or inconsistencies in data formats can accidentally introduce duplicates.
- Manual data entry errors: Human error, such as entering the same record multiple times, is another common cause of duplicate rows.
In the rest of the article, we will look at how to remove duplicates in SQL, and we will divide the article into two blocks. In the first section, we will cover how to remove duplicates in the data that you are retrieving for a report or dashboard; in the second section, we will look at how to remove duplicates in the database.
Methods for removing duplicates in the data you retrieve
There are different methods of removing duplicates while retrieving records in SQL, and the exact syntax can depend on the DBMS, such as SQL Server, MySQL, or PostgreSQL. In this section, we will look at these methods while highlighting any special considerations for each database. Keep in mind that the SELECT-based queries (DISTINCT and ROW_NUMBER() with a CTE) only filter the result set and do not modify the underlying table, whereas the GROUP BY and self-join examples use DELETE and therefore do change the table.
Using DISTINCT keyword
The DISTINCT keyword is used in a SELECT statement to retrieve unique rows. The DISTINCT keyword syntax for removing duplicates is similar for MySQL, PostgreSQL, and SQL Server databases. The query below will retrieve unique customer names from the customers table.
SELECT DISTINCT Name
FROM customers;
Code explanation
The code snippet is a SQL query that retrieves unique values from the "Name" column in the "customers" table. Here's a breakdown:
- SELECT DISTINCT Name: This part selects unique names, meaning it removes any duplicate entries from the results.
- FROM customers: This specifies the table from which to retrieve the data, in this case, the "customers" table.
The query aims to provide a list of distinct customer names without any repetitions.
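To make the "filter, not delete" behavior concrete, here is a minimal sketch that runs the same DISTINCT query against a small in-memory table. It assumes Python's built-in sqlite3 module as a stand-in for the databases discussed in this article, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely for illustration (an assumption;
# the article itself targets SQL Server, MySQL, and PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO customers (ID, Name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Bob")],
)

# DISTINCT collapses repeated names in the result set only.
distinct_names = [
    row[0]
    for row in conn.execute("SELECT DISTINCT Name FROM customers ORDER BY Name")
]

# The table itself is untouched by the SELECT.
total_rows = conn.execute("SELECT COUNT(*) FROM customers").fetchall()[0][0]

print(distinct_names)
print(total_rows)
```

The result lists each name once, while the row count shows that all five original rows are still stored.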
Using GROUP BY with aggregate functions
The GROUP BY clause, combined with aggregate functions like MAX(), MIN(), or COUNT(), can help remove duplicate records from tables. The GROUP BY clause helps select specific records to retain while deleting other duplicates. Unlike DISTINCT, the example below uses DELETE, so it changes the table itself rather than only the query result.
Suppose you want to delete duplicate customer records but keep the one with the highest ID. You will use the GROUP BY clause with the MAX() function, as shown below.
-- Delete duplicate rows from the 'customers' table (aliased as c1)
DELETE c1
FROM customers c1
-- Find the maximum ID for each unique Name
JOIN (
SELECT Name, MAX(ID) AS MaxID
FROM customers
GROUP BY Name
) c2
-- Match rows based on 'Name' and keep the row with the maximum ID
ON c1.Name = c2.Name
AND c1.ID < c2.MaxID;
Code explanation
This SQL code snippet is designed to remove duplicate rows from the 'customers' table based on the 'Name' column, keeping only the row with the highest 'ID' for each unique 'Name'. Here's how it works:
- DELETE statement: The DELETE c1 statement specifies that rows will be deleted from the 'customers' table, which is aliased as c1.
- JOIN subquery: The subquery (SELECT Name, MAX(ID) AS MaxID FROM customers GROUP BY Name) c2 creates a temporary result set (c2) that contains each unique 'Name' and the maximum 'ID' associated with that 'Name'.
- JOIN condition: The ON c1.Name = c2.Name AND c1.ID < c2.MaxID condition ensures that the deletion only affects rows where the 'Name' matches and the 'ID' is less than the maximum 'ID' for that 'Name'. This effectively keeps the row with the highest 'ID' for each 'Name' and deletes the others.
In summary, this code removes duplicate entries in the 'customers' table by retaining only the entry with the highest 'ID' for each unique 'Name'.
MySQL and SQL Server support the above DELETE ... JOIN syntax with GROUP BY and aggregate functions. PostgreSQL does not accept a JOIN in this position and uses DELETE ... USING instead.
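Since this part of the article is about shaping query results, a read-only use of the same idea may also be helpful: GROUP BY with MAX() can return one row per name without deleting anything. The sketch below assumes Python's sqlite3 module and hypothetical sample data; the column aliases MaxID and Copies are illustrative.

```python
import sqlite3

# SQLite is an assumption for the demo; the query itself is plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO customers (ID, Name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Bob")],
)

# One row per Name: the highest ID in the group and how many copies exist.
latest_per_name = conn.execute(
    """
    SELECT Name, MAX(ID) AS MaxID, COUNT(*) AS Copies
    FROM customers
    GROUP BY Name
    ORDER BY Name
    """
).fetchall()

for name, max_id, copies in latest_per_name:
    print(name, max_id, copies)
```

Running a query like this first is one way to see which IDs the DELETE version would keep before committing to it.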
Using ROW_NUMBER() with Common Table Expressions (CTE)
With the ROW_NUMBER() function combined with a Common Table Expression (CTE), you can filter out duplicates based on your criteria. The ROW_NUMBER() function, when used with PARTITION BY and ORDER BY clauses, assigns a sequential number to each row, restarting at 1 within each partition. This makes it possible to keep one chosen row per group and filter out the rest.
The following query identifies duplicates and returns only the first occurrence of each name, leaving the table unchanged.
-- Common Table Expression (CTE) to rank rows based on 'Name'
WITH CTE AS (
SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
FROM customers
)
-- Select only the unique records where RowNum = 1
SELECT ID, Name
FROM CTE
WHERE RowNum = 1;
Code explanation
This SQL code snippet uses a Common Table Expression (CTE) to help filter unique records based on the 'Name' column in the 'customers' table.
- CTE definition: The WITH CTE AS clause defines a temporary result set named CTE. Inside the CTE, the ROW_NUMBER() function is used to assign a unique sequential integer to rows within each partition of the 'Name' column. The PARTITION BY Name clause groups rows with the same 'Name', and ORDER BY ID ASC orders them by 'ID' within each group.
- Row filtering: The main query selects from the CTE and filters rows where RowNum = 1. This effectively selects the first occurrence of each 'Name', ensuring that only unique records based on 'Name' are returned.
The overall goal of this code is to retrieve unique records from the 'customers' table, ensuring that each 'Name' appears only once, based on the lowest 'ID' value.
This method works in SQL Server, PostgreSQL, and MySQL 8.0 or later. It is useful for larger datasets or more complex conditions, as it allows you to specify exactly which duplicate to keep.
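As a rough, runnable illustration of the ranking step, the following sketch executes the same CTE through Python's sqlite3 module. SQLite is an assumption here (window functions need SQLite 3.25 or later), and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO customers (ID, Name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Bob")],
)

# Rank rows within each Name by ascending ID, then keep rank 1 only.
first_occurrences = conn.execute(
    """
    WITH CTE AS (
        SELECT ID, Name,
               ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
        FROM customers
    )
    SELECT ID, Name
    FROM CTE
    WHERE RowNum = 1
    ORDER BY ID
    """
).fetchall()

print(first_occurrences)
```

Changing the ORDER BY inside the OVER clause (for example to ID DESC) is what controls which duplicate in each group is kept.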
Removing duplicates using self-JOIN
A self-join allows you to compare a table to itself, making it helpful in identifying and removing duplicate rows by comparing records based on specific criteria. The following example uses a self-join to delete every row for which another row with the same name and a lower ID exists, keeping only the first occurrence of each name. Like the GROUP BY example, it modifies the table.
-- Delete duplicate rows using self-join
DELETE c1
FROM customers c1
JOIN customers c2
ON c1.Name = c2.Name AND c1.ID > c2.ID;
Code explanation
This SQL snippet is designed to remove duplicate rows from the customers table based on the Name column. Here's how it works:
- Self-Join: The customers table is joined with itself (c1 and c2 are aliases for the same table). This allows comparison between two rows in the same table.
- Condition: The ON clause specifies that the join should occur where c1.Name is equal to c2.Name, meaning it identifies rows with duplicate names.
- ID comparison: The condition c1.ID > c2.ID ensures that only one of the duplicate rows is retained. It keeps the row with the smaller ID and deletes the one with the larger ID.
- DELETE statement: The DELETE c1 command removes the duplicate row identified by the self-join condition.
Overall, this code aims to clean up the customers table by removing duplicate entries based on the Name column, keeping the row with the smallest ID for each duplicate set.
The DELETE ... JOIN syntax above works in SQL Server and MySQL. PostgreSQL requires a different form, such as DELETE ... USING, to express the same self-join.
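Because the DELETE ... JOIN form is dialect-specific, one alternative is to express the same self-comparison as a correlated EXISTS subquery. The sketch below is an illustration under assumptions rather than the article's method: it uses Python's sqlite3 module with hypothetical data, and MySQL in particular may reject this form because the subquery reads from the table being deleted from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO customers (ID, Name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Bob")],
)

# Delete a row when another row with the same Name and a smaller ID exists.
# This mirrors the self-join condition c1.Name = c2.Name AND c1.ID > c2.ID.
conn.execute(
    """
    DELETE FROM customers
    WHERE EXISTS (
        SELECT 1
        FROM customers AS c2
        WHERE c2.Name = customers.Name
          AND c2.ID < customers.ID
    )
    """
)
conn.commit()

remaining_after_exists = conn.execute(
    "SELECT ID, Name FROM customers ORDER BY ID"
).fetchall()
print(remaining_after_exists)
```

The row with the smallest ID in each group never matches the EXISTS condition, so it is the one that survives.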
Methods for removing duplicates in the database
While you can remove duplicate records using queries, you can also permanently delete them from the database. This approach is important for maintaining data quality. The following methods are used to remove duplicates from the database.
Using ROW_NUMBER() and DELETE
The ROW_NUMBER() function assigns a sequential number to rows within a defined partition. When used with the DELETE statement, it helps identify duplicates by ranking rows based on specific columns and removing unwanted records. This method applies to modern versions of MySQL (from 8.0), PostgreSQL, and SQL Server.
Suppose you want to remove duplicate customer records based on the Name column, keeping only the first occurrence (smallest ID):
-- Common Table Expression (CTE) to rank rows based on 'Name'
WITH CTE AS (
SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
FROM customers
)
-- Delete rows from the 'customers' table where the row number is greater than 1
DELETE FROM customers
WHERE ID IN (SELECT ID FROM CTE WHERE RowNum > 1);
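Before running a destructive statement like this, it can help to preview the rows it would remove. The following sketch reuses one CTE for both a preview SELECT and the DELETE. It assumes Python's sqlite3 module (SQLite 3.25 or later for window functions) and hypothetical sample rows; the RANKED_CTE constant is just an illustrative helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO customers (ID, Name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Bob")],
)

# Shared CTE text (hypothetical helper): ranks rows within each Name by ID.
RANKED_CTE = """
WITH CTE AS (
    SELECT ID, Name,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
    FROM customers
)
"""

# Step 1: preview the rows that rank 2 or higher within their group.
to_delete = conn.execute(
    RANKED_CTE + "SELECT ID, Name FROM CTE WHERE RowNum > 1 ORDER BY ID"
).fetchall()

# Step 2: run the DELETE driven by the same ranking.
count_before = conn.execute("SELECT COUNT(*) FROM customers").fetchall()[0][0]
conn.execute(
    RANKED_CTE
    + "DELETE FROM customers WHERE ID IN (SELECT ID FROM CTE WHERE RowNum > 1)"
)
conn.commit()
count_after = conn.execute("SELECT COUNT(*) FROM customers").fetchall()[0][0]

deleted_count = count_before - count_after
remaining_rows = conn.execute(
    "SELECT ID, Name FROM customers ORDER BY ID"
).fetchall()
print(to_delete, deleted_count, remaining_rows)
```

Comparing the preview with the number of rows actually deleted is a simple sanity check that the statement did what was intended.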
Using DELETE with subquery
Sometimes, a simple DELETE operation using a subquery can remove duplicates from the database. This method is suitable for older database versions where ROW_NUMBER() is not available, such as MySQL before 8.0. Note that MySQL does not allow a subquery to read directly from the table being deleted from, so there the subquery typically needs to be wrapped in a derived table.
The query below deletes rows from the customers table where the ID is not the minimum for each Name, keeping only the row with the smallest ID for each unique Name.
-- Delete rows from the 'customers' table
DELETE FROM customers
WHERE ID NOT IN (
-- Subquery to find the minimum ID for each unique Name
SELECT MIN(ID)
FROM customers
GROUP BY Name
);
Code explanation
This SQL code snippet deletes duplicate rows from the 'customers' table, keeping only the row with the smallest ID for each unique Name. Here's how it works:
- The DELETE FROM customers statement removes rows from the 'customers' table.
- The WHERE ID NOT IN condition specifies which rows to delete. It targets rows whose IDs are not in the result of the subquery.
- The subquery SELECT MIN(ID) FROM customers GROUP BY Name finds the smallest ID for each unique Name in the table. This ensures that only one row per Name, the one with the smallest ID, is retained.
- As a result, any row with an ID not matching these minimum IDs is deleted, effectively removing duplicates while preserving one instance of each Name.
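A variant that wraps the grouping subquery in a derived table is sketched below. This wrapper is a commonly used way to satisfy MySQL's restriction on reading from the table being deleted from. The demo itself runs on SQLite through Python's sqlite3 module, which is an assumption made so the example is self-contained; the alias keep_rows and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO customers (ID, Name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Bob")],
)

# Record which names exist so we can confirm none disappears entirely.
names_before = {row[0] for row in conn.execute("SELECT Name FROM customers")}

# The inner query finds the smallest ID per Name; the outer SELECT over the
# derived table (keep_rows) is the extra layer added for MySQL's benefit.
conn.execute(
    """
    DELETE FROM customers
    WHERE ID NOT IN (
        SELECT MinID
        FROM (
            SELECT MIN(ID) AS MinID
            FROM customers
            GROUP BY Name
        ) AS keep_rows
    )
    """
)
conn.commit()

kept_rows = conn.execute("SELECT ID, Name FROM customers ORDER BY ID").fetchall()
names_after = {name for _, name in kept_rows}
print(kept_rows)
```

Checking that the set of names is the same before and after confirms that only extra copies were removed.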
Using GROUP BY with HAVING clause
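When you need to check for duplicate values in specific columns, the GROUP BY clause combined with the HAVING clause can be used to identify duplicates. This method allows you to delete specific rows based on the given criteria. GROUP BY and HAVING are available in SQL Server, MySQL, and PostgreSQL, although MySQL may require wrapping the subquery in a derived table because it reads from the table being modified.
The following query deletes rows from the customers table where the ID value occurs more than once. Note that it removes every row in such a group, not just the extra copies, and it matches nothing if ID is already unique.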
-- Delete rows from the 'customers' table where there are duplicates
DELETE FROM customers
WHERE ID IN (
-- Subquery to find IDs of duplicate rows
SELECT ID
FROM customers
GROUP BY ID
HAVING COUNT(*) > 1
);
Code explanation
This SQL code snippet is designed to delete duplicate rows from the 'customers' table based on the 'ID' column. Here's how it works:
- DELETE statement: The main operation is a DELETE statement, which removes rows from the 'customers' table.
- WHERE clause with subquery: The WHERE clause specifies which rows to delete. It uses a subquery to identify these rows.
- Subquery: The subquery selects IDs from the 'customers' table. It groups the rows by 'ID' using GROUP BY ID.
- HAVING clause: The HAVING COUNT(*) > 1 condition filters the grouped results to find IDs that appear more than once, indicating duplicates.
- IN operator: The IN operator in the main query checks if the 'ID' of a row is in the list of duplicate IDs returned by the subquery. If it is, that row will be deleted.
The code removes all rows whose ID appears more than once in the 'customers' table, including the first copy of each, so only rows whose ID was unique to begin with remain.
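GROUP BY with HAVING is often most useful as a read-only audit step, for example to list which names are duplicated and which ID would be kept. The sketch below shows that idea on a non-key column. It assumes Python's sqlite3 module and hypothetical data, and the aliases Copies and KeepID are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO customers (ID, Name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Bob")],
)

# List only the names that occur more than once, with the number of copies
# and the smallest ID in each group (the row a cleanup would usually keep).
duplicate_groups = conn.execute(
    """
    SELECT Name, COUNT(*) AS Copies, MIN(ID) AS KeepID
    FROM customers
    GROUP BY Name
    HAVING COUNT(*) > 1
    ORDER BY Name
    """
).fetchall()

print(duplicate_groups)
```

Names that appear only once, such as Carol in this sample, are filtered out by the HAVING clause.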
Using temporary tables for batch processing
Temporary tables can be useful for batch processing and removing duplicates in large datasets. This method is useful where single queries can cause performance issues. The following query creates a temporary table to store the minimum customer_id for each customer_name and deletes rows from the customers table where the customer_id is not in the temp_customers table. This example assumes the columns are named customer_id and customer_name rather than ID and Name.
-- Create a temporary table
CREATE TEMPORARY TABLE temp_customers AS
SELECT MIN(customer_id) AS ID, customer_name
FROM customers
GROUP BY customer_name;
DELETE FROM customers
WHERE customer_id NOT IN (SELECT ID FROM temp_customers);
Code explanation
This SQL code snippet is designed to remove duplicate entries from the customers table based on the customer_name field, keeping only the entry with the smallest customer_id for each name.
- Create a temporary table:
CREATE TEMPORARY TABLE temp_customers AS SELECT MIN(customer_id) AS ID, customer_name FROM customers GROUP BY customer_name;
- This line creates a temporary table named temp_customers. It selects the smallest customer_id for each customer_name from the customers table, effectively identifying the unique entries to keep.
- Delete duplicates:
DELETE FROM customers WHERE customer_id NOT IN (SELECT ID FROM temp_customers);
- This line deletes rows from the customers table where the customer_id is not in the list of IDs from the temp_customers table. This means it removes all duplicate entries, leaving only the ones with the smallest customer_id for each customer_name.
Overall, the code aims to clean up the customers table by removing duplicate entries based on customer_name, keeping only the entry with the smallest customer_id.
The above syntax using CREATE TEMPORARY TABLE is supported in MySQL and PostgreSQL, but not in SQL Server.
Remove duplicates in SQL Server
SQL Server offers different methods of removing duplicate records from the database. These methods include using DISTINCT with INTO, ROW_NUMBER(), and temporary tables.
Using DISTINCT with INTO
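You can use the DISTINCT keyword in a SELECT ... INTO statement to create a new table with unique records. You can drop the old table once you verify the new table has the expected records. Because DISTINCT * compares every column, this approach only removes rows that are identical in all columns, and SELECT ... INTO does not copy indexes or constraints from the original table, so those need to be recreated. The following example creates the unique_customers table with unique records from the customers table.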
-- Select distinct rows from 'customers' and create a new table 'unique_customers'
SELECT DISTINCT *
INTO unique_customers
FROM customers;
-- Drop the original 'customers' table to remove it from the database
DROP TABLE customers;
-- Rename the 'unique_customers' table to 'customers' to replace the original table
EXEC sp_rename 'unique_customers', 'customers';
Code explanation
This SQL code snippet is performing a series of operations on a database table named 'customers':
- Select distinct rows: The SELECT DISTINCT * INTO unique_customers FROM customers; statement selects all unique rows from the 'customers' table and creates a new table called 'unique_customers' with these distinct rows. This helps in removing any duplicate entries from the original table.
- Drop original table: The DROP TABLE customers; statement deletes the original 'customers' table from the database. This is done to remove the old table that might have duplicate entries.
- Rename new table: The EXEC sp_rename 'unique_customers', 'customers'; statement renames the 'unique_customers' table back to 'customers'. This step effectively replaces the original table with the new, deduplicated version.
Overall, the code aims to remove duplicate rows from the 'customers' table by creating a new table with distinct entries, deleting the old table, and renaming the new table to maintain the original table name.
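The copy, verify, and swap sequence can be sketched outside SQL Server as well. The example below is an approximation under assumptions: it uses Python's sqlite3 module, where CREATE TABLE ... AS SELECT stands in for SELECT ... INTO and ALTER TABLE ... RENAME TO stands in for sp_rename. The table has no key column so that fully identical rows can exist, and the sample data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (Name TEXT, Email TEXT)")
conn.executemany(
    "INSERT INTO customers (Name, Email) VALUES (?, ?)",
    [
        ("Alice", "alice@example.com"),
        ("Alice", "alice@example.com"),  # exact duplicate of the row above
        ("Bob", "bob@example.com"),
    ],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchall()[0][0]


# SQLite stand-in for: SELECT DISTINCT * INTO unique_customers FROM customers
conn.execute("CREATE TABLE unique_customers AS SELECT DISTINCT * FROM customers")

# Verify the copy before touching the original table.
expected = scalar("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM customers)")
copied = scalar("SELECT COUNT(*) FROM unique_customers")

swapped = False
if copied == expected:
    conn.execute("DROP TABLE customers")
    # SQLite stand-in for: EXEC sp_rename 'unique_customers', 'customers'
    conn.execute("ALTER TABLE unique_customers RENAME TO customers")
    swapped = True
conn.commit()

final_rows = conn.execute(
    "SELECT Name, Email FROM customers ORDER BY Name"
).fetchall()
print(final_rows)
```

Guarding the DROP behind a row-count check keeps the original table in place if the copy does not look as expected.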
Using ROW_NUMBER()
You can also use the ROW_NUMBER() function to remove duplicate records in SQL Server. Assume you have a Customers table with duplicate rows based on the CustomerName column, and you want to delete all but the first occurrence for each duplicate group.
-- Common Table Expression (CTE) to assign a row number to each customer
WITH CTE AS (
SELECT CustomerID, CustomerName, ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CustomerID ASC) AS RowNum
FROM Customers
)
-- Delete rows from the CTE
DELETE FROM CTE
WHERE RowNum > 1;
Code explanation
This SQL code snippet uses a Common Table Expression (CTE) to help identify and remove duplicate customer records from a table named Customers.
- CTE definition:
- The WITH CTE AS (...) clause creates a temporary result set named CTE.
- Inside the CTE, the SELECT statement retrieves CustomerID and CustomerName from the Customers table.
- ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CustomerID ASC) AS RowNum assigns a row number to each customer entry, partitioned by CustomerName. This means for each distinct CustomerName, the row numbers start at 1 and increment based on CustomerID.
- Delete operation:
- The DELETE FROM CTE WHERE RowNum > 1; statement removes duplicate entries, keeping only the first occurrence of each CustomerName.
- Because the CTE is defined over the Customers table, SQL Server applies the delete to the underlying rows: any row whose row number (RowNum) is greater than 1 is removed, while the first entry for each customer is retained.
The goal of this code is to eliminate duplicate customer records based on CustomerName, ensuring that only the first occurrence (smallest CustomerID) is kept in the Customers table.
Using temporary table
SQL Server does not support the CREATE TEMPORARY TABLE syntax. Instead, you can create a temporary table with SELECT ... INTO (or CREATE TABLE), using # as a prefix for the table name.
-- Create a temporary table
SELECT MIN(CustomerID) AS ID, CustomerName
INTO #temp_customers
FROM customers
GROUP BY CustomerName;
-- Delete rows from the 'customers' table where the ID is not in the temporary table
DELETE FROM customers
WHERE CustomerID NOT IN (SELECT ID FROM #temp_customers);
-- Optionally drop the temporary table after use
DROP TABLE #temp_customers;
Code explanation
This SQL code snippet is performing a data cleanup operation on the customers table. Here's a breakdown of what each part does:
- Create a temporary table:
SELECT MIN(CustomerID) AS ID, CustomerName
INTO #temp_customers
FROM customers
GROUP BY CustomerName;
This part creates a temporary table named #temp_customers. It selects the minimum CustomerID for each CustomerName from the customers table, effectively identifying the first occurrence of each customer name.
- Delete unwanted rows:
DELETE FROM customers
WHERE CustomerID NOT IN (SELECT ID FROM #temp_customers);
This deletes rows from the customers table where the CustomerID does not match any ID in the #temp_customers table. Essentially, it removes duplicate customer entries, keeping only the one with the smallest CustomerID.
- Drop the temporary table:
DROP TABLE #temp_customers;
This line removes the temporary table #temp_customers after it's no longer needed, cleaning up the temporary data used for the operation.
Overall, the code aims to eliminate duplicate customer entries in the customers table, retaining only the entry with the smallest CustomerID for each unique CustomerName.
Best practices
Duplicate rows are a common problem affecting data quality and database performance. Consider the following best practices to prevent duplicate records from being inserted in your database.
- Use primary keys: A primary key guarantees that each row has a unique identifier, so the same key value cannot be inserted twice.
- Implement unique constraints: Applying unique constraints to non-primary key columns, such as email addresses or phone numbers, ensures no duplicate values exist in those columns.
- Proper database design and normalization: Effective schema design and database normalization help reduce redundancy and duplicate data by storing each fact in one place.
- Use unique indexes: A unique index enforces uniqueness on a column or a combination of columns. Some systems also allow the index to cover only a subset of rows, which avoids imposing the rule on the entire dataset.
- Regular data audits: Perform regular data audits by running queries to identify potential duplicates based on your business rules.
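As a small illustration of the first two practices, the sketch below declares a primary key and a unique constraint and then attempts to insert a duplicate email. It assumes Python's sqlite3 module; the Email column and the sample values are hypothetical, and other databases report the violation with their own error types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        ID    INTEGER PRIMARY KEY,   -- unique row identifier
        Name  TEXT NOT NULL,
        Email TEXT NOT NULL UNIQUE   -- no two rows may share an email
    )
    """
)
conn.execute(
    "INSERT INTO customers (Name, Email) VALUES (?, ?)",
    ("Alice", "alice@example.com"),
)

# A second row with the same email violates the UNIQUE constraint.
try:
    conn.execute(
        "INSERT INTO customers (Name, Email) VALUES (?, ?)",
        ("Alice B.", "alice@example.com"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

stored_rows = conn.execute("SELECT COUNT(*) FROM customers").fetchall()[0][0]
print(duplicate_rejected, stored_rows)
```

With the constraint in place, the duplicate never reaches the table, so no cleanup query is needed later.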
Conclusion
Identifying and removing duplicate rows is important to maintaining database efficiency and data accuracy. It is always a best practice to back up your data before making modifications to ensure no accidental data loss occurs.
Frequently Asked SQL Questions
What causes duplicate rows in SQL databases?
Duplicate rows can occur due to several factors, including improper database design, missing primary keys, data integration from multiple sources, manual data entry errors, or data migration issues where validation isn’t properly enforced.
Can I prevent duplicates based on multiple columns?
Yes, you can enforce uniqueness across multiple columns using composite keys or unique constraints. This ensures that combinations of values across those columns remain unique.
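A minimal sketch of a multi-column rule is shown below. It uses a hypothetical enrollments table in which each student may appear many times and each course may appear many times, but the same pair may appear only once. Python's sqlite3 module is assumed for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL,
        course_id  INTEGER NOT NULL,
        UNIQUE (student_id, course_id)  -- the combination must be unique
    )
    """
)

# Repeating a student or a course on its own is allowed.
conn.executemany(
    "INSERT INTO enrollments (student_id, course_id) VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 10)],
)

# Repeating an existing (student_id, course_id) pair is not.
try:
    conn.execute(
        "INSERT INTO enrollments (student_id, course_id) VALUES (?, ?)",
        (1, 10),
    )
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True

enrollment_count = conn.execute(
    "SELECT COUNT(*) FROM enrollments"
).fetchall()[0][0]
print(pair_rejected, enrollment_count)
```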
How does the DISTINCT keyword remove duplicate rows?
Using the DISTINCT keyword only removes duplicates in the query results and does not alter the underlying data.
Which method can you use to permanently delete duplicate records from the database?
You can use ROW_NUMBER() with DELETE, DELETE with subquery, GROUP BY with HAVING clause, and temporary tables for batch processing to permanently delete duplicate rows from the database.
Can duplicates affect the performance of my database?
Yes, duplicates can negatively impact performance by increasing storage costs, slowing queries, and complicating data analysis.