Mar 20

SQL remove duplicates: Comprehensive methods and best practices

Author: Allan Ouko

Explore the different methods for filtering out and permanently removing duplicate rows using SQL. Learn the practical applications of how to remove duplicates in SQL Server, MySQL, and PostgreSQL.

Duplicate records are a common issue that can compromise data integrity and database performance. Removing these duplicates is essential for maintaining data accuracy, optimizing storage, and improving query performance. In this article, we will explore techniques for removing duplicate rows in SQL, tailored to different use cases and database management systems.

Understanding duplicate rows in SQL

Duplicate rows in SQL refer to records within a table that contain identical values across all or selected columns. The common causes of duplicate rows in SQL include the following:

  • Missing primary keys: When a table lacks a primary key or unique constraint, nothing prevents the same data from being inserted twice. Poorly normalized tables, which store the same facts in more than one place, make this more likely.
  • Data integration issues: When merging datasets from different sources, improper joins or inconsistencies in data formats can accidentally introduce duplicates.
  • Manual data entry errors: Human error, such as entering the same record multiple times, is another common cause of duplicate rows.

In the rest of the article, we will look at how to remove duplicates in SQL, and we will divide the article into two blocks. In the first section, we will cover how to remove duplicates in the data that you are retrieving for a report or dashboard; in the second section, we will look at how to remove duplicates in the database.

Methods for removing duplicates in the data you retrieve

There are different methods of removing duplicates while retrieving records in SQL, and their availability depends on the DBMS, such as SQL Server, MySQL, or PostgreSQL. In this section, we will look at these methods while highlighting any special considerations for each database. Keep in mind that the SELECT-based methods (DISTINCT and ROW_NUMBER() with a CTE) only filter the result set and do not modify the underlying table, whereas the DELETE-based examples (GROUP BY with a join, and the self-join) do remove rows from the table.

Using DISTINCT keyword

The DISTINCT keyword is used in a SELECT statement to retrieve unique rows. The DISTINCT keyword syntax for removing duplicates is similar for MySQL, PostgreSQL, and SQL Server databases. The query below will retrieve unique customer names from the customers table.

SELECT DISTINCT Name
FROM customers;

Code explanation

The code snippet is a SQL query that retrieves unique values from the "Name" column in the "customers" table. Here's a breakdown:

  • SELECT DISTINCT Name: This part selects unique names, meaning it removes any duplicate entries from the results.
  • FROM customers: This specifies the table from which to retrieve the data, in this case, the "customers" table.

The query aims to provide a list of distinct customer names without any repetitions.
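
DISTINCT applies to the full combination of selected columns, not only to the first one. As a small sketch, assuming the customers table also has a City column (a hypothetical column, not part of the original example), the first query below returns one row per unique Name and City pair, so the same name can still appear more than once if it occurs in different cities. The second query compares the total row count with the number of unique names, which gives a quick sense of how many duplicates exist.

```sql
-- Assumption: customers has a City column (hypothetical, for illustration only)
-- Returns one row per unique (Name, City) combination
SELECT DISTINCT Name, City
FROM customers;

-- Compare the total number of rows with the number of unique names
SELECT COUNT(*)             AS total_rows,
       COUNT(DISTINCT Name) AS unique_names
FROM customers;
```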

Using GROUP BY with aggregate functions

The GROUP BY clause, combined with aggregate functions like MAX(), MIN(), or COUNT(), can help remove duplicate records from tables. Grouping identifies one record per group to retain, and the remaining duplicates can then be deleted.

Suppose you want to delete duplicate customer records but keep the one with the highest ID. You will use the GROUP BY clause with the MAX() function, as shown below.

-- Delete duplicate rows from the 'customers' table (aliased as c1)
DELETE c1
FROM customers c1
-- Find the maximum ID for each unique Name
JOIN (
    SELECT Name, MAX(ID) AS MaxID
    FROM customers
    GROUP BY Name
) c2
-- Match rows based on 'Name' and keep the row with the maximum ID
ON c1.Name = c2.Name
AND c1.ID < c2.MaxID;

Code explanation

This SQL code snippet is designed to remove duplicate rows from the 'customers' table based on the 'Name' column, keeping only the row with the highest 'ID' for each unique 'Name'. Here's how it works:

  1. DELETE statement: The DELETE c1 statement specifies that rows will be deleted from the 'customers' table, which is aliased as c1.
  2. JOIN subquery: The subquery (SELECT Name, MAX(ID) AS MaxID FROM customers GROUP BY Name) c2 creates a temporary result set (c2) that contains each unique 'Name' and the maximum 'ID' associated with that 'Name'.
  3. JOIN condition: The ON c1.Name = c2.Name AND c1.ID < c2.MaxID condition ensures that the deletion only affects rows where the 'Name' matches and the 'ID' is less than the maximum 'ID' for that 'Name'. This effectively keeps the row with the highest 'ID' for each 'Name' and deletes the others.

In summary, this code removes duplicate entries in the 'customers' table by retaining only the entry with the highest 'ID' for each unique 'Name'.

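MySQL and SQL Server support the above DELETE ... JOIN syntax with GROUP BY and aggregate functions. PostgreSQL does not accept a JOIN in a DELETE statement; it uses DELETE ... USING for the same purpose.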
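
PostgreSQL does not accept the DELETE ... JOIN form, but a roughly equivalent statement can be sketched with its DELETE ... USING clause. The example below assumes the same customers table with ID and Name columns and keeps the row with the highest ID for each Name.

```sql
-- PostgreSQL: delete every row whose ID is below the maximum ID for its Name
DELETE FROM customers c1
USING (
    -- One row per Name with the highest ID in that group
    SELECT Name, MAX(ID) AS MaxID
    FROM customers
    GROUP BY Name
) c2
WHERE c1.Name = c2.Name
  AND c1.ID < c2.MaxID;
```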

Using ROW_NUMBER() with Common Table Expressions (CTE)

With the ROW_NUMBER() function combined with a Common Table Expression (CTE), you can filter out duplicates based on your criteria. The ROW_NUMBER function, when used with PARTITION BY and ORDER BY clauses, assigns a unique sequential number to each row. This method allows for filtering out the rows that do not meet the required criteria.

The following query identifies duplicates and returns only the first occurrence of each name.

-- Common Table Expression (CTE) to rank rows based on 'Name'
WITH CTE AS (
    SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
    FROM customers
)
-- Select only the unique records where RowNum = 1
SELECT ID, Name
FROM CTE
WHERE RowNum = 1;

Code explanation

This SQL code snippet uses a Common Table Expression (CTE) to help filter unique records based on the 'Name' column in the 'customers' table.

  1. CTE definition: The WITH CTE AS clause defines a temporary result set named CTE. Inside the CTE, the ROW_NUMBER() function is used to assign a unique sequential integer to rows within each partition of the 'Name' column. The PARTITION BY Name clause groups rows with the same 'Name', and ORDER BY ID ASC orders them by 'ID' within each group.
  2. Row filtering: The main query selects from the CTE and filters rows where RowNum = 1. This effectively selects the first occurrence of each 'Name', ensuring that only unique records based on 'Name' are returned.

The overall goal of this code is to retrieve unique records from the 'customers' table, ensuring that each 'Name' appears only once, based on the lowest 'ID' value.

This method works in SQL Server, PostgreSQL, and MySQL 8.0 or later. It is useful for larger datasets or more complex conditions, as it allows you to specify exactly which duplicate to keep.
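
To illustrate how PARTITION BY and ORDER BY control which duplicate is kept, here is a variant that treats rows as duplicates only when both Name and Email match, and keeps the most recently created row. The Email and CreatedAt columns are assumptions made for this sketch and do not appear in the original example.

```sql
-- Assumption: customers has Email and CreatedAt columns (hypothetical)
WITH Ranked AS (
    SELECT ID, Name, Email, CreatedAt,
           ROW_NUMBER() OVER (
               PARTITION BY Name, Email          -- duplicates share both Name and Email
               ORDER BY CreatedAt DESC, ID DESC  -- newest row first; ID breaks ties
           ) AS RowNum
    FROM customers
)
-- Keep only the newest row in each (Name, Email) group
SELECT ID, Name, Email, CreatedAt
FROM Ranked
WHERE RowNum = 1;
```

Adding ID as a second sort key makes the choice deterministic when two rows share the same CreatedAt value.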

Removing duplicates using self-JOIN

A self-join allows you to compare a table to itself, making it helpful in identifying and removing duplicate rows by comparing records based on specific criteria. The following example uses the self-join to delete the row with the higher ID, keeping only the first occurrence of each name.

-- Delete duplicate rows using self-join
DELETE c1
FROM customers c1
JOIN customers c2
ON c1.Name = c2.Name AND c1.ID > c2.ID;

Code explanation

This SQL snippet is designed to remove duplicate rows from the customers table based on the Name column. Here's how it works:

  • Self-Join: The customers table is joined with itself (c1 and c2 are aliases for the same table). This allows comparison between two rows in the same table.
  • Condition: The ON clause specifies that the join should occur where c1.Name is equal to c2.Name, meaning it identifies rows with duplicate names.
  • ID comparison: The condition c1.ID > c2.ID ensures that only one of the duplicate rows is retained. It keeps the row with the smaller ID and deletes the one with the larger ID.
  • DELETE statement: The DELETE c1 command removes the duplicate row identified by the self-join condition.

Overall, this code aims to clean up the customers table by removing duplicate entries based on the Name column, keeping the row with the smallest ID for each duplicate set.

The above DELETE ... JOIN syntax works in MySQL and SQL Server. PostgreSQL does not support it and requires DELETE ... USING instead.
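
For PostgreSQL, the same self-join idea can be sketched with DELETE ... USING, which takes the place of the JOIN clause. This assumes the same customers table with ID and Name columns.

```sql
-- PostgreSQL: delete any row for which another row
-- with the same Name and a smaller ID exists
DELETE FROM customers c1
USING customers c2
WHERE c1.Name = c2.Name
  AND c1.ID > c2.ID;
```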

Methods for removing duplicates in the database

While you can remove duplicate records using queries, you can also permanently delete them from the database. This approach is important for maintaining data quality. The following methods are used to remove duplicates from the database.

Using ROW_NUMBER() and DELETE

The ROW_NUMBER() function assigns a sequential number to rows within a defined partition. When used with the DELETE statement, it helps identify duplicates by ranking rows based on specific columns and removing unwanted records. This method applies to modern versions of MySQL (from 8.0), PostgreSQL, and SQL Server.

Suppose you want to remove duplicate customer records based on the Name column, keeping only the first occurrence (smallest ID):

-- Common Table Expression (CTE) to rank rows based on 'Name'
WITH CTE AS (
    SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum
    FROM customers
)
-- Delete rows from the 'customers' table where the row number is greater than 1
DELETE FROM customers
WHERE ID IN (SELECT ID FROM CTE WHERE RowNum > 1);
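
Before running a DELETE like this, it can help to preview which rows it would remove. The sketch below reuses the same CTE and adds a KeptID column (a name chosen for this example) showing the ID that would survive in each group. It only reads data and does not change the table.

```sql
-- Preview the rows the DELETE would remove, alongside the ID that would be kept
WITH CTE AS (
    SELECT ID, Name,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY ID ASC) AS RowNum,
           MIN(ID) OVER (PARTITION BY Name) AS KeptID
    FROM customers
)
SELECT ID AS IdToDelete, Name, KeptID
FROM CTE
WHERE RowNum > 1
ORDER BY Name, ID;
```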

Using DELETE with subquery

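Sometimes, a simple DELETE operation using a subquery can remove duplicates from the database. This method is suitable for older versions of MySQL or PostgreSQL where ROW_NUMBER() might not be available. Note that MySQL does not allow a subquery to select from the same table that the DELETE targets (error 1093), so in MySQL the subquery must be wrapped in a derived table.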

The query below deletes rows from the customers table where the ID is not the minimum for each Name, keeping only the row with the smallest ID for each unique Name.

-- Delete rows from the 'customers' table
DELETE FROM customers
WHERE ID NOT IN (
    -- Subquery to find the minimum ID for each unique Name
    SELECT MIN(ID)
    FROM customers
    GROUP BY Name
);

Code explanation

This SQL code snippet deletes duplicate rows from the 'customers' table, keeping only the row with the smallest ID for each unique Name. Here's how it works:

  • The DELETE FROM customers statement removes rows from the 'customers' table.
  • The WHERE ID NOT IN condition specifies which rows to delete. It targets rows whose IDs are not in the result of the subquery.
  • The subquery SELECT MIN(ID) FROM customers GROUP BY Name finds the smallest ID for each unique Name in the table. This ensures that only one row per Name, the one with the smallest ID, is retained.
  • As a result, any row with an ID not matching these minimum IDs is deleted, effectively removing duplicates while preserving one instance of each Name.
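
In MySQL, the statement above fails with error 1093 because the subquery reads from the table being deleted from. A common workaround, sketched here, is to wrap the subquery in a derived table so that MySQL materializes the list of IDs first. The alias keep_ids and the column name MinID are names chosen for this example.

```sql
-- MySQL: wrap the subquery in a derived table so the list of IDs to keep
-- is materialized before the DELETE runs against the same table
DELETE FROM customers
WHERE ID NOT IN (
    SELECT MinID
    FROM (
        SELECT MIN(ID) AS MinID
        FROM customers
        GROUP BY Name
    ) AS keep_ids
);
```

This form also runs in PostgreSQL and SQL Server, so it can serve as a portable version of the query.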

Using GROUP BY with HAVING clause

When you need to check for duplicate values in specific columns, the GROUP BY clause combined with the HAVING clause can be used to identify duplicates. This method allows you to delete specific rows based on the given criteria. The pattern works as written in SQL Server and PostgreSQL; MySQL requires the subquery to be wrapped in a derived table, because it cannot select from the table being deleted from in a subquery.

The following query deletes rows from the customers table where the ID belongs to a group of duplicates.

-- Delete rows from the 'customers' table where there are duplicates
DELETE FROM customers
WHERE ID IN (
    -- Subquery to find IDs of duplicate rows
    SELECT ID
    FROM customers
    GROUP BY ID
    HAVING COUNT(*) > 1
);

Code explanation

This SQL code snippet is designed to delete duplicate rows from the 'customers' table based on the 'ID' column. Here's how it works:

  1. DELETE statement: The main operation is a DELETE statement, which removes rows from the 'customers' table.
  2. WHERE clause with subquery: The WHERE clause specifies which rows to delete. It uses a subquery to identify these rows.
  3. Subquery: The subquery selects IDs from the 'customers' table. It groups the rows by 'ID' using GROUP BY ID.
  4. HAVING clause: The HAVING COUNT(*) > 1 condition filters the grouped results to find IDs that appear more than once, indicating duplicates.
  5. IN operator: The IN operator in the main query checks if the 'ID' of a row is in the list of duplicate IDs returned by the subquery. If it is, that row will be deleted.

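The code removes every row whose ID appears more than once in the 'customers' table. Note that it does not keep one copy of each duplicated ID: all rows in a duplicate group are deleted, leaving only the rows whose IDs were already unique.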
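
GROUP BY with HAVING is often most useful as a read-only check before deleting anything. The sketch below groups by Name rather than ID, which matches the other examples in this article, and reports how many times each duplicated name occurs together with the ID that a "keep the smallest ID" strategy would retain. The column aliases are names chosen for this example.

```sql
-- List each Name that appears more than once,
-- with its number of occurrences and the smallest ID in the group
SELECT Name,
       COUNT(*) AS Occurrences,
       MIN(ID)  AS KeptID
FROM customers
GROUP BY Name
HAVING COUNT(*) > 1
ORDER BY Occurrences DESC;
```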

Using temporary tables for batch processing

Temporary tables are useful for batch processing and removing duplicates in large datasets, particularly where a single query can cause performance issues. The following query creates a temporary table to store the minimum customer_id for each customer_name and then deletes rows from the customers table whose customer_id is not in the temp_customers table.

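-- Create a temporary table holding the smallest customer_id for each customer_name
CREATE TEMPORARY TABLE temp_customers AS
SELECT MIN(customer_id) AS ID, customer_name
FROM customers
GROUP BY customer_name;
-- Delete rows from 'customers' whose customer_id is not in the temporary table
DELETE FROM customers
WHERE customer_id NOT IN (SELECT ID FROM temp_customers);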

Code explanation

This SQL code snippet is designed to remove duplicate entries from the customers table based on the customer_name field, keeping only the entry with the smallest customer_id for each name.

  1. Create a temporary table:
    • CREATE TEMPORARY TABLE temp_customers AS SELECT MIN(customer_id) AS ID, customer_name FROM customers GROUP BY customer_name;
    • This line creates a temporary table named temp_customers. It selects the smallest customer_id for each customer_name from the customers table, effectively identifying the unique entries to keep.
  2. Delete duplicates:
    • DELETE FROM customers WHERE customer_id NOT IN (SELECT ID FROM temp_customers);
    • This line deletes rows from the customers table where the customer_id is not in the list of IDs from the temp_customers table. This means it removes all duplicate entries, leaving only the ones with the smallest customer_id for each customer_name.

Overall, the code aims to clean up the customers table by removing duplicate entries based on customer_name, keeping only the entry with the smallest customer_id.

Of the databases covered here, MySQL and PostgreSQL support the CREATE TEMPORARY TABLE syntax above. SQL Server uses a different syntax, covered in the SQL Server section below.
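
One advantage of the temporary table is that the list of rows to keep can be inspected before anything is deleted. The following standalone sketch for MySQL or PostgreSQL builds the same temp_customers table, counts the rows that would be removed, and then performs the delete with NOT EXISTS instead of NOT IN. NOT EXISTS is used here as a design choice because it behaves predictably even if the keep-list were to contain a NULL.

```sql
-- Build the list of IDs to keep (same definition as in the example above)
CREATE TEMPORARY TABLE temp_customers AS
SELECT MIN(customer_id) AS ID, customer_name
FROM customers
GROUP BY customer_name;

-- Count rows with no match in the keep-list; these are the rows the DELETE removes
SELECT COUNT(*) AS rows_to_delete
FROM customers c
LEFT JOIN temp_customers t ON c.customer_id = t.ID
WHERE t.ID IS NULL;

-- Delete every row whose customer_id is absent from the keep-list
DELETE FROM customers
WHERE NOT EXISTS (
    SELECT 1
    FROM temp_customers t
    WHERE t.ID = customers.customer_id
);
```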

Remove duplicates in SQL Server

SQL Server offers different methods of removing duplicate records from the database. These methods include using DISTINCT with INTO, ROW_NUMBER(), and temporary tables.

Using DISTINCT with INTO

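You can use the DISTINCT keyword in a SELECT ... INTO statement to create a new table with unique records. Because SELECT DISTINCT * compares every column, this only removes rows that are identical in all columns. SELECT ... INTO also does not copy indexes, constraints, or triggers, so these must be recreated on the new table. You can drop the old table once you verify the new table has the expected records. The following example creates the unique_customers table with unique records from the customers table.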

-- Select distinct rows from 'customers' and create a new table 'unique_customers'
SELECT DISTINCT *
INTO unique_customers
FROM customers;
-- Drop the original 'customers' table to remove it from the database
DROP TABLE customers;
-- Rename the 'unique_customers' table to 'customers' to replace the original table
EXEC sp_rename 'unique_customers', 'customers';

Code explanation

This SQL code snippet is performing a series of operations on a database table named 'customers':

  1. Select distinct rows: The SELECT DISTINCT * INTO unique_customers FROM customers; statement selects all unique rows from the 'customers' table and creates a new table called 'unique_customers' with these distinct rows. This helps in removing any duplicate entries from the original table.
  2. Drop original table: The DROP TABLE customers; statement deletes the original 'customers' table from the database. This is done to remove the old table that might have duplicate entries.
  3. Rename new table: The EXEC sp_rename 'unique_customers', 'customers'; statement renames the 'unique_customers' table back to 'customers'. This step effectively replaces the original table with the new, deduplicated version.

Overall, the code aims to remove duplicate rows from the 'customers' table by creating a new table with distinct entries, deleting the old table, and renaming the new table to maintain the original table name.
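
Since DROP TABLE cannot be undone outside a transaction, a more cautious variant is sketched below. It compares row counts before touching the original and renames the original table instead of dropping it. The name customers_old is a placeholder chosen for this example.

```sql
-- SQL Server: build the deduplicated copy
SELECT DISTINCT *
INTO unique_customers
FROM customers;

-- Compare row counts before touching the original table
SELECT (SELECT COUNT(*) FROM customers)        AS original_rows,
       (SELECT COUNT(*) FROM unique_customers) AS distinct_rows;

-- Keep the original under a new name instead of dropping it immediately
EXEC sp_rename 'customers', 'customers_old';
EXEC sp_rename 'unique_customers', 'customers';
```

Once the new table has been verified and its indexes and constraints recreated, customers_old can be dropped.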

Using ROW_NUMBER()

You can also use the ROW_NUMBER() function to remove duplicate records in SQL Server. Assume you have a Customers table with duplicate rows based on the CustomerName column, and you want to delete all but the first occurrence for each duplicate group.

-- Common Table Expression (CTE) to assign a row number to each customer
WITH CTE AS (
    SELECT CustomerID, CustomerName, ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CustomerID ASC) AS RowNum
    FROM Customers
)
-- Delete rows from the CTE
DELETE FROM CTE
WHERE RowNum > 1;

Code explanation

This SQL code snippet uses a Common Table Expression (CTE) to help identify and remove duplicate customer records from a table named Customers.

  1. CTE definition:
    • The WITH CTE AS (...) clause creates a temporary result set named CTE.
    • Inside the CTE, the SELECT statement retrieves CustomerID and CustomerName from the Customers table.
    • ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CustomerID ASC) AS RowNum assigns a unique row number to each customer entry, partitioned by CustomerName. This means for each distinct CustomerName, the row numbers start at 1 and increment based on CustomerID.
  2. Delete operation:
    • The DELETE FROM CTE WHERE RowNum > 1; statement removes duplicate entries, keeping only the first occurrence of each CustomerName.
    • In SQL Server, deleting through a CTE that is based on a single table deletes the matching rows from the underlying Customers table, so every row with a row number (RowNum) greater than 1 is removed while the first entry for each customer is retained.

The goal of this code is to eliminate duplicate customer records based on CustomerName, ensuring that only the first occurrence (smallest CustomerID) is kept in the Customers table.
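
Because this statement deletes rows from the underlying table, it can be useful to rehearse it inside a transaction first. The sketch below, for SQL Server, runs the same delete, reports how many rows were removed, checks that no duplicated names remain, and then rolls everything back. Replacing the final ROLLBACK with COMMIT would make the change permanent.

```sql
BEGIN TRANSACTION;

WITH CTE AS (
    SELECT CustomerID, CustomerName,
           ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CustomerID ASC) AS RowNum
    FROM Customers
)
DELETE FROM CTE
WHERE RowNum > 1;

-- Number of rows removed by the DELETE above
SELECT @@ROWCOUNT AS deleted_rows;

-- Any CustomerName still appearing more than once? This should return no rows.
SELECT CustomerName, COUNT(*) AS remaining
FROM Customers
GROUP BY CustomerName
HAVING COUNT(*) > 1;

-- Dry run: undo the delete. Use COMMIT TRANSACTION instead to keep the change.
ROLLBACK TRANSACTION;
```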

Using temporary table

Since SQL Server does not support the CREATE TEMPORARY TABLE statement, you use SELECT ... INTO instead. Temporary tables in SQL Server use # as a prefix for the table name.

-- Create a temporary table
SELECT MIN(CustomerID) AS ID, CustomerName
INTO #temp_customers
FROM customers
GROUP BY CustomerName;
-- Delete rows from the 'customers' table where the ID is not in the temporary table
DELETE FROM customers
WHERE CustomerID NOT IN (SELECT ID FROM #temp_customers);
-- Optionally drop the temporary table after use
DROP TABLE #temp_customers;

Code explanation

This SQL code snippet is performing a data cleanup operation on the customers table. Here's a breakdown of what each part does:

  1. Create a temporary table:

    SELECT MIN(CustomerID) AS ID, CustomerName
    INTO #temp_customers
    FROM customers
    GROUP BY CustomerName;

    This part creates a temporary table named #temp_customers. It selects the minimum CustomerID for each CustomerName from the customers table, effectively identifying the first occurrence of each customer name. 
  2. Delete unwanted rows:

    DELETE FROM customers
    WHERE CustomerID NOT IN (SELECT ID FROM #temp_customers);

    This deletes rows from the customers table where the CustomerID does not match any ID in the #temp_customers table. Essentially, it removes duplicate customer entries, keeping only the one with the smallest CustomerID.
  3. Drop the temporary table:

    DROP TABLE #temp_customers;

    This line removes the temporary table #temp_customers after it's no longer needed, cleaning up the temporary data used for the operation. 

Overall, the code aims to eliminate duplicate customer entries in the customers table, retaining only the entry with the smallest CustomerID for each unique CustomerName.

Best practices

Duplicate rows are a common problem affecting data quality and database performance. Consider the following best practices to prevent duplicate records from being inserted in your database.

  • Use primary keys: A primary key guarantees that every row can be uniquely identified and prevents two rows from sharing the same key value.
  • Implement unique constraints: Applying unique constraints to columns such as email addresses or phone numbers ensures no duplicates exist in non-primary key columns.
  • Proper database design and normalization: Effective schema design and database normalization help reduce redundancy and duplicate data by storing each piece of information in exactly one table.
  • Use unique indexes: Use unique indexes to enforce uniqueness on specific column combinations. Some systems also allow a unique index over a subset of rows (filtered indexes in SQL Server, partial indexes in PostgreSQL).
  • Regular data audits: Perform regular data audits by running queries to identify potential duplicates based on your business rules.
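
As a brief illustration of the constraint-based practices above, the statements below add a single-column and a composite unique constraint. The Email and Phone columns and the constraint names are assumptions made for this sketch. Both statements fail if the table already contains duplicates in those columns, so existing duplicates need to be removed first.

```sql
-- Assumption: customers has Email and Phone columns (hypothetical)

-- Single-column uniqueness: no two customers may share an email address
ALTER TABLE customers
ADD CONSTRAINT uq_customers_email UNIQUE (Email);

-- Composite uniqueness: a Name may repeat, but not together with the same Phone
ALTER TABLE customers
ADD CONSTRAINT uq_customers_name_phone UNIQUE (Name, Phone);
```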

Conclusion

Identifying and removing duplicate rows is important to maintaining database efficiency and data accuracy. It is always a best practice to back up your data before making modifications to ensure no accidental data loss occurs.
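
As a lightweight safeguard before a one-off cleanup, the table can be copied first. This is only a sketch: a table copy is not a substitute for a proper database backup, and customers_backup is a name chosen for this example.

```sql
-- MySQL / PostgreSQL: snapshot the table before deduplicating
CREATE TABLE customers_backup AS
SELECT *
FROM customers;

-- SQL Server equivalent (uses SELECT ... INTO instead of CREATE TABLE ... AS):
-- SELECT * INTO customers_backup FROM customers;
```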

Frequently Asked SQL Questions

What causes duplicate rows in SQL databases?
Duplicate rows can occur due to several factors, including improper database design, missing primary keys, data integration from multiple sources, manual data entry errors, or data migration issues where validation isn’t properly enforced.

Can I prevent duplicates based on multiple columns?
Yes, you can enforce uniqueness across multiple columns using composite keys or unique constraints. This ensures that combinations of values across those columns remain unique.

How does the DISTINCT keyword remove duplicate rows?
DISTINCT compares the values of every column in the SELECT list and returns one row for each unique combination. It only removes duplicates from the query results and does not alter the underlying data.

Which method can you use to permanently delete duplicate records from the database?
You can use ROW_NUMBER() with DELETE, DELETE with subquery, GROUP BY with HAVING clause, and temporary tables for batch processing to permanently delete duplicate rows from the database.

Can duplicates affect the performance of my database?
Yes, duplicates can negatively impact performance by increasing storage costs, slowing queries, and complicating data analysis.
