SQL Scripts to check for duplicates

Check for Duplicate Rows in a Table

Situation: When running a data pipeline, it is often easier to check for data duplicates after load rather than during extract or transform (ETL).

Context: The script below checks for the subgraph_bank_transactions table, a pipeline that pulls in data from the $BANK token subgraph into a postgres database. This is part of the DAODash project.

Approach 1: id has duplicates

If the query returned no data, that means there are no duplicates.

In this case, id does have duplicates.

SELECT
  id,
  COUNT(id)
FROM subgraph_bank_transactions 
GROUP BY id
HAVING COUNT(id) > 1;
id count
32165 2
31561 2
32657 4

Approach 1a: graph_id does not have duplicates

SELECT
  graph_id,
  COUNT(graph_id)
FROM subgraph_bank_transactions 
GROUP BY graph_id
HAVING COUNT(graph_id) > 1;

Result: The query returned no data

Approach 2

SELECT 
  a.*
FROM subgraph_bank_transactions a
JOIN (SELECT id, graph_id, amount_display, COUNT(*)
      FROM subgraph_bank_transactions 
      GROUP BY id, graph_id, amount_display
      HAVING COUNT(*) > 1) b 
ON a.id = b.id
AND a.graph_id = b.graph_id
AND a.amount_display = b.amount_display 
ORDER BY a.id

Sources:

Learnsql.com

Chartio Data Tutorials

For more content on web3 data find me on Twitter.

Previous
Next