SQL Scripts to check for duplicates
Check for Duplicate Rows in a Table
Situation: When running a data pipeline, it is often easier to check for data duplicates after load rather than during extract or transform (ETL).
Context: The script below checks for the subgraph_bank_transactions
table, a pipeline that pulls in data from the $BANK token subgraph into a postgres database. This is part of the DAODash project.
Approach 1: id
has duplicates
If the query returned no data
, that means there are no duplicates.
In this case, id
does have duplicates.
SELECT
id,
COUNT(id)
FROM subgraph_bank_transactions
GROUP BY id
HAVING COUNT(id) > 1;
id | count |
---|---|
32165 | 2 |
31561 | 2 |
32657 | 4 |
Approach 1a: graph_id
does not have duplicates
SELECT
graph_id,
COUNT(graph_id)
FROM subgraph_bank_transactions
GROUP BY graph_id
HAVING COUNT(graph_id) > 1;
Result: The query returned no data
Approach 2
SELECT
a.*
FROM subgraph_bank_transactions a
JOIN (SELECT id, graph_id, amount_display, COUNT(*)
FROM subgraph_bank_transactions
GROUP BY id, graph_id, amount_display
HAVING COUNT(*) > 1) b
ON a.id = b.id
AND a.graph_id = b.graph_id
AND a.amount_display = b.amount_display
ORDER BY a.id
Sources:
For more content on web3 data find me on Twitter.