How to Remove Duplicates from an Apache Iceberg Table in Athena (Without a Unique ID or Timestamp)

Duplicate data in data lakes is one of the most common and frustrating challenges for data engineers.

But what happens when:

  • You don’t have a primary key
  • You don’t have a reliable incremental column
  • You don’t have a unique timestamp
  • And your Iceberg table already contains duplicate records?

In this blog, I will walk you through a real-world approach to detecting, removing, and safely restoring unique data in an Apache Iceberg table using Amazon Athena.

This method works even when your table has:

  • No unique constraint
  • No incremental load column
  • No CDC (Change Data Capture) metadata


🚨 The Problem: No Unique Identifier

Imagine you have an Iceberg table like this:

CREATE TABLE demo_silver_db.bom_meta (
    messageid string,
    accountid string,
    createdat bigint,
    ...
)
LOCATION 's3://your-bucket/bom_meta/'  -- replace with your table's S3 location
TBLPROPERTIES (
    'table_type' = 'ICEBERG'
);

Over time, due to:

  • Multiple ingestion retries
  • Reprocessing jobs
  • Failed merges
  • Streaming duplicates

you end up with duplicate rows.

But there is no single unique column to identify duplicates.


🧠 Strategy Overview

Since Iceberg supports:

  • Snapshots
  • Time travel
  • Row-level DELETE
  • Insert overwrite patterns

We can:

  1. Go back to a historical snapshot
  2. Identify duplicate groups
  3. Extract unique records using DISTINCT
  4. Reinsert cleaned data
  5. Create a fresh snapshot

This gives us a clean version of the table.


🔍 Step 1: Identify Duplicate Records

We first detect duplicates based on business logic.

Example duplicate definition:

messageid + accountid + year + month + day + hour

SELECT 
    messageid,
    accountid,
    year,
    month,
    day,
    hour,
    COUNT(*) AS duplicate_count
FROM demo_silver_db.bom_meta
GROUP BY 
    messageid,
    accountid,
    year,
    month,
    day,
    hour
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;

This tells us where duplicates exist.


⏪ Step 2: Go Back to Historical Snapshot

Iceberg allows time travel.

SELECT *
FROM demo_silver_db.bom_meta
FOR TIMESTAMP AS OF TIMESTAMP '2026-03-02 04:15:41.902 UTC';

This reads the table as it existed at that exact time.

No data is modified — it’s a safe read.
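
If you are not sure which timestamp to travel back to, Athena exposes Iceberg metadata tables. A sketch using the `$history` metadata table to list when each snapshot became current (column names per the Athena Iceberg documentation):

SELECT made_current_at, snapshot_id, is_current_ancestor
FROM "demo_silver_db"."bom_meta$history"
ORDER BY made_current_at DESC;

Pick a `made_current_at` value from before the bad ingestion and use it in the `FOR TIMESTAMP AS OF` clause.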


🧹 Step 3: Reinsert Only Unique Records

We use SELECT DISTINCT to remove duplicate rows and reinsert them into the table.
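One caveat first: if the duplicate rows are still present in the current table, inserting distinct rows from the snapshot would only add more copies. A sketch of deleting the affected groups beforehand, assuming Athena's row-level DELETE for Iceberg and the same duplicate definition as Step 1:

-- Remove every row belonging to a duplicate group from the current table;
-- the distinct versions are restored from the snapshot in the INSERT below.
DELETE FROM demo_silver_db.bom_meta
WHERE messageid IN (
    SELECT messageid
    FROM demo_silver_db.bom_meta
    GROUP BY messageid, accountid, year, month, day, hour
    HAVING COUNT(*) > 1
);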

INSERT INTO demo_silver_db.bom_meta
SELECT DISTINCT
    messageid,
    accountid,
    createdat,
    "data.CalcBem",
    "data.CalcOK",
    ...
    year,
    month,
    day,
    hour
FROM demo_silver_db.bom_meta
FOR TIMESTAMP AS OF TIMESTAMP '2026-03-02 04:15:41.902 UTC'
WHERE messageid IN (
    SELECT messageid
    FROM demo_silver_db.bom_meta
    FOR TIMESTAMP AS OF TIMESTAMP '2026-03-02 04:15:41.902 UTC'
    GROUP BY messageid, accountid, year, month, day, hour
    HAVING COUNT(*) > 1
);

What this does:

  • Reads duplicate data from historical snapshot
  • Removes exact duplicate rows
  • Reinserts only unique values
  • Creates a new Iceberg snapshot

🧾 Why This Works

Iceberg tables:

  • Do not enforce primary keys
  • Allow duplicate records
  • Track table versions using metadata

Using snapshot + distinct:

  • We avoid corrupting current data
  • We preserve historical lineage
  • We create a clean state

⚠ Important Considerations

1️⃣ DISTINCT works only if the entire row matches

If duplicates differ in even one column, DISTINCT will not remove them.

In that case, deduplicate by business key instead:

ROW_NUMBER() OVER (
    PARTITION BY messageid, accountid
    ORDER BY createdat ASC
)

2️⃣ Always Validate Before DELETE

Before deleting duplicates:

SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY messageid, accountid
            ORDER BY createdat
        ) AS rn
    FROM demo_silver_db.bom_meta
) t
WHERE rn > 1;

Never directly delete without previewing.


3️⃣ Every INSERT or DELETE Creates a New Snapshot

Iceberg automatically creates a new snapshot after:

  • INSERT
  • DELETE
  • MERGE

So rollback is always possible.
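
Athena can also read a table as of a specific snapshot id, which is useful for inspecting the state right before a bad write (the id below is a placeholder; real ids come from the `"bom_meta$snapshots"` metadata table):

-- Read the table as of a specific snapshot id
SELECT COUNT(*)
FROM demo_silver_db.bom_meta
FOR VERSION AS OF 949530903748831860;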


💡 Alternative: Row Number Based Deduplication (Recommended)

If your duplicates are defined by business key:

INSERT INTO demo_silver_db.bom_meta
SELECT
    messageid,
    accountid,
    createdat,
    ...
    year,
    month,
    day,
    hour
FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY messageid, accountid
            ORDER BY createdat
        ) AS rn
    FROM demo_silver_db.bom_meta
    FOR TIMESTAMP AS OF TIMESTAMP '2026-03-02 04:15:41.902 UTC'
) t
WHERE rn = 1;

Note that the outer SELECT must list the table's columns explicitly: SELECT * would also carry the helper rn column, which does not exist in the target table, and the insert would fail.

This keeps only one record per key — the earliest by createdat, per the ORDER BY.
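
Whichever pattern you use, verify the result before moving on. Re-running the duplicate check from Step 1 against the current table should now return zero rows:

-- Sanity check after deduplication: expect an empty result.
SELECT messageid, accountid, year, month, day, hour, COUNT(*) AS duplicate_count
FROM demo_silver_db.bom_meta
GROUP BY messageid, accountid, year, month, day, hour
HAVING COUNT(*) > 1;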


📊 When Should You Use This Approach?

Use this method when:

  • ❌ No primary key exists
  • ❌ No reliable incremental column
  • ❌ No CDC mechanism
  • ❌ Duplicate ingestion happened
  • ❌ Your Iceberg table already contains duplicates

🏁 Final Thoughts

Removing duplicates in Apache Iceberg without a unique identifier is possible using:

  • Time travel
  • Snapshot reads
  • DISTINCT
  • ROW_NUMBER
  • Controlled INSERT patterns

The key is:

Always define your business-level uniqueness first.

Iceberg gives you the tools — but you must define the logic.

