Text Data De-duplication Cleanup

Client: AI | Published: 28.02.2026

I need a clean, duplicate-free version of my text dataset. The source is a collection of plain-text CSV/Excel files that currently contain repeated lines, plus near-matches that must be treated as duplicates as well.

Deliverables:

• One cleaned master file (same structure as the original) with every duplicate entry removed.
• A brief log or summary showing how many rows were discarded, so I can track the impact.

Any method that reliably spots perfect and near-perfect text duplicates is fine: Excel Power Query, Google Sheets, Python/pandas, or a small script. Just keep the original column order and encoding intact so I can drop the file straight back into my workflow.

Before delivery, please spot-check a handful of rows to confirm no unique content was lost. Once that quick sanity check passes, send over the final file plus the duplicate-removal log and we're done.
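Since pandas is one of the tools mentioned, here is a minimal sketch of the two-pass approach the brief describes: drop exact duplicates first, then catch near-duplicates by comparing a normalized form of the text (lowercased, punctuation stripped, whitespace collapsed) with a similarity ratio. The column name `text`, the `normalize` helper, and the 0.9 threshold are illustrative assumptions, not part of the brief; the pairwise comparison is O(n²), so a larger dataset would call for blocking or hashing instead.

```python
import re
from difflib import SequenceMatcher

import pandas as pd


def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace (for near-match comparison only)."""
    no_punct = re.sub(r"[^\w\s]", "", str(text).lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def dedupe(df, text_col, threshold=0.9):
    """Return (cleaned DataFrame, number of rows removed).

    Pass 1 drops exact duplicate rows; pass 2 keeps the first occurrence of
    each near-duplicate group, judged by SequenceMatcher ratio on the
    normalized text. Column order is untouched throughout.
    """
    before = len(df)
    df = df.drop_duplicates()  # pass 1: byte-for-byte duplicate rows

    keep_idx, kept_norms = [], []
    for idx, row in df.iterrows():
        norm = normalize(row[text_col])
        # near-duplicate if normalized forms match or are highly similar
        if any(norm == k or SequenceMatcher(None, norm, k).ratio() >= threshold
               for k in kept_norms):
            continue
        keep_idx.append(idx)
        kept_norms.append(norm)

    cleaned = df.loc[keep_idx]
    return cleaned, before - len(cleaned)


if __name__ == "__main__":
    # Hypothetical sample; in practice read with pd.read_csv(..., dtype=str)
    df = pd.DataFrame({"text": ["Hello world!", "hello world",
                                "Hello, World", "Something else"]})
    cleaned, removed = dedupe(df, "text")
    print(f"Removed {removed} of {len(df)} rows")  # the requested removal log
    # Round-trip preserving structure: cleaned.to_csv(path, index=False, encoding="utf-8")
```

Writing the result with `to_csv(index=False, encoding=...)` (matching the source file's encoding) keeps the original column order, since pandas preserves it end to end.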