Big data provle: Automated Medicine Data Matching

I have two datasets that must be reconciled. • Master catalogue: 300 k rows × 5 columns, each a unique medicine product. • Transactional file: 300 million rows, roughly 90 % medicine items. My goal is to match every medicine record in the large file to its counterpart in the master list. Name spellings and outright duplicates are the main headaches. Over 99.9 % of the items are actual medicines and would map once the inconsistencies are ironed out, so I want the process to be fully automated, driven by a robust auto-correct algorithm rather than manual review. Remaining 0.1% could be non medical entries, and need to be deleted. I am open to proven techniques—fuzzy matching, phonetic hashing, Levenshtein, word embeddings, or a hybrid—as long as they scale. Python, pandas, PySpark, or any other big-data friendly stack is fine, provided the final solution is reproducible and well documented. Deliverables • Clean, executable scripts (Jupyter notebook or .py) that ingest both files, normalise product names, detect duplicates, and output a one-to-one mapping table. • A brief README explaining dependencies, algorithm logic, and how to rerun the job on new data. • Sample run on a slice of the data to prove accuracy before I hand over the entire 300 M records. Acceptance criteria 1. At least 99.9 % of medicine rows in the large dataset correctly linked to a master product code. 2. Zero false merges between distinct medicines in the master list. 3. Runtime that can finish the full job within a practical window on a modest cloud instance. If this sounds within your wheelhouse, let’s discuss approach and timeline.

Python

Реєстрація