Tamil Voter List OCR Optimization

Customer: AI | Published: 02.03.2026

I already have a working Python-based OCR pipeline that converts Tamil voter-list PDFs into Excel and then pushes the sheets to S3 for further processing. The PDFs are purely image-based. When I run the job in parallel on AWS today, the script sometimes skips entire voter entries and often mangles door numbers and other fields. I need both problems eliminated and the whole flow hardened so it can run unattended across hundreds of constituency files. Minimising the AWS cost of extraction is also required.

Your task is to review and refactor the existing code, tune the Tamil OCR (Tesseract, AWS Textract, or any library you find more accurate), fix the parsing logic, and make sure parallel execution on my current ECS setup completes without a single missed record. Once fixed, you will execute at scale, monitor the run, and hand back clean, fully populated Excel files.

Deliverables
• Revised and documented OCR/parse code
• One-click AWS deployment (Docker image + task definition)
• Successful full-dataset run with zero skipped voters and correct door numbers (mixed letter-number cases handled)
• Brief monitoring log and accuracy report for sign-off

The extraction must cost no more than 0.003 USD per PDF of 800 voters on the AWS bill. This is the maximum allowed budget; within it, the extraction must be done with zero skipped voters and 99% accuracy. The full set of 75,000 PDFs must be extracted within 72 hours at most.
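To make the sign-off criteria concrete, here is a minimal sketch of how the figures in this brief could be turned into run-level targets and a completeness check. It is only an illustration, not part of the existing pipeline. The function names (run_targets, find_skipped, passes_sign_off) are hypothetical. It also assumes that every voter entry in a PDF carries a sequential serial number starting at 1, and that accuracy is measured separately against a hand-checked sample; neither is stated above.

```python
# Targets quoted in the brief.
MAX_COST_PER_PDF_USD = 0.003
VOTERS_PER_PDF = 800
TOTAL_PDFS = 75_000
DEADLINE_HOURS = 72
MIN_ACCURACY = 0.99


def run_targets():
    """Derive run-level targets from the per-PDF figures in the brief."""
    return {
        "total_budget_usd": MAX_COST_PER_PDF_USD * TOTAL_PDFS,
        "min_pdfs_per_hour": TOTAL_PDFS / DEADLINE_HOURS,
        "total_voters": VOTERS_PER_PDF * TOTAL_PDFS,
    }


def find_skipped(serials, expected_count):
    """Return (missing, duplicated) serial numbers for one PDF.

    ASSUMPTION: entries are numbered 1..expected_count in the source PDF.
    """
    seen = set()
    duplicated = set()
    for serial in serials:
        if serial in seen:
            duplicated.add(serial)
        seen.add(serial)
    missing = sorted(set(range(1, expected_count + 1)) - seen)
    return missing, sorted(duplicated)


def passes_sign_off(cost_usd, pdfs_done, hours_elapsed, skipped_voters, accuracy):
    """Check one finished run against every limit stated in the brief.

    ASSUMPTION: `accuracy` is a fraction (0..1) measured on a hand-checked sample.
    """
    return (
        pdfs_done >= TOTAL_PDFS
        and cost_usd <= MAX_COST_PER_PDF_USD * pdfs_done
        and hours_elapsed <= DEADLINE_HOURS
        and skipped_voters == 0
        and accuracy >= MIN_ACCURACY
    )


if __name__ == "__main__":
    targets = run_targets()
    print(f"Budget ceiling: {targets['total_budget_usd']:.2f} USD")
    print(f"Required throughput: {targets['min_pdfs_per_hour']:.0f} PDFs/hour")
    # Serial 3 was never parsed and serial 2 was parsed twice.
    print(find_skipped([1, 2, 2, 4, 5], expected_count=5))
```

Read this way, the limits work out to a total AWS budget of about 225 USD and a sustained rate of roughly 1,042 PDFs per hour. Running a per-PDF check such as find_skipped before each Excel file is uploaded to S3 would let a parallel worker flag and retry an incomplete file instead of silently dropping entries.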