Grocery Ad Data Extraction from Image-Based PDFs (Not Text-Based PDFs)

Customer: AI | Published: 01.02.2026
Budget: $30

I need a small, self-contained application that takes one or more image-based PDF files of weekly grocery sale ads, runs OCR, finds every sale item, and exports a clean, standard CSV. It should be able to load several PDFs and process them in batches (I have approximately 250 to do, each 5-30 MB in size).

Each CSV output file should have the same name as its corresponding PDF, and every run should also write to a "master" CSV file. So if I process 5 PDFs, there will be 6 CSV files: 5 individual CSVs, each named after its PDF, and the master CSV containing the data from all 5 individual CSVs.

Each CSV file should contain the following columns: "Store Number", "Start Date", "Item Name", "Item Description", "Sale Type", "Sale Price", "Savings Amount", plus an additional column at the end, "Sale Price Per Unit". "Start Date" should be the first day the prices are valid. If the ad is valid from 01/01/2026 to 01/07/2026, the start date is 01/01/2026.

"Sale Price Per Unit" should normally be the sale price of the item. For a buy 1 get 1 free offer, assume the "Save Up To" amount is the regular price of each item: a buy 1 get 1 free, save up to $3 offer means the normal price is $3 each, but you are getting 2 for $3, so the per-unit price is $1.50. For a buy 2 get 1 free offer where the price is $3 each, the per-unit price is $2.00, i.e. ($3 + $3) / 3. If a sale item doesn't fit any of the above, for example buy Product X and receive a free Product Y, the sale type should be labeled "custom".

The PDFs contain little or no embedded text, so the workflow has to start with reliable OCR. Tesseract, PaddleOCR, AWS Textract, or another engine you trust is fine as long as the accuracy is high. The ads come in different layouts, so the logic that pairs text regions with the right price blocks needs to be flexible (OpenCV or similar image-analysis libraries will probably help). I will supply several sample PDFs that reflect the typical variety.
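To make the "Sale Price Per Unit" rule concrete, here is a minimal Python sketch of the arithmetic described above. The function name, the sale-type labels ("price", "buy_x_get_y_free", "custom"), and the choice to leave the per-unit value empty for "custom" sales are assumptions for illustration only; the final tool may represent these differently.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

CENT = Decimal("0.01")


def sale_price_per_unit(sale_type: str, unit_price: str,
                        buy_qty: int = 1, free_qty: int = 0) -> Optional[Decimal]:
    """Return the effective price per unit, rounded to cents.

    sale_type labels are hypothetical:
      "price"            -> plain sale price, returned as-is
      "buy_x_get_y_free" -> pay for buy_qty units, receive buy_qty + free_qty
      "custom"           -> no per-unit price can be derived (returns None)
    For buy-one-get-one offers, unit_price is the "Save Up To" amount,
    which is assumed to be the regular price of one item.
    """
    if sale_type == "custom":
        return None
    price = Decimal(unit_price)
    if sale_type == "buy_x_get_y_free":
        total_units = buy_qty + free_qty
        per_unit = price * buy_qty / total_units
    else:
        per_unit = price
    return per_unit.quantize(CENT, rounding=ROUND_HALF_UP)


# Buy 1 get 1 free, save up to $3: 2 items for $3.
bogo = sale_price_per_unit("buy_x_get_y_free", "3.00", buy_qty=1, free_qty=1)
# Buy 2 get 1 free at $3 each: ($3 + $3) / 3.
b2g1 = sale_price_per_unit("buy_x_get_y_free", "3.00", buy_qty=2, free_qty=1)
print(bogo, b2g1)  # prints: 1.50 2.00
```

Decimal is used here instead of float so that prices round to whole cents predictably when written to the CSV.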
Deliverables
• Fully working source code and any helper scripts
• A brief README with setup steps and command-line usage
• A sample run that produces the requested CSV in standard comma-separated format

Acceptance criteria
When I run the tool on the provided samples, the output must list every visible item, with at least 95% field-level accuracy and no missing rows.

Feel free to build in Python, Java, or C#, whichever lets you meet the accuracy target quickly and keeps dependencies easy to install.

Attached are 3 of the files.
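One possible way to check the 95% field-level accuracy target is sketched below in Python. It assumes that "field-level accuracy" means the share of individual cells that match a hand-verified CSV for the same PDF, and that both files list items in the same order; neither assumption is fixed by this posting, and the function name is hypothetical.

```python
from typing import Dict, List

COLUMNS = [
    "Store Number", "Start Date", "Item Name", "Item Description",
    "Sale Type", "Sale Price", "Savings Amount", "Sale Price Per Unit",
]


def field_accuracy(expected: List[Dict[str, str]],
                   actual: List[Dict[str, str]]) -> float:
    """Fraction of cells in `actual` that match `expected`, row by row.

    Raises ValueError when the row counts differ, since missing rows
    fail the acceptance criteria regardless of cell accuracy.
    """
    if len(expected) != len(actual):
        raise ValueError(
            f"row count mismatch: expected {len(expected)}, got {len(actual)}"
        )
    total = len(expected) * len(COLUMNS)
    if total == 0:
        return 1.0
    # Compare each cell after trimming surrounding whitespace.
    matches = sum(
        exp.get(col, "").strip() == act.get(col, "").strip()
        for exp, act in zip(expected, actual)
        for col in COLUMNS
    )
    return matches / total
```

The row dictionaries can be loaded with `csv.DictReader` from the verified CSV and the tool's output CSV; a result of 0.95 or higher with no ValueError would meet the criteria as interpreted here.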