Extract data from PDF bills and images

I have a folder full of supplier bills in PDF format and I need a clean, repeatable Python script that pulls everything of value out of them and drops it neatly into an Excel workbook.

Here is what I expect:
• The script must capture every text field that appears on each bill (invoice number, dates, vendor, totals and any other descriptors).
• It should identify and export any tabular line-item sections so that quantities, descriptions and prices land in true Excel rows and columns, not as a single block of text.
• Embedded images or logos also need to be saved out (ideally into a sub-folder) with a reference back to the originating invoice inside the Excel sheet.

Python tools such as pdfplumber, PyPDF2, camelot, tabula-py, pandas and openpyxl are all fine; choose the combination you're most comfortable with as long as the final deliverable is a .py file plus an example .xlsx that mirrors the source PDFs accurately.

Acceptance will be based on:
1. Running the script locally on a sample batch of PDFs with no manual tweaks.
2. Seeing all text and table content laid out cleanly in Excel.
3. Having each extracted image saved separately and indexed in the sheet.

If additional libraries are required, let me know the pip install commands so I can replicate the environment quickly.
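To make the expected shape of the deliverable concrete, here is a rough sketch of one possible pipeline, not a finished solution. It assumes pdfplumber for text and tables, pypdf (the maintained successor of PyPDF2) for embedded images, and pandas with openpyxl for the workbook. The regex patterns, the function names (`parse_fields`, `extract_bill`, `run`), the sheet names and the `images` sub-folder are all illustrative assumptions; it captures only three example fields, and real bills will almost certainly need vendor-specific patterns.

```python
import re
from pathlib import Path

# Hypothetical patterns for three common fields; adjust per vendor layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*(\S+)", re.I),
    "invoice_date": re.compile(
        r"(?:Invoice\s*)?Date\s*[:\-]?\s*"
        r"([0-9]{1,4}[./-][0-9]{1,2}[./-][0-9]{1,4})", re.I),
    "total": re.compile(
        r"\bTotal\b(?:\s+Due)?\s*[:\-]?\s*[^\d\n]*([\d.,]+)", re.I),
}


def parse_fields(text):
    """Return the first match for each pattern, or None when absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def extract_bill(pdf_path, image_dir):
    """Return (fields, table_rows, image_rows) for a single PDF."""
    import pdfplumber             # third-party: text and table extraction
    from pypdf import PdfReader   # third-party: embedded image extraction

    text_parts, table_rows, image_rows = [], [], []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            text_parts.append(page.extract_text() or "")
            for table_no, table in enumerate(page.extract_tables(), start=1):
                for row in table:
                    record = {"source_pdf": pdf_path.name,
                              "page": page_no, "table": table_no}
                    # One Excel column per table cell: col_1, col_2, ...
                    for i, cell in enumerate(row, start=1):
                        record[f"col_{i}"] = cell
                    table_rows.append(record)

    image_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(str(pdf_path))
    for page_no, page in enumerate(reader.pages, start=1):
        for image in page.images:
            target = image_dir / f"{pdf_path.stem}_p{page_no}_{image.name}"
            target.write_bytes(image.data)
            image_rows.append({"source_pdf": pdf_path.name,
                               "page": page_no, "image_file": str(target)})

    fields = {"source_pdf": pdf_path.name}
    fields.update(parse_fields("\n".join(text_parts)))
    return fields, table_rows, image_rows


def run(pdf_dir, out_xlsx):
    """Process every *.pdf in pdf_dir and write a three-sheet workbook."""
    import pandas as pd  # third-party; uses openpyxl to write .xlsx

    pdf_dir = Path(pdf_dir)
    all_fields, all_tables, all_images = [], [], []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        fields, tables, images = extract_bill(pdf_path, pdf_dir / "images")
        all_fields.append(fields)
        all_tables.extend(tables)
        all_images.extend(images)

    with pd.ExcelWriter(out_xlsx, engine="openpyxl") as writer:
        pd.DataFrame(all_fields).to_excel(writer, sheet_name="Fields", index=False)
        pd.DataFrame(all_tables).to_excel(writer, sheet_name="LineItems", index=False)
        pd.DataFrame(all_images).to_excel(writer, sheet_name="Images", index=False)
```

Under these assumptions the environment would be set up with `pip install pdfplumber pypdf pillow pandas openpyxl`, and a batch would be processed with something like `run("bills", "bills.xlsx")`, where `bills` is a placeholder folder name. Scanned bills without a text layer would need OCR, which this sketch does not cover.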

Python
