PDF Table Extraction & Charting on website

I have a collection of mixed-content PDFs: some pages are pure scans, the rest contain searchable text. Each file holds key tables I need converted into clean, structured data and then visualised as clear bar charts for quick comparison.

You will:
• Isolate every table across the PDFs, using OCR where the page is only an image.
• Clean and normalise the numbers so columns and units stay consistent.
• Produce bar charts that faithfully reflect the extracted figures (one chart per table unless otherwise noted).

Deliverables
1. A single, tidy dataset (CSV or Excel) with each table clearly identified.
2. High-resolution bar chart images (PNG or SVG) and the source file (Excel, Power BI, or Python notebook) so I can regenerate them later.
3. A short note outlining any assumptions, edge cases handled, and pages that required manual correction.

Acceptance Criteria
• Every numeric value found in the original tables appears in the dataset, spot-checked against the PDF for accuracy.
• Charts label axes, units, and categories clearly, with no truncated text.
• No data lost due to OCR errors; if a cell cannot be resolved it is flagged in the note.

Feel free to choose the tools you're most comfortable with: Python (Camelot, Tabula-py, Pandas, Tesseract), R, or Excel macros are all fine as long as the final files meet the criteria above.
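To illustrate what "clean and normalise" and "flagged in the note" could mean in practice, here is a minimal Python sketch of the cell-cleaning step. It is only one possible approach, not a required implementation. The function name `clean_cell`, the `OCR_FIXES` look-alike mapping, and the rule that a comma is a thousands separator are all assumptions made for the example; the real PDFs may need different rules.

```python
import re
from typing import Optional, Tuple

# Hypothetical mapping of common OCR look-alikes to digits (an assumption,
# not something taken from the PDFs themselves).
OCR_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

NUMBER_RE = re.compile(r"^-?\d+(\.\d+)?$")


def clean_cell(raw: str) -> Tuple[Optional[float], bool]:
    """Return (value, flagged); flagged=True means the cell needs manual review."""
    text = raw.strip()
    if not text:
        return None, True

    # Accounting-style negatives: "(1,234)" is read as -1234.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]

    # Drop whitespace, currency symbols and percent signs.
    # The unit itself should be recorded in a separate column.
    text = re.sub(r"[\s$€£%]", "", text)

    # Only repair look-alike characters when the cell already contains a digit,
    # so that plain words are not silently turned into numbers.
    if any(ch.isdigit() for ch in text):
        text = text.translate(OCR_FIXES)

    # Assumption: a comma is a thousands separator, a dot is the decimal point.
    text = text.replace(",", "")

    if not NUMBER_RE.match(text):
        return None, True  # unresolved: list this cell in the note

    value = float(text)
    return (-value if negative else value), False


if __name__ == "__main__":
    for sample in ["1,234.50", "(1,2O0)", " 12% ", "n/a"]:
        print(repr(sample), "->", clean_cell(sample))
```

Cells that come back flagged would go into the note for manual checking against the PDF rather than being dropped or guessed. Cells repaired through the look-alike mapping are good candidates for the spot check.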

Python
