I have a list of application numbers that belong to a government portal. For each number the site returns a PDF; inside that PDF I only need one piece of information: its current Application Status.

Here is the workflow I'd like automated:

1. I upload an Excel or CSV file that contains the application numbers.
2. Your script logs in (or navigates to the public search page), downloads the corresponding PDF for every number, extracts the status field reliably, and writes the result back to a new column in the same spreadsheet.
3. The tool should run headless on Windows (or purely via Requests in Python), and I'd like clear setup instructions so I can repeat the job whenever new numbers come in.

Python is my default choice because of its rich PDF-parsing libraries (PyPDF2, pdfminer.six, pdfplumber) and Requests, Selenium, or BeautifulSoup for the web interactions, but I'm open to other stacks if they speed things up and keep the output clean.

Acceptance criteria

• I hand the tool a file of application numbers and receive an .xlsx with a "Status" column populated.
• 100% coverage of the input list and correct mapping of each status.
• Graceful logging of any failures (e.g., missing PDF, changed page layout).
• Clear README or short video showing setup and execution.

If you have proven experience scraping government PDF documents and converting them into structured Excel data, you'll be able to turn this around quickly. Let me know how you plan to tackle capture, parsing, and error handling, and we can get started right away.
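To make the expected shape of the tool concrete, here is a rough Python sketch of the pipeline I have in mind. The portal URL pattern and the "Application Status:" label are placeholders I made up; the real ones would need to be confirmed against a sample PDF before building anything.

```python
"""Sketch of the fetch-parse-write pipeline.

Assumptions (to be confirmed against the real portal):
  * each PDF is served at a predictable URL (the pattern below is a placeholder);
  * the extracted PDF text contains a line like "Application Status: Approved".
"""
import io
import logging
import re

log = logging.getLogger("status-scraper")

# Placeholder label -- adjust once a sample PDF is in hand.
STATUS_RE = re.compile(r"Application\s+Status\s*[:\-]?\s*(.+)", re.IGNORECASE)


def extract_status(pdf_text):
    """Return the status value from raw PDF text, or None if the layout changed."""
    match = STATUS_RE.search(pdf_text)
    return match.group(1).strip() if match else None


def fetch_pdf(session, app_no):
    """Download one PDF. The URL pattern is a placeholder, not the real endpoint."""
    resp = session.get(
        f"https://portal.example.gov/application/{app_no}.pdf", timeout=30
    )
    resp.raise_for_status()
    return resp.content


def run(input_csv, output_xlsx):
    """Read numbers, fetch + parse each PDF, write an .xlsx with a Status column."""
    # Third-party imports kept local so the pure helpers above stay testable:
    # pip install requests pdfplumber pandas openpyxl
    import pandas as pd
    import pdfplumber
    import requests

    df = pd.read_csv(input_csv, dtype=str)
    statuses = []
    with requests.Session() as session:
        for app_no in df["application_number"]:
            try:
                with pdfplumber.open(io.BytesIO(fetch_pdf(session, app_no))) as pdf:
                    text = "\n".join(p.extract_text() or "" for p in pdf.pages)
                # Distinct sentinel values keep failures visible in the output:
                statuses.append(extract_status(text) or "PARSE_FAILED")
            except Exception as exc:  # missing PDF, network error, bad file
                log.warning("%s: %s", app_no, exc)
                statuses.append("FETCH_FAILED")
    df["Status"] = statuses
    df.to_excel(output_xlsx, index=False)
```

The point of splitting `extract_status` out as a pure function is that it can be unit-tested against saved sample text without any network or PDF machinery, which is where most of the "changed page layout" breakage would show up.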