Law360 PDF Data Extraction

Budget: $750

The goal is to turn a collection of Law360 (LexisNexis) PDF articles into a clean, tabular dataset that I can open in Excel or any CSV-compatible tool.

From each PDF I need the following fields captured:
• News date
• Filing date
• Court
• Plaintiff (own column)
• Defendant (own column)

Accuracy matters: plaintiff and defendant names must sit in separate columns, exactly as listed above. Use any reliable text-parsing approach (Python with pdfminer, PyPDF2, Tika, regex, or an NLP library), so long as the script handles typical Law360 layouts and can be rerun on future batches.

Please return:
1. The compiled .csv or .xlsx file.
2. The extraction script with brief instructions so I can reproduce or extend the process.
3. A short report of any PDFs that failed to parse or produced incomplete rows.

Acceptance criteria: every supplied PDF is processed; the resulting spreadsheet has the five columns listed above with correct values; and the code runs without manual tweaks beyond path changes.

If you have prior experience scraping legal publications or working with semi-structured PDFs, that will help you move quickly, but it is not required: the deliverable quality is what matters.
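To make the expected shape of the script concrete, here is a minimal sketch of the kind of pipeline described above. It is only an illustration under stated assumptions, not a required design: the regular expressions assume a dateline such as "Law360 (March 3, 2023, ...)", a sentence containing "filed on <date>", and a closing line of the form "The case is X v. Y, case number ..., in the U.S. ... Court ...". Those patterns, the function names, the default output file names, and the extra `source_file` traceability column are all hypothetical and must be checked against the actual PDFs. Text extraction uses `extract_text` from pdfminer.six.

```python
import csv
import re
from pathlib import Path

# "source_file" is an added traceability column (an assumption, not a requested field).
FIELDS = ["source_file", "news_date", "filing_date", "court", "plaintiff", "defendant"]

# Hypothetical patterns: verify each one against real Law360 layouts.
DATE = r"([A-Z][a-z]+\.? \d{1,2}, \d{4})"
PATTERNS = {
    "news_date": re.compile(r"Law360\s*\(" + DATE),
    "filing_date": re.compile(r"[Ff]iled (?:on )?" + DATE),
    "court": re.compile(r"in the (U\.S\. [^.]*?Court[^.]*)"),
}
CASE_CAPTION = re.compile(r"The case is (.+?) v\. (.+?), case number", re.S)


def clean(value):
    """Collapse line breaks and repeated spaces left over from PDF extraction."""
    return " ".join(value.split())


def parse_article(text):
    """Return the five requested fields; a field that is not found stays empty."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = clean(match.group(1)) if match else ""
    caption = CASE_CAPTION.search(text)
    row["plaintiff"] = clean(caption.group(1)) if caption else ""
    row["defendant"] = clean(caption.group(2)) if caption else ""
    return row


def pdf_to_text(path):
    # Imported lazily so the parsing logic can be tested without pdfminer.six.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    return extract_text(str(path))


def run_batch(pdf_dir, out_csv="law360_cases.csv", report_path="parse_report.txt"):
    """Process every PDF in pdf_dir, write the CSV, and log failed or incomplete files."""
    rows, problems = [], []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            row = parse_article(pdf_to_text(pdf))
        except Exception as exc:  # unreadable, encrypted, or corrupt PDF
            problems.append(f"{pdf.name}: failed to parse ({exc})")
            continue
        missing = [key for key, value in row.items() if not value]
        if missing:
            problems.append(f"{pdf.name}: missing {', '.join(missing)}")
        rows.append({"source_file": pdf.name, **row})
    # utf-8-sig adds a BOM so Excel detects the encoding when opening the CSV.
    with open(out_csv, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    Path(report_path).write_text("\n".join(problems), encoding="utf-8")
    return rows, problems
```

Under these assumptions, a new batch would be processed by calling `run_batch("path/to/pdfs")`, so the folder path is the only thing that changes between runs. Rows with empty fields are still written to the CSV and also listed in the report, which covers the "incomplete rows" part of the third deliverable.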

Python
