I want to scrape all legal text from EUR-Lex (advanced search form: https://eur-lex.europa.eu/advanced-search-form.html). The corpus is roughly one million documents. For every document, please:

• Download the official PDF in its original layout.
• Extract the full plain text in BOTH English (EN) and German (DE).
• Generate a companion JSON file per document containing:
  – document_id (as it appears in the URL),
  – year,
  – raw_text_en,
  – raw_text_de,
  – pdf_file_name, set exactly to "<document_id>.pdf".

The folder structure can be simple: one directory holding each PDF and its matching JSON under identical IDs. Accuracy matters because the data feeds directly into a research pipeline, so please handle character encoding and long parliamentary tables correctly. A lightweight Python solution using requests/BeautifulSoup or Scrapy is perfect; headless Selenium is fine if needed for dynamic pages. All code, a brief README, and a requirements.txt must be included so I can reproduce the run locally.

Once the script finishes, send a ZIP containing:
1. The source code.
2. The PDFs.
3. All JSON files.

No ongoing schedule is required; this is a one-off extraction.
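For concreteness, here is a minimal sketch of the per-document step. It assumes EUR-Lex's legal-content URL pattern (https://eur-lex.europa.eu/legal-content/{lang}/TXT/PDF/?uri=CELEX:{id} for PDFs and .../TXT/HTML/... for the text rendering) and uses the CELEX number as document_id; both assumptions should be verified against the live site before a full run, and the helper names are illustrative, not part of the brief.

```python
import json
import pathlib

import requests
from bs4 import BeautifulSoup

# Assumed EUR-Lex content URL patterns; confirm against the live site first.
PDF_URL = "https://eur-lex.europa.eu/legal-content/{lang}/TXT/PDF/?uri=CELEX:{celex}"
HTML_URL = "https://eur-lex.europa.eu/legal-content/{lang}/TXT/HTML/?uri=CELEX:{celex}"


def fetch_pdf(celex: str, out_dir: pathlib.Path) -> pathlib.Path:
    """Download the official PDF layout and save it as <document_id>.pdf."""
    resp = requests.get(PDF_URL.format(lang="EN", celex=celex), timeout=60)
    resp.raise_for_status()
    path = out_dir / f"{celex}.pdf"
    path.write_bytes(resp.content)
    return path


def fetch_plain_text(celex: str, lang: str) -> str:
    """Fetch the HTML rendering and strip markup.

    requests decodes the body using the charset declared by the server,
    which covers the character-encoding concern in the brief.
    """
    resp = requests.get(HTML_URL.format(lang=lang, celex=celex), timeout=60)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # separator="\n" keeps each table cell on its own line, so long
    # parliamentary tables are not collapsed into a single run of text.
    return soup.get_text(separator="\n", strip=True)


def build_record(document_id: str, year: int, text_en: str, text_de: str) -> dict:
    """Companion JSON record using the exact field names from the brief."""
    return {
        "document_id": document_id,
        "year": year,
        "raw_text_en": text_en,
        "raw_text_de": text_de,
        "pdf_file_name": f"{document_id}.pdf",
    }


def scrape_document(celex: str, year: int, out_dir: pathlib.Path) -> None:
    """One PDF plus one JSON per document, identical IDs, one flat directory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    fetch_pdf(celex, out_dir)
    record = build_record(
        celex, year, fetch_plain_text(celex, "EN"), fetch_plain_text(celex, "DE")
    )
    # ensure_ascii=False keeps umlauts etc. as real UTF-8 characters.
    (out_dir / f"{celex}.json").write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

At a million documents, the real run would also need rate limiting, retries, and resume-on-crash; the sketch only shows the data flow and the output contract.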