Quantum PDF Extraction Pipeline

Client: AI | Published: 17.12.2025

Project Title
Automated Dataset Creation: Image-to-Text Dataset for Quantum Circuit Diagrams

Project Overview
I am looking to hire a skilled freelancer to help implement an automated, reproducible Python pipeline for building a specialized dataset used in image-to-text models for quantum computing. The task focuses on extracting quantum circuit schematic images from scientific PDFs (arXiv), along with structured textual metadata. This is a research-oriented NLP + document-processing project with strict technical constraints and quality requirements. All detailed requirements, rules, and constraints are already defined in the project specification. Additional internal resources and clarifications will be shared after finalizing the freelancer.

Scope of Work (High-Level)
The freelancer will help design and/or implement parts of a pipeline that:
- Processes a fixed list of arXiv papers in strict order
- Extracts quantum circuit images only (not plots, tables, or setups)
- Saves images in PNG format
- Builds a JSON dataset with structured metadata per image
- Extracts and aligns descriptive text from PDFs
- Identifies quantum gates and associated algorithms where possible
- Produces clean, reproducible, well-documented Python code
The solution must be fully automated, generalizable, and runnable on a restricted academic environment (no external APIs except arXiv).
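To make the scope above concrete, here is a minimal sketch of how the ordered processing and the per-image JSON metadata might fit together. It is only an illustration under assumptions: the record fields, the file-naming scheme, and the `extract_circuits` callback are hypothetical and are not taken from the project specification, which defines the real schema. The actual PDF and image extraction step is deliberately left as a caller-supplied function.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class CircuitRecord:
    """One dataset entry per extracted circuit image (hypothetical field names)."""
    arxiv_id: str
    page: int
    figure_index: int
    image_file: str
    description: str = ""
    gates: list = field(default_factory=list)
    algorithm: str = ""


def image_filename(arxiv_id, page, figure_index):
    """Build a stable PNG name from the paper id and figure position (assumed scheme)."""
    safe_id = arxiv_id.replace("/", "_")
    return f"{safe_id}_p{page}_f{figure_index}.png"


def build_dataset(paper_ids, extract_circuits):
    """Walk the papers in the given order and collect one record per circuit.

    `extract_circuits` is a placeholder for the real extraction step; it is
    assumed to return an iterable of (page, figure_index, description) tuples.
    """
    records = []
    for arxiv_id in paper_ids:  # strict input order: no sorting, no deduplication
        for page, figure_index, description in extract_circuits(arxiv_id):
            records.append(
                CircuitRecord(
                    arxiv_id=arxiv_id,
                    page=page,
                    figure_index=figure_index,
                    image_file=image_filename(arxiv_id, page, figure_index),
                    description=description,
                )
            )
    return records


def to_json(records):
    """Serialize with sorted keys and fixed indentation so reruns are byte-identical."""
    return json.dumps(
        [asdict(r) for r in records], indent=2, sort_keys=True, ensure_ascii=False
    )
```

Keeping the extraction step behind a callback means the ordering and serialization logic can be checked without any PDFs on disk, which may help when validating reproducibility on a restricted machine.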
Minimum Technical Requirements (Must-Have)

Programming & Tools
- Strong Python skills
- Hands-on experience with PDF processing
  - PyMuPDF (fitz) and/or pdf2image
- Experience with text extraction from scientific PDFs
- Solid understanding of JSON and CSV data pipelines
- Clean, modular, well-documented code style (functions, docstrings, comments)

NLP & OCR
- Practical experience with OCR pipelines
  - PaddleOCR and/or Tesseract
- Ability to clean and normalize noisy OCR outputs
- Basic NLP techniques for keyword detection and sentence extraction
  - Regex, rule-based parsing, optional SpaCy usage

Reproducibility & Quality
- Ability to write deterministic, reproducible code
- Attention to dataset consistency and validation
- Comfortable working under strict academic evaluation criteria

Required Domain Knowledge
You do not need to be a quantum physicist, but you must have:
- Basic understanding of quantum circuits
  - Common gates (H, X, Y, Z, CNOT/CX, SWAP, Toffoli, etc.)
- Familiarity with quantum algorithms terminology
  - Examples: Shor’s algorithm, Grover’s algorithm, QFT, VQE, teleportation
- Ability to recognize quantum circuit diagrams visually and textually
This level of knowledge is essential to avoid false positives and poor dataset quality.
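As an illustration of the regex-based keyword detection mentioned above, the following sketch looks for gate and algorithm names in caption or body text. The vocabularies and the `find_terms` helper are assumptions made for this example; the real term lists and matching rules would come from the project specification. Matching is case-sensitive on purpose, so that an ordinary word such as "swap" is not counted as the SWAP gate.

```python
import re

# Hypothetical vocabularies: canonical name -> compiled pattern.
GATES = {
    "H": re.compile(r"\b[Hh]adamard\b|\bH\s+gates?\b"),
    "X": re.compile(r"\bPauli-X\b|\bX\s+gates?\b"),
    "CNOT": re.compile(r"\bCNOT\b|\bCX\b|\b[Cc]ontrolled-NOT\b"),
    "SWAP": re.compile(r"\bSWAP\b"),
    "Toffoli": re.compile(r"\bToffoli\b|\bCCX\b|\bCCNOT\b"),
}

ALGORITHMS = {
    "Shor": re.compile(r"\bShor['’]s\b"),
    "Grover": re.compile(r"\bGrover['’]s\b"),
    "QFT": re.compile(r"\bQFT\b|\b[Qq]uantum Fourier [Tt]ransform\b"),
    "VQE": re.compile(r"\bVQE\b"),
    "teleportation": re.compile(r"\b[Tt]eleportation\b"),
}


def find_terms(text, patterns):
    """Return the canonical names whose pattern occurs in the text.

    Whitespace is collapsed first so that line breaks introduced by PDF or OCR
    extraction do not split multi-word terms. Results follow the vocabulary
    order, which keeps the output deterministic across runs.
    """
    normalized = " ".join(text.split())
    return [name for name, pattern in patterns.items() if pattern.search(normalized)]


caption = "Figure 2: circuit for Grover’s algorithm using Hadamard and\nCNOT gates."
print(find_terms(caption, GATES))       # ['H', 'CNOT']
print(find_terms(caption, ALGORITHMS))  # ['Grover']
```

A rule-based pass like this is only a first filter. Single-letter gate names in particular are ambiguous in running text, so some form of validation against the figure itself would likely still be needed to keep false positives down.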
Nice-to-Have Skills
- Prior work with scientific document mining
- Experience building datasets for ML / NLP
- Experience working on research or university-level projects
- Familiarity with Linux / lab environments
- Experience writing short technical documentation or reports

Constraints (Very Important)
❌ No external APIs or cloud services (except arXiv PDF download)
❌ No manual hand-crafted dataset entries
✅ Fully automated and reproducible solution only
✅ Code must run on a restricted academic machine environment
✅ Focus on quality over shortcuts

Deliverables (Expected)
- Python source code (modular, documented)
- Extracted images (PNG)
- JSON dataset with required metadata
- CSV summary of processed papers
- Clear explanation of approach and assumptions (inline comments + short notes)

Project Timeline
- Short, fixed academic timeline (approximately 1 week)
- Looking for someone who can start immediately and work efficiently