Playbill PDF Parser & Theater Database (Phase 1 – Core System)

Budget: $250

⚠️ IMPORTANT – READ BEFORE APPLYING

This is NOT a typical website or WordPress project. This project is about accurately parsing inconsistent PDF documents and extracting structured data. If you have not built a PDF parser where text position, columns, or layout matters, do not apply.

Project Overview

We are building a foundational system that:
- Parses Playbill PDF files
- Extracts structured theatrical data
- Stores it in a MySQL database
- Allows admin review and correction

This system will later power a public platform (IMDb-style for theater), but Phase 1 is strictly backend + admin tools.

Phase 1 Scope (Must Be Completed Fully)

1️⃣ PDF Upload & Parsing
- Upload Playbill PDFs
- Parse without using AI / ChatGPT / OCR
- Use deterministic logic (regex + layout analysis)

2️⃣ Required Data Extraction
From each PDF, extract:
- Show name
- Theater name
- Cast list → Actor → Character mapping
- Crew credits (director, writer, designers, etc.)
Handle:
- Two-column layouts
- Sectioned layouts (Cast, Ensemble, Production Team)
- Inline credits and inconsistent formatting
- Duo credits (e.g. “Book by A & B” → stored individually but marked as shared)

3️⃣ Parsing Rules
Must support:
- Regex rules
- Visual position parsing (x/y coordinates)
- Fallback logic when a line cannot be confidently parsed
- Flagging uncertain lines as “Needs Review”
Suggested tools (not mandatory):
- pdfplumber
- pdfminer
- PyMuPDF
- Similar layout-aware PDF tools

4️⃣ Admin Dashboard
- Review parsed data
- Edit any field manually
- Approve or correct flagged lines
- Manually add YouTube video links

5️⃣ Database
- MySQL schema with proper relationships
- Video table:
  - video_url
  - show_id (required)
  - theater_id (optional)
  - year (optional)

❌ What This Project Is NOT
❌ Not WordPress
❌ Not UI-heavy
❌ Not AI-based parsing
❌ Not OCR
❌ Not a “quick script”

This is a precision engineering task.
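To make the parsing rules in sections 2️⃣ and 3️⃣ concrete, here is a minimal Python sketch of the kind of deterministic logic we have in mind. It is an illustration under stated assumptions, not a specification. The names `parse_credit`, `split_columns`, `ParsedLine`, and both regex patterns are hypothetical. The word dictionaries are assumed to carry `text`, `x0`, and `top` keys, similar to what a layout-aware tool such as pdfplumber reports for each word. Splitting columns at the page midpoint is a simplifying assumption; real Playbills will likely need gutter detection.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical pattern for inline credits: "<Role> by <Name> [& <Name> ...]"
CREDIT_RE = re.compile(r"^(?P<role>[A-Z][A-Za-z ]+?)\s+by\s+(?P<names>.+)$")
# Split shared credits on "&" or the standalone word "and"
NAME_SPLIT_RE = re.compile(r"\s*(?:&|\band\b)\s*")


@dataclass
class ParsedLine:
    raw: str
    role: Optional[str] = None
    people: List[str] = field(default_factory=list)
    shared: bool = False        # True when one credit names several people
    needs_review: bool = False  # True when no rule matched the line


def parse_credit(line: str) -> ParsedLine:
    """Apply the regex rule; fall back to a 'Needs Review' flag on no match."""
    match = CREDIT_RE.match(line.strip())
    if not match:
        return ParsedLine(raw=line, needs_review=True)
    people = [n.strip() for n in NAME_SPLIT_RE.split(match.group("names")) if n.strip()]
    return ParsedLine(raw=line, role=match.group("role"),
                      people=people, shared=len(people) > 1)


def split_columns(words: List[dict], page_width: float) -> List[str]:
    """Rebuild text lines for a two-column page: left column first, then right.

    Words are assigned to a column by x0 relative to the page midpoint
    (an assumption), and grouped into lines by rounded `top` value,
    which is a crude stand-in for proper baseline clustering.
    """
    midpoint = page_width / 2
    columns: Dict[str, Dict[int, List[dict]]] = {"left": {}, "right": {}}
    for word in words:
        side = "left" if word["x0"] < midpoint else "right"
        columns[side].setdefault(round(word["top"]), []).append(word)

    lines = []
    for side in ("left", "right"):
        for top in sorted(columns[side]):
            row = sorted(columns[side][top], key=lambda w: w["x0"])
            lines.append(" ".join(w["text"] for w in row))
    return lines
```

With this sketch, `parse_credit("Book by Alice Smith & Bob Jones")` yields two people stored individually with `shared` set, while a line that matches no rule comes back with `needs_review` set so it can surface in the admin dashboard.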
Required Experience (Strict)

You must have:
- Proven experience parsing PDFs with layout awareness
- Experience handling messy, inconsistent document structures
- Backend experience (Python, PHP, or Node.js)
- MySQL database design experience

Screening Question (Mandatory)

Your proposal must answer this clearly: Describe a PDF parsing project you built where text position or layout mattered. What tools did you use (e.g., pdfplumber, pdfminer, PyMuPDF)? What kind of PDF was it?

Proposals without this answer will be rejected immediately.

Deliverables
- Working PDF parser
- MySQL database with extracted data
- Admin review interface
- Clean, documented code

Future Phases (Not Included Now)
- Public actor/show/theater pages
- Joomla integration
- AI-assisted parsing improvements
- Analytics & engagement tracking

Budget & Timeline
- Fixed price or hourly (open to discussion)
- Quality and correctness matter more than speed

Final Note

If you enjoy hard parsing problems and building systems that must be correct, this project is for you. If you are a generalist web developer, this project is not a fit.

Python

