Automated Multi-Format Book Converter

I’m building a fully automated publishing pipeline that turns raw manuscript files into polished, publication-ready books. The system must ingest HTML, Markdown, and plain TXT, detect any structural metadata already present, then export perfectly styled EPUB by default, with optional DOCX and press-quality PDF versions generated in the same run.

Advanced styling is essential: the converter should apply layout templates that control typography, front-matter placement, page geometry, and embedded media rules. I need the flexibility to swap or extend these templates later without rewriting the core pipeline, so clean separation between content conversion and styling logic is critical.

I’m open to the tooling you prefer (Pandoc, Calibre, PrinceXML, custom Python or Node transformers, containerised micro-services, or a blend), as long as the finished workflow scales easily on a build server and can be triggered via CLI or REST.

Deliverables
• Source code and build scripts for the complete conversion pipeline
• At least two example templates demonstrating advanced styling features
• Documentation covering installation, configuration, and how to add new formats or templates
• A short test suite proving that all three input types successfully produce valid EPUB, DOCX, and PDF outputs

If this sounds like your kind of challenge, let’s talk timelines and the best technological path forward.

PROJECT TITLE

AI-Based eBook Creation & Conversion System (OCR + EPUB + AI Processing)

---

1. PROJECT OVERVIEW

We are developing a scalable automated publishing system that converts multiple input formats into publication-ready EPUB books and optionally print-ready formats (DOCX/PDF).

The system will:
* Process files in batch
* Maintain formatting (including tables, figures, equations)
* Use AI for content generation and rewriting
* Automatically generate book structure (Title Page, Preface, etc.)

---

2. PROJECT OBJECTIVE

To build a modular, scalable, and configurable system that:

1. Converts: Scanned files (OCR), PDF, HTML, Word (DOCX) → EPUB
2. Converts: EPUB → Word/PDF (print-ready)
3. Automatically generates:
   * Title Page
   * Copyright Page
   * Preface
   * Acknowledgement
   * Table of Contents (for print output)

---

3. INPUT TYPES

A. Scanned Files
* OCR required
* Output must be editable and structured
* Formatting must be preserved as much as possible

---

B. PDF Files
* Detect: Scanned vs digital
* Maintain:
  * Headings
  * Tables
  * Layout

---

C. HTML Files
* Direct conversion to EPUB
* Preserve formatting

---

D. Word Files (DOCX)
* Convert to EPUB with formatting intact

---

E. EPUB Files
* Convert to Word (print-ready)
* Generate TOC and optional Index

---

4. CORE FEATURES (MVP SCOPE)

4.1 Batch Processing
* Upload multiple files
* Process via queue system

---

4.2 Excel-Based Metadata Input
* System must read Excel file
* Must support:
  * Dynamic column mapping (NO hardcoding)
  * Missing field handling

---

4.3 AI-Generated Content

System must generate:
* Book Title (based on article titles)
* Preface
* Acknowledgement

---

4.4 AI Rewriting Feature
* Expand or reduce content: ±10%, 25%, 40%, 60%, 80%, 100%
* Must:
  * Preserve structure
  * Avoid plagiarism
  * Not modify equations/tables layout

---

4.5 Table Formatting (MANDATORY)

All tables must:
* Have grid borders
* Use hairline thickness (~0.25 pt)

Must work in:
* EPUB
* Word
* PDF

---

4.6 Book Structure Generation

Final EPUB must include:
1. Title Page (AI-generated title)
2. Copyright Page (template-based)
3. Preface (AI-generated)
4. Acknowledgement (AI-generated)
5. Table of Contents
6. Chapters (articles)

---

5. IMPORTANT CONTENT RULES

Author Names
* Only names allowed
* NO:
  * Designations
  * Institutions
  * Affiliations

---

Copyright Page
* Template will be provided
* System must replace variables:
  * ISBN
  * eISBN
  * Year
  * Publisher Name
  * Address
  * Email

---

Title Page
* Title → AI-generated
* Editor/Author Name → provided via Excel

---

6. TECHNICAL REQUIREMENTS

Preferred Stack
* Backend: Python (FastAPI preferred)
* OCR: Tesseract
* Conversion: Pandoc, Calibre

---

Architecture (MANDATORY)

The system MUST be:

1. Modular
   Separate components:
   * OCR
   * Conversion
   * AI processing
   * Output generation

---

2. Config-Driven
   No hardcoding of:
   * Excel columns
   * Templates
   * Prompts

---

3. Scalable
   Must support:
   * Batch processing
   * Future API integration
   * Multi-user expansion

---

4. Replaceable Components
   * OCR engine should be replaceable
   * AI provider should be replaceable

---

7. UI REQUIREMENTS (BASIC)

Simple interface:
* Upload files
* Upload Excel
* Select options:
  * Rewrite %
  * Generate content (yes/no)
* Download output

---

8. OUTPUT REQUIREMENTS

EPUB
* Clean structure
* Compatible with major readers

---

Word (DOCX)
* Print-ready
* Includes:
  * TOC
  * Proper formatting

---

9. ERROR HANDLING

System must:
* Skip problematic files (log errors)
* Continue batch processing
* Provide error report

---

10. PERFORMANCE REQUIREMENT

* Must handle: Minimum 50–100 files per batch
* Should not crash on large files

---

11. DELIVERABLES

Developer must provide:
1. Working application
2. Source code (fully commented)
3. Documentation:
   * Setup instructions
   * Config guide
4. Sample outputs

---

12. MANDATORY DEVELOPMENT CONDITIONS (VERY IMPORTANT)

The developer MUST:
* NOT hardcode:
  * Excel structure
  * Templates
  * Prompts
* Build system so that:
  * Fields can be changed without code edits
  * Prompts can be modified easily
  * Templates can be replaced

---

Code Requirements:
* Clean and readable
* Modular
* Future scalable

---

13. PROJECT PHASES

Phase 1 (MVP)
* OCR + Conversion + Basic AI
* EPUB output

---

Phase 2 (Later)
* Advanced indexing
* UI improvements
* Multi-language support

---

14. TIMELINE

MVP: 4–6 weeks

---

15. BUDGET

* Open to proposals (cost-effective preferred)
* Milestone-based payment

---

16. APPLICATION REQUIREMENTS

Please include:
1. Relevant experience (OCR / EPUB / document processing)
2. Tools you will use
3. Timeline
4. Cost breakdown
5. Sample work (MANDATORY)

---

17. SELECTION PROCESS

* Shortlisting
* Paid test task
* Final selection

---

18. IMPORTANT NOTE

We are looking for a long-term developer. This project will expand significantly.

---
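
To illustrate what section 4.2 means by dynamic column mapping and missing field handling, here is a minimal sketch of one possible approach. It assumes the Excel sheet has already been read into a header row plus data rows (for example with a library such as openpyxl); the field names, aliases, and defaults in `COLUMN_CONFIG` are hypothetical and would normally live in a config file rather than in code.

```python
from typing import Any

# Hypothetical config: canonical field -> accepted header aliases and a default.
# In a real deployment this would be loaded from YAML/JSON, not defined in code.
COLUMN_CONFIG: dict[str, dict[str, Any]] = {
    "author": {"aliases": ["author", "author name", "editor"], "default": ""},
    "isbn": {"aliases": ["isbn", "isbn-13"], "default": None},
    "year": {"aliases": ["year", "publication year"], "default": None},
}


def _norm(header: Any) -> str:
    """Normalise a header cell so 'Author Name ' matches 'author name'."""
    return str(header or "").strip().lower()


def build_column_index(headers: list[Any], config: dict[str, dict[str, Any]]) -> dict[str, int]:
    """Map each canonical field to the position of the first matching header."""
    positions: dict[str, int] = {}
    for i, header in enumerate(headers):
        positions.setdefault(_norm(header), i)
    index: dict[str, int] = {}
    for field_name, spec in config.items():
        for alias in spec["aliases"]:
            if _norm(alias) in positions:
                index[field_name] = positions[_norm(alias)]
                break
    return index


def map_rows(headers: list[Any], rows: list[list[Any]],
             config: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one dict per row; absent columns or empty cells fall back to the default."""
    index = build_column_index(headers, config)
    records = []
    for row in rows:
        record = {}
        for field_name, spec in config.items():
            pos = index.get(field_name)
            value = row[pos] if pos is not None and pos < len(row) else None
            record[field_name] = spec["default"] if value in (None, "") else value
        records.append(record)
    return records
```

With this shape, renaming or adding a spreadsheet column only requires editing the alias list, not the code.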
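
The copyright page rule in section 5 (a provided template whose variables the system replaces) could be sketched with Python's standard `string.Template`. The template text and the placeholder names below are stand-ins chosen for illustration, since the real template has not been supplied yet; only the list of variables comes from the brief.

```python
from string import Template

# Placeholder template text; the real copyright template is to be supplied later.
COPYRIGHT_TEMPLATE = Template(
    "Copyright © $year $publisher_name\n"
    "ISBN: $isbn\n"
    "eISBN: $eisbn\n"
    "$address\n"
    "$email\n"
)

# Variables named in section 5 (identifier spellings are an assumption).
REQUIRED_FIELDS = ("isbn", "eisbn", "year", "publisher_name", "address", "email")


def render_copyright_page(template: Template, values: dict[str, object]) -> tuple[str, list[str]]:
    """Fill the template and report which required variables were missing."""
    missing = [name for name in REQUIRED_FIELDS if not values.get(name)]
    # Drop empty values so their placeholders stay visible in the output.
    present = {key: value for key, value in values.items() if value}
    # safe_substitute leaves unknown placeholders (e.g. "$eisbn") in place
    # instead of raising, so one incomplete row does not abort a batch.
    return template.safe_substitute(present), missing
```

Returning the list of missing variables alongside the text lets the caller add them to the batch error report instead of silently shipping an incomplete page.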
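
For the replaceable components requirement in section 6 (OCR engine and AI provider), one common pattern is an interface plus a name-based registry, so the engine is chosen in configuration. The sketch below is only an illustration of that pattern: `OcrEngine`, `register_ocr_engine`, `DummyOcr`, and the `ocr_engine` config key are all hypothetical names, and no real OCR library is called.

```python
from typing import Callable, Protocol


class OcrEngine(Protocol):
    """Interface every OCR backend is assumed to implement."""

    def extract_text(self, image_path: str) -> str: ...


# Registry of engine factories keyed by the name used in the config file.
_OCR_ENGINES: dict[str, Callable[[], OcrEngine]] = {}


def register_ocr_engine(name: str):
    """Class decorator that makes an engine selectable by name."""
    def decorator(factory):
        _OCR_ENGINES[name] = factory
        return factory
    return decorator


@register_ocr_engine("dummy")
class DummyOcr:
    """Stand-in engine for tests; a Tesseract-backed class would register the same way."""

    def extract_text(self, image_path: str) -> str:
        return f"[text extracted from {image_path}]"


def get_ocr_engine(config: dict) -> OcrEngine:
    """Instantiate whichever engine the config names, failing loudly on unknown names."""
    name = config.get("ocr_engine", "dummy")
    try:
        return _OCR_ENGINES[name]()
    except KeyError:
        raise ValueError(
            f"Unknown OCR engine {name!r}; available: {sorted(_OCR_ENGINES)}"
        ) from None
```

The same registry shape would work for the AI provider: the rest of the pipeline depends only on the interface, so swapping a backend becomes a config change.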
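
The error handling behaviour in section 9 (skip problematic files, keep the batch running, provide an error report) can be sketched as a loop that isolates each file. This is a simplified, single-process illustration; `BatchReport`, `process_batch`, and the `convert` callable are hypothetical names, and a production version would sit behind the queue system mentioned in section 4.1.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable

logger = logging.getLogger("batch")


@dataclass
class BatchReport:
    """Outcome of one batch run, suitable for rendering as an error report."""
    succeeded: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # path -> error message


def process_batch(paths: list[str], convert: Callable[[str], None]) -> BatchReport:
    """Run convert() on every path; a failure is logged and recorded, never re-raised."""
    report = BatchReport()
    for path in paths:
        try:
            convert(path)
        except Exception as exc:  # deliberately broad: one bad file must not stop the batch
            logger.exception("Conversion failed for %s", path)
            report.failed[path] = f"{type(exc).__name__}: {exc}"
        else:
            report.succeeded.append(path)
    return report
```

Because failures are collected rather than raised, a batch of 50 to 100 files finishes even when some inputs are corrupt, and `report.failed` lists exactly which ones need attention.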
