I’m compiling a clean corpus of text for an upcoming software tool and need reliable help pulling that content together. The material must come from two sources only: informational blogs on the open web and PDF documents that I will supply (or point you to). No images, tables, or e-commerce pages are involved; plain text is all I’m after.

Here’s what I need from you:

• Crawl or scrape the specified blogs, capture the article body (excluding ads, headers, footers, and comments), and export the text into UTF-8 files or a single CSV.
• Parse the PDFs, extract their textual content with page order intact, and deliver it in the same format you use for the blog data.
• Keep simple metadata: source URL or PDF filename, published date when available, and article title or PDF title.

I’m flexible on the tooling (BeautifulSoup, Scrapy, Python’s pdfminer, or any stack you’re comfortable with) as long as the output is clean and reproducible; I’ve pasted rough sketches of the kind of thing I’m picturing at the end of this post. Accuracy matters more than speed: I’ll run a quick spot-check to confirm that the text is complete and free of markup.

If this sounds straightforward to you, let me know what library or approach you’d use and how long the initial batch (≈300 blog posts and 300 PDFs) will take. I’m ready to start right away and will provide links and sample files as soon as we agree on the plan.
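P.S. For the blog side, here is the rough shape I have in mind. This is only a sketch, not a spec: requests + BeautifulSoup is just one stack that would satisfy me, and ARTICLE_URLS, the <article> selector, and the CSV column names are all placeholders. Real blogs will each need their own selector tweaks.

```python
# Sketch of the blog side: fetch each post, strip non-content regions,
# and write one CSV row per article with the metadata I listed above.
import csv

import requests
from bs4 import BeautifulSoup

ARTICLE_URLS = [
    "https://example.com/blog/post-1",  # placeholder; real links to follow
]

def extract_article(url: str) -> dict:
    """Fetch one post and return its title, date (when present), and body text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop ads/navigation/boilerplate regions before extracting text.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()

    # <article> is a guess; many themes need a site-specific selector here.
    body = soup.find("article") or soup.body or soup
    title = soup.title.get_text(strip=True) if soup.title else ""
    date_tag = soup.find("time")
    published = date_tag.get("datetime", "") if date_tag else ""

    return {
        "source": url,
        "title": title,
        "published": published,
        "text": body.get_text(separator="\n", strip=True),
    }

with open("blog_corpus.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["source", "title", "published", "text"])
    writer.writeheader()
    for url in ARTICLE_URLS:
        writer.writerow(extract_article(url))
```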
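And for the PDFs, a minimal sketch assuming pdfminer.six (any extractor that preserves page order is fine); PDF_PATHS stands in for the files I’ll send, and the per-page text is written out as one UTF-8 file per document:

```python
# Sketch of the PDF side: extract_text returns the document's text
# in page order, which is the property I care about most.
from pathlib import Path

from pdfminer.high_level import extract_text

PDF_PATHS = [Path("sample.pdf")]  # placeholder; sample files to follow

for pdf in PDF_PATHS:
    text = extract_text(str(pdf))        # pages come back in document order
    out = pdf.with_suffix(".txt")        # metadata keyed by filename
    out.write_text(text, encoding="utf-8")
    print(f"{pdf.name}: {len(text)} characters extracted")
```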