Bilingual Text Web Crawler

Client: AI | Published: 03.01.2026
Budget: $10,000

I’m building a parallel-text corpus and need a crawler that automatically discovers and harvests bilingual web pages. The focus is on sites that present content side by side or via language-switch links, in pairs such as English–Spanish, English–French, and English–German. For every qualifying page I only need the raw textual content; no PDFs, Word files, images, or metadata are required.

Here’s what I have in mind: a Python-based scraper (Scrapy, Playwright, or another framework you prefer) that starts from a seed list I’ll supply, follows internal language toggles or sitemap clues, detects the two language versions, and stores the cleaned text for each language in clearly paired JSON or CSV records. A quick language-ID check or alignment heuristic is fine as long as the output shows which paragraph belongs to which language.

Deliverables
• A reusable crawler script with clear setup instructions
• Configurable settings for rate limiting and polite crawling (robots.txt compliance, user-agent string, delays)
• Output files containing the aligned English–Spanish, English–French, and English–German text pairs
• A brief README explaining how to extend the crawler to additional language pairs later

Acceptance criteria
The crawler must run on my Linux machine, process at least 500 bilingual pages from the initial seeds without manual intervention, and produce clean, UTF-8-encoded text with less than 2 % HTML residue on a random sample.

If this matches your expertise in web scraping and NLP preprocessing, let’s talk through the details and timeline.
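To make the expected output concrete, here is a minimal sketch of the paired-record format and the "quick language-ID check" mentioned above. The stopword-based guesser is a deliberately trivial stand-in for a real language-ID library (e.g. langdetect or fastText), and all names (`guess_language`, `make_record`, the field names) are illustrative, not a required schema.

```python
import json

# Tiny stopword-overlap language guesser -- a stand-in for a real
# language-ID library; covers only the four languages in the brief.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is"},
    "es": {"el", "la", "de", "que", "y"},
    "fr": {"le", "et", "les", "des", "un"},
    "de": {"der", "die", "und", "das", "ist"},
}

def guess_language(text: str) -> str:
    """Return the language whose stopword set overlaps the text most."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

def make_record(url: str, source_text: str, target_text: str) -> str:
    """Serialize one aligned bilingual pair as a JSON line."""
    record = {
        "url": url,
        "src_lang": guess_language(source_text),
        "tgt_lang": guess_language(target_text),
        "src_text": source_text,
        "tgt_text": target_text,
    }
    return json.dumps(record, ensure_ascii=False)
```

One JSON line per page pair keeps the output streamable and trivially convertible to CSV if preferred.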
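The polite-crawling deliverable maps directly onto standard Scrapy settings. A sketch of the relevant keys, with illustrative (not tuned) values; the user-agent string is a placeholder:

```python
# Polite-crawling configuration, expressed as a Scrapy settings dict
# (usable as a spider's custom_settings). Values are illustrative defaults.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # honour robots.txt before fetching
    "DOWNLOAD_DELAY": 2.0,                # seconds between requests per domain
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # never hammer a single host
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "USER_AGENT": "BilingualCorpusBot/0.1 (+mailto:contact@example.com)",
}
```

Exposing these in a config file rather than hard-coding them satisfies the "configurable settings" requirement.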
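The "less than 2 % HTML residue" acceptance criterion can be checked mechanically. A possible sketch, assuming residue is measured as the fraction of characters belonging to leftover HTML tags (the function name and the regex-based definition are my assumptions, not part of the brief):

```python
import re

# Matches leftover HTML tags such as <p>, </div>, <br/> in extracted text.
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")

def html_residue_ratio(text: str) -> float:
    """Fraction of characters that belong to unstripped HTML tags."""
    if not text:
        return 0.0
    tag_chars = sum(len(m.group(0)) for m in TAG_RE.finditer(text))
    return tag_chars / len(text)
```

Running this over a random sample of output records and asserting the ratio stays below 0.02 would make the acceptance test reproducible on the client's machine.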