2TB Website Data Extraction

Client: AI | Published: 07.01.2026

I have a standing need to pull roughly two terabytes of content data from several large-scale online data platforms. The information itself is public-facing page content rather than transactions or personal records, but the sheer volume calls for an approach well beyond ad hoc scraping.

Here is what I am after: a repeatable, monitored pipeline that can collect, decompress (where necessary), and store this data locally or in S3 without running into rate limits or getting the target domains blocked. You are free to choose the stack (Python with Scrapy, Selenium, Playwright, or an equivalent solution) as long as it reliably streams the data and keeps a clear audit log of every request made. Checkpointing is important: if the run stops at 600 GB, I want to be able to resume from that point, not start over. A rough sketch of the checkpoint and audit-log behaviour I have in mind follows the deliverables list below.

Deliverables will include:
• All acquisition scripts, with clear instructions to rerun them
• The harvested data, organised by source and timestamped
• A short report summarising crawl performance, error handling, and any throttling mitigation you applied

If you have handled multi-hundred-gigabyte crawls before, especially on social, news, or other high-traffic content sites, let's talk.
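To make the checkpointing and audit-log expectations concrete, here is a minimal sketch of the behaviour I mean. It is illustrative only: the file names, the checkpoint format, and the example URL are placeholders I made up, and it assumes plain HTTP downloads from servers that honour Range requests. The real pipeline can be built on Scrapy, Playwright, or whatever stack you propose, writing to local disk or S3.

    """Sketch of the checkpoint/resume and audit-log behaviour expected.

    Hypothetical example only: file names, checkpoint format, and the target
    URL are placeholders; the production pipeline may use a different stack.
    """
    import json
    import logging
    import os
    import time

    import requests

    CHECKPOINT_FILE = "checkpoint.json"   # bytes already fetched, per URL (assumed format)
    AUDIT_LOG = "requests_audit.log"      # one line per HTTP request issued
    REQUEST_DELAY_S = 1.0                 # crude throttling; real runs should respect per-site limits

    logging.basicConfig(filename=AUDIT_LOG, level=logging.INFO,
                        format="%(asctime)s %(message)s")


    def load_checkpoint() -> dict:
        """Return {url: bytes_downloaded}; empty dict on a first run."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as fh:
                return json.load(fh)
        return {}


    def save_checkpoint(state: dict) -> None:
        with open(CHECKPOINT_FILE, "w") as fh:
            json.dump(state, fh)


    def fetch_resumable(url: str, dest: str, state: dict, chunk=1 << 20) -> None:
        """Stream `url` to `dest`, resuming from the last checkpointed byte offset."""
        offset = state.get(url, 0)
        headers = {"Range": f"bytes={offset}-"} if offset else {}
        logging.info("GET %s offset=%d", url, offset)        # audit trail of every request
        with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            mode = "ab" if offset else "wb"
            with open(dest, mode) as out:
                for block in resp.iter_content(chunk_size=chunk):
                    out.write(block)
                    offset += len(block)
                    state[url] = offset
                    save_checkpoint(state)                   # coarse, but survives a mid-run crash
        time.sleep(REQUEST_DELAY_S)


    if __name__ == "__main__":
        urls = ["https://example.com/dumps/part-001.gz"]     # placeholder source list
        state = load_checkpoint()
        for u in urls:
            fetch_resumable(u, os.path.basename(u), state)

The point of the sketch is the shape of the solution, not the implementation: every request is logged with its byte offset, progress is persisted as the transfer runs, and a rerun after a crash picks up where the previous run stopped instead of starting from zero.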