Python Scraper – Robust Playwright Scraper + API Integration (Fixed Project)

Client: AI | Published: 06.01.2026

We are looking for an experienced Python engineer to develop a robust, production-grade web scraper designed to extract structured exam-style question data and deliver it to an existing Django backend via a secured API. This is not a quick or hacky scraping task. The goal is to build a clean, maintainable, and resilient scraper that behaves like a human user, respects throttling limits, and can run unattended on a low-cost VPS without triggering blocks. The same developer will also handle the Django backend, so code clarity, architecture, and consistency are critical.

Core Responsibilities
- Build a Python-based scraper using Playwright (preferred) or a well-justified alternative.
- Extract structured question data, including:
  - Question text
  - Multiple-choice options (variable count)
  - Correct answer
  - Image URL (when available)
- Support multiple categories/modules via a configurable whitelist.
- Normalize and prepare data for backend ingestion (deduplication handled server-side).

Architecture & Integration
- The scraper will send data in batches (per test/session) to a Django REST API.
- Authentication via existing JWT endpoints.
- Additional security via IP whitelisting.
- One API call per completed batch (not per question).

API Behavior Expectations
- Backend responds with:
  - received / inserted / duplicates / failed counts
  - optional error summaries
  - optional next_backoff_seconds to dynamically slow down scraping
- On API failure:
  - Retry with backoff
  - If still failing: persist the batch to disk (spool) and stop execution safely

Throttling & Human-Like Behavior (Configurable)
All delays must be fully configurable via YAML (no hardcoded values):
- Delay between interactions (seconds, with support for random ranges)
- Delay between batches/tests
- Long cooldown after N batches
- Respect backend-provided backoff signals

The scraper must be designed to avoid detection and blocking, prioritizing stability over speed.
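As an illustration of the YAML-only configuration described above, a template along these lines would cover every configurable item the posting names. All key names and values here are assumptions, not a fixed spec:

```yaml
# config.yaml — illustrative template; key names are assumptions, not a spec
modules:                # whitelist of enabled categories/modules
  - category_a
  - category_b
language: "de"          # UI language selection

throttling:
  interaction_delay: [1.5, 4.0]   # seconds, random range between interactions
  batch_delay: [30, 90]           # seconds, random range between batches/tests
  cooldown_after_batches: 10      # long cooldown after N batches
  cooldown_seconds: 600

api:
  base_url: "https://backend.example.com/api"   # hypothetical endpoint
  jwt_token_endpoint: "/auth/token/"
  username: "scraper"
  password: "change-me"

spool_dir: "./spool"    # where failed batches are persisted as JSON

runtime:
  headless: true        # headless / headed
```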
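The batch-submission behavior above (one call per batch, JWT auth, retry with backoff, honoring `next_backoff_seconds`, spooling on final failure) could be sketched as follows. This is a minimal sketch using the `requests` library; the endpoint URL, payload shape, and file naming are assumptions:

```python
import json
import time
import uuid
from pathlib import Path

import requests  # assumed HTTP client; any equivalent would do

API_URL = "https://backend.example.com/api/questions/batch/"  # hypothetical endpoint
SPOOL_DIR = Path("spool")


def send_batch(batch, token, max_retries=3, backoff_base=5.0):
    """POST one completed batch of questions; retry with exponential backoff.

    On final failure, persist the batch as one JSON spool file so the run
    can stop safely and be replayed later.
    """
    headers = {"Authorization": f"Bearer {token}"}  # token from the existing JWT endpoint
    for attempt in range(max_retries):
        try:
            resp = requests.post(API_URL, json={"questions": batch},
                                 headers=headers, timeout=30)
            resp.raise_for_status()
            summary = resp.json()  # received / inserted / duplicates / failed counts
            backoff = summary.get("next_backoff_seconds")
            if backoff:
                time.sleep(backoff)  # honour the backend-provided slowdown signal
            return summary
        except requests.RequestException:
            time.sleep(backoff_base * 2 ** attempt)  # exponential backoff between retries
    # Still failing: one JSON spool file per failed batch, then stop safely.
    SPOOL_DIR.mkdir(exist_ok=True)
    spool_file = SPOOL_DIR / f"batch-{uuid.uuid4().hex}.json"
    spool_file.write_text(json.dumps(batch), encoding="utf-8")
    return None  # caller should abort execution safely
```

Returning `None` (rather than raising) lets the calling loop distinguish "backend rejected/unreachable, stop now" from a normal summary without extra exception plumbing.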
Error Handling & Observability
- Clear structured logging (INFO / WARN / ERROR).
- On scraping failure:
  - Capture a screenshot + HTML dump
  - Abort execution safely
  - Upload artifacts to shared storage (e.g. Slack, Google Drive, Mega, or equivalent)
- API failures must generate:
  - logs
  - persisted spool files (one JSON file per failed batch)

Configuration
- YAML-only configuration (no CLI overrides required).
- Configurable items include:
  - enabled modules (whitelist)
  - language selection
  - throttling / cooldown parameters
  - API endpoint & credentials
  - spool directory
  - runtime options (headless / headed)

Deliverables
1. Clean, well-structured Python project:
   - Modular codebase
   - Clear separation of concerns
2. Configuration template (config.yaml)
3. Logging & error-handling implementation
4. API integration with retry + spool logic
5. Screenshot & HTML capture on critical failures
6. Documentation (README.md) covering:
   - setup
   - configuration
   - execution
   - failure recovery
7. Dependency management:
   - requirements.txt
   - pyproject.toml
   - Dockerfile

Quality Expectations
- No brittle hardcoded XPaths
- Robust selectors with fallbacks
- Clean, readable, maintainable code
- No scraping shortcuts that would cause instability
- Designed for long-running unattended execution

Acceptance Criteria
- Successfully handles all whitelisted modules
- Completes at least 10 test runs per module without failure
- Sends data to the backend reliably in batch mode
- Correctly handles API backoff and failures
- Produces usable logs and diagnostics on error

Required Skills
- Strong Python experience
- Playwright or Selenium automation
- REST API integration (JWT-based auth)
- Experience with long-running scrapers
- Familiarity with rate-limiting and anti-blocking strategies
- Linux/VPS execution experience

Project Type
- Fixed-price project
- Milestones will be clearly defined and agreed upon before starting

Notes
- The target platform's UI is not in English, so selector robustness is mandatory.
- This project values engineering quality and reliability over raw speed.
- Strong communication and clean delivery are expected.
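The "robust selectors with fallbacks" and "screenshot + HTML capture on failure" expectations could be sketched as below. This assumes Playwright's synchronous `Page` API; the selector strings and file names are illustrative, not taken from the target platform:

```python
import logging
from pathlib import Path

log = logging.getLogger("scraper")

# Ordered fallback selectors for the answer options; strings are illustrative.
# Prefer stable data attributes over brittle structural XPaths.
OPTION_SELECTORS = [
    "[data-testid='answer-option']",            # most stable: data attribute
    "ul.answers > li",                          # structural CSS fallback
    "xpath=//li[contains(@class, 'option')]",   # last-resort XPath
]


def query_with_fallbacks(page, selectors):
    """Return the first selector that matches at least one element, or None.

    `page` is expected to expose Playwright's `locator(...).count()` interface.
    """
    for sel in selectors:
        if page.locator(sel).count() > 0:
            log.info("selector matched: %s", sel)
            return sel
    log.warning("no selector matched out of %d candidates", len(selectors))
    return None


def capture_failure(page, out_dir="artifacts"):
    """On a scraping failure, save a screenshot and full HTML dump for diagnosis."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    page.screenshot(path=str(out / "failure.png"), full_page=True)
    (out / "failure.html").write_text(page.content(), encoding="utf-8")
```

Because language selection is configurable and the UI is not in English, selectors should key on attributes and structure rather than visible text, which is what the fallback list above models.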