NLP Indexing Tool

Client: AI | Published: 25.09.2025

We are turning an index from a concept into a living dataset that scores hundreds of employers and vendors on AI transparency and governance. To do that, we need a partner who can turn openly available web content into structured signals and then convert those signals into a repeatable scoring model.

Scope of sources

The first release must pull information from company reports, third-party databases, news articles, and career pages. The crawler should be modular so we can add or swap sources.

Critical extractions

Across those feeds we will rely most on large-scale text scraping and sentiment analysis, with named-entity recognition as a useful add-on for future iterations. Accuracy and auditability of the extracted text are non-negotiable because the resulting scores will be published.

What you will ship

• A Python-based scraping pipeline (Scrapy or BeautifulSoup + Selenium for dynamic sites) that schedules, fetches, and deduplicates content
• NLP routines (spaCy/Hugging Face acceptable) that tag relevant policy statements, governance claims, workforce disclosures, and public sentiment, then output weighted features ready for scoring
• A scoring module that ingests those features, applies our weighting logic, and produces a transparent, CSV-ready company-level scorecard
• Clear documentation and one walkthrough so in-house analysts can rerun or extend the pipeline

Acceptance criteria

1. The scraper covers at least 90% of companies in our pilot list without manual intervention.
2. Sentiment scores and key entity tags achieve F1 ≥ 0.8 on a validation sample we provide.
3. The full pipeline completes end-to-end execution (scrape → extract → score) in under two hours for 100 companies on a standard AWS t3.large instance.

If this mix of data engineering and NLP excites you, let's get started.
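To illustrate what "modular so we can add or swap sources" could look like, here is a minimal sketch. The `Source`/`SourceRegistry` names and the one-method interface are our assumptions, not the required design; the point is that adding a feed should be a single `register()` call.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """One pluggable content feed (company reports, news, career pages, ...)."""

    name: str  # short identifier used as the registry key

    @abstractmethod
    def fetch(self, company: str) -> list[str]:
        """Return raw text documents about `company` from this feed."""


class SourceRegistry:
    """Holds the active feeds; swapping a source never touches the pipeline code."""

    def __init__(self) -> None:
        self._sources: dict[str, Source] = {}

    def register(self, source: Source) -> None:
        self._sources[source.name] = source

    def fetch_all(self, company: str) -> dict[str, list[str]]:
        """Pull every registered feed for one company, keyed by source name."""
        return {name: src.fetch(company) for name, src in self._sources.items()}
```

A new feed is then just another `Source` subclass registered at startup, which keeps the downstream extraction code source-agnostic.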
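For the deduplication step of the scraping pipeline, one common approach (a sketch, not a prescribed implementation) is content fingerprinting: normalize the extracted text, hash it, and skip anything already seen, so re-rendered copies of the same page don't inflate the dataset.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Normalize whitespace and case, then hash, so trivially
    re-rendered copies of the same content collide."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers fingerprints of everything seen so far; is_new()
    returns True only the first time a piece of content appears."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, text: str) -> bool:
        fp = content_fingerprint(text)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

For near-duplicate detection (boilerplate variations across mirrors) a fuzzier scheme such as shingling would be needed, but exact fingerprinting covers the common refetch case.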
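The scoring module could follow this shape. The feature names and weights below are placeholders, not the client's actual weighting logic: normalized (0 to 1) weighted features go in, and a transparent CSV-ready scorecard row per company comes out, with each contributing feature shown next to the final score.

```python
import csv
import io

# Placeholder weights -- the real weighting logic is supplied by the client.
WEIGHTS = {
    "policy_disclosure": 0.40,
    "governance_claims": 0.35,
    "public_sentiment": 0.25,
}


def score_company(features: dict[str, float]) -> float:
    """Weighted sum of normalized (0-1) feature values, scaled to 0-100."""
    total = sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return round(100 * total, 1)


def write_scorecard(companies: dict[str, dict[str, float]]) -> str:
    """Emit a CSV scorecard: one row per company, each weighted
    feature alongside the final score, so the math is auditable."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["company", *WEIGHTS, "score"])
    for name, features in companies.items():
        row = [name] + [features.get(f, 0.0) for f in WEIGHTS]
        writer.writerow(row + [score_company(features)])
    return buf.getvalue()
```

Writing the raw feature values into the same row as the score is what makes the scorecard "transparent": an analyst can recompute any published number by hand.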
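Acceptance criterion 2 (F1 ≥ 0.8 on the validation sample) can be checked with a small standard-library helper. Representing each document's gold and predicted entity tags as sets is an assumption here; the validation sample's actual format will come from the client.

```python
def f1_score(gold: set[str], predicted: set[str]) -> float:
    """F1 over one document's tag sets:
    precision = |gold ∩ pred| / |pred|, recall = |gold ∩ pred| / |gold|."""
    if not gold and not predicted:
        return 1.0
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def meets_threshold(samples: list[tuple[set[str], set[str]]],
                    threshold: float = 0.8) -> bool:
    """Average per-document F1 across the validation sample and
    compare against the acceptance threshold."""
    scores = [f1_score(gold, pred) for gold, pred in samples]
    return sum(scores) / len(scores) >= threshold
```

Whether the criterion is judged on macro-averaged (as here) or micro-averaged F1 should be pinned down with the client before the validation run.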