I am looking for a Python developer who can already log in to WRDS through wrds-py and access LSEG/Refinitiv endpoints without guidance. The assignment is to build a fully repeatable pipeline that pulls 10-K and 10-Q filings from SEC EDGAR, strips away tables and markup, and then measures keyword frequency so I can analyse language trends across firms and years.

You will need to:

• Authenticate and query WRDS and LSEG/Refinitiv programmatically, retrieving identifiers and any metadata that help tie the SEC filings back to the correct company and fiscal period.
• Download the raw filings from EDGAR, handle rate limits, and keep a local cache so the process can be rerun without hitting the servers unnecessarily.
• Clean the text (lower-case, strip HTML, remove boilerplate, normalise whitespace, etc.) and tokenise it with standard NLP tools such as NLTK, spaCy or similar.
• Compute frequency counts for a configurable list of keywords, then aggregate those counts into a structured Pandas DataFrame keyed by company-date.
• Wrap everything in clear, self-documented Python scripts or notebooks, with a README that explains dependencies, credentials setup, and how to reproduce results end-to-end.

Acceptance criteria

1. Running one command (or notebook cell) fetches fresh filings, cleans them, and produces a DataFrame ready for analysis.
2. Keyword counts in the DataFrame match simple spot checks on raw filings.
3. Code is PEP 8-compliant and includes inline comments and docstrings.
4. README covers installation, API credential setup for WRDS and LSEG/Refinitiv, and execution steps.

Please let me know your expected turnaround time and any prior projects where you have combined these specific data sources.

To make the scope concrete, rough sketches of each step follow below. They are illustrative only and make assumptions noted in each case; I expect the delivered code to replace them, not reuse them verbatim.
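Identifier retrieval: a minimal sketch of the WRDS step using wrds-py. It assumes the CIK-to-GVKEY link can be taken from Compustat's comp.company table (which carries tic, cik and conm columns); if you prefer a different linking table, substitute it here.

```python
import wrds

def fetch_company_ids(tickers):
    """Pull GVKEY, ticker, CIK and company name from Compustat via WRDS.

    Assumes comp.company is an acceptable linking source; swap in your
    preferred table if you use a dedicated CIK-GVKEY link.
    """
    db = wrds.Connection()  # reads ~/.pgpass or prompts for credentials
    placeholders = ",".join(f"'{t}'" for t in tickers)
    query = f"""
        select gvkey, tic, cik, conm
        from comp.company
        where tic in ({placeholders})
          and cik is not null
    """
    ids = db.raw_sql(query)  # returns a pandas DataFrame
    db.close()
    return ids
```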
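EDGAR download: a sketch of the throttling and caching behaviour I have in mind. The User-Agent string is a placeholder (the SEC requires one with contact details), and the interval is set conservatively under the SEC's published fair-use limit of roughly ten requests per second.

```python
import time
from pathlib import Path

import requests

CACHE_DIR = Path("edgar_cache")
HEADERS = {"User-Agent": "Your Name your.email@example.com"}  # placeholder; SEC requires contact info
MIN_INTERVAL = 0.2  # seconds between requests, well under the fair-use limit

_last_request = 0.0

def throttled_get(url):
    """GET with a simple global rate limit."""
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    resp = requests.get(url, headers=HEADERS, timeout=30)
    _last_request = time.monotonic()
    resp.raise_for_status()
    return resp

def fetch_filing(url, cache_key):
    """Return filing text, hitting EDGAR only on a cache miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    cached = CACHE_DIR / cache_key
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    text = throttled_get(url).text
    cached.write_text(text, encoding="utf-8")
    return text
```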
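Cleaning and tokenisation: a sketch using BeautifulSoup and NLTK, dropping table elements before extracting text, then lower-casing and collapsing whitespace. Note that newer NLTK releases may need the "punkt_tab" tokenizer data in place of "punkt", so the sketch requests both.

```python
import re

import nltk
from bs4 import BeautifulSoup

nltk.download("punkt", quiet=True)      # tokenizer model, one-time download
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK versions

def clean_filing(html):
    """Strip tables and markup, lower-case, and normalise whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        table.decompose()  # drop financial tables before extracting text
    text = soup.get_text(" ")
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

def tokenise(text):
    """Split cleaned text into word tokens."""
    return nltk.word_tokenize(text)
```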
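Keyword counting: a sketch of the aggregation into a company-date DataFrame. The record layout (company_id, filing_date, tokens) is a placeholder for whatever the earlier steps actually produce.

```python
from collections import Counter

import pandas as pd

def keyword_counts(records, keywords):
    """Build a company-date by keyword count table.

    `records` is an iterable of (company_id, filing_date, tokens) tuples;
    the field names are placeholders for the pipeline's real output.
    """
    keyword_set = set(keywords)
    rows = []
    for company_id, filing_date, tokens in records:
        counts = Counter(t for t in tokens if t in keyword_set)
        row = {"company_id": company_id, "filing_date": filing_date}
        row.update({kw: counts.get(kw, 0) for kw in keywords})
        rows.append(row)
    return pd.DataFrame(rows).set_index(["company_id", "filing_date"])
```

A quick check like `keyword_counts([("0001", "2023-12-31", ["risk", "risk", "growth"])], ["risk", "growth"])` should show risk = 2 and growth = 1, which is the kind of spot check I will run for acceptance criterion 2.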
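Entry point: for acceptance criterion 1, a sketch of how the pieces above might compose into a single command. It calls the functions from the earlier sketches, and `list_filings` is a hypothetical helper (enumerating filing URLs, cache keys and dates per CIK) that you would design yourself.

```python
import argparse

def main():
    """One command: fetch, clean, count, and save the DataFrame."""
    parser = argparse.ArgumentParser(description="EDGAR keyword pipeline")
    parser.add_argument("--tickers", nargs="+", required=True)
    parser.add_argument("--keywords", nargs="+", required=True)
    parser.add_argument("--out", default="keyword_counts.parquet")
    args = parser.parse_args()

    ids = fetch_company_ids(args.tickers)
    records = []
    for _, row in ids.iterrows():
        # list_filings is a hypothetical helper yielding (url, cache_key, date)
        for url, key, date in list_filings(row["cik"]):
            html = fetch_filing(url, key)
            tokens = tokenise(clean_filing(html))
            records.append((row["gvkey"], date, tokens))
    keyword_counts(records, args.keywords).to_parquet(args.out)

if __name__ == "__main__":
    main()
```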