Python SDTM Automation Library Development

I’m looking to turn the repetitive steps of SDTM dataset creation into a single, reusable Python package. My top priority is a well-structured library that any study team can drop into their workflow, not a one-off script.

Scope
• Build a modular Python package (PEP 517 compliant) that ingests raw clinical data plus study metadata (Define-XML or Excel spec) and outputs SDTM-ready domains.
• Automate two key stages:
  – Domain identification: use NLP or other classification techniques to map source variables to the correct SDTM domain.
  – Metadata-driven transformations: read the study’s metadata and apply the correct variable labels, controlled terminology, and derivations automatically, following CDISC/SDTM rules.

AI/ML layer
I expect lightweight, maintainable models (ideally scikit-learn, spaCy, or similar) to handle variable-to-domain mapping and terminology matching. Please structure the code so models can be retrained with new studies.

Deliverables
• Source-controlled Python library with clear entry points (CLI and callable APIs)
• Pre-trained models and training pipeline scripts
• Unit tests and example notebooks that run end-to-end on a sample study
• Setup/usage documentation for data managers and statisticians

Acceptance
The library should reproduce at least 90% of domain assignments and variable labels on the supplied test study without manual edits, and all transformations must validate against Pinnacle 21.

If you have deep CDISC/SDTM know-how and enjoy clean, testable Python code, we’ll be a great fit.
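
To make the metadata-driven transformation stage more concrete, here is a minimal sketch of what I have in mind. It is an illustration only, not a required design: the names `VariableSpec` and `apply_spec`, the raw column names, and the sample rows are all hypothetical, and a real implementation would read the spec from Define-XML or Excel rather than hard-coding it.

```python
from dataclasses import dataclass, field


@dataclass
class VariableSpec:
    """One row of a hypothetical study spec (e.g. one line of an Excel sheet)."""
    source: str   # column name in the raw data (assumed)
    target: str   # SDTM variable name
    label: str    # SDTM variable label
    codelist: dict = field(default_factory=dict)  # raw value -> controlled term


def apply_spec(rows, specs, domain):
    """Rename columns and map values to controlled terms as the spec dictates.

    Returns the transformed records plus a {variable: label} mapping.
    Values with no codelist entry are passed through unchanged.
    """
    records = []
    for row in rows:
        record = {"DOMAIN": domain}
        for spec in specs:
            raw = row.get(spec.source)
            record[spec.target] = spec.codelist.get(raw, raw)
        records.append(record)
    labels = {spec.target: spec.label for spec in specs}
    return records, labels


# Illustrative spec and data for a Demographics (DM) domain.
DM_SPECS = [
    VariableSpec("subj_id", "USUBJID", "Unique Subject Identifier"),
    VariableSpec("gender", "SEX", "Sex", {"Male": "M", "Female": "F"}),
]
RAW_ROWS = [
    {"subj_id": "S-001", "gender": "Male"},
    {"subj_id": "S-002", "gender": "Female"},
]

if __name__ == "__main__":
    records, labels = apply_spec(RAW_ROWS, DM_SPECS, "DM")
    for record in records:
        print(record)
    print(labels)
```

The point of the sketch is that the transformation logic stays generic while everything study-specific lives in the spec, so a new study should only need a new spec, not new code.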

Python
