I need a seasoned data engineer-scientist to take ownership of our entire data flow, from raw ingestion to advanced analytics. The core of the work is building, optimizing, and routinely monitoring ETL pipelines that handle both structured tables and messy unstructured sources at scale.

I work primarily in cloud ecosystems (AWS and Cloudera today, with Dataiku for collaborative modeling), so your code must deploy cleanly across those platforms. A typical day on this project might involve spinning up a Spark cluster (Hadoop is also in play for certain batch jobs), writing production-ready Python and SQL, and pushing the results into a downstream model.

Beyond moving data, I expect you to carry the work forward into exploratory analysis and model development: classic time-series forecasting, supervised machine-learning workflows, deep neural networks, and, where the use case warrants it, experiments with large language models. R is on our stack for ad-hoc statistical work, so familiarity there is a plus.

Deliverables I will review for acceptance:
• Robust, version-controlled ETL scripts/notebooks with clear logging and error handling (a minimal example of the style I have in mind follows at the end of this brief)
• Automated scheduling (Airflow or similar) and resource-optimized Spark/Hadoop jobs
• Clean feature stores or marts ready for analytics
• Concise documentation and hand-off notes that let another engineer pick up the work without guesswork

If iterative releases and code reviews excite you, and you thrive in hybrid cloud, big-data environments, let's get started.
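
To set expectations on code quality, here is a minimal PySpark sketch of the kind of ETL script I would consider a starting point. It is illustrative only: the bucket path, table name, column names, and job name are placeholders I made up for this brief, not our actual environment.

    # Illustrative sketch only: paths, table names, and columns are placeholders.
    import logging
    import sys

    from pyspark.sql import SparkSession, functions as F

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("orders_etl")  # hypothetical pipeline name

    RAW_PATH = "s3://example-bucket/raw/orders/"   # placeholder source
    CURATED_TABLE = "analytics.orders_curated"     # placeholder target mart

    def run() -> None:
        spark = SparkSession.builder.appName("orders_etl").getOrCreate()
        try:
            log.info("Reading raw data from %s", RAW_PATH)
            raw = spark.read.json(RAW_PATH)

            # Light cleaning: drop duplicates and rows missing the primary key,
            # then stamp each row with its ingestion time.
            cleaned = (
                raw.dropDuplicates(["order_id"])
                   .filter(F.col("order_id").isNotNull())
                   .withColumn("ingested_at", F.current_timestamp())
            )

            log.info("Writing %d rows to %s", cleaned.count(), CURATED_TABLE)
            cleaned.write.mode("overwrite").saveAsTable(CURATED_TABLE)
            log.info("ETL run completed successfully")
        except Exception:
            log.exception("ETL run failed")
            sys.exit(1)
        finally:
            spark.stop()

    if __name__ == "__main__":
        run()

Scheduling a script like this under Airflow (or an equivalent orchestrator) and tuning the Spark job's resources are part of the deliverable; the sketch only shows the logging and error-handling style I expect.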