Project Name
Senior Data / Knowledge Engineer - Agentic Procurement, RAG, Knowledge Graphs

Alternative titles (optional, in description):
• Data Architect - Agentic AI
• Knowledge Systems Engineer
• AI Data Platform Engineer
________________________________________
About the Role
We are building agentic AI solutions for Procurement / Source-to-Pay (S2P) - systems that do not just answer questions, but take actions across workflows (sourcing, contracting, purchasing, invoicing, compliance).
This role owns the data + knowledge foundation that makes agents trustworthy: turning messy enterprise content (contracts, policies, decks, spreadsheets, invoices, supplier documents) into structured, searchable, citation-backed knowledge that agents can retrieve and reason over reliably.
You will work closely with product and engineering to design ingestion pipelines, retrieval systems, and quality guardrails that reduce hallucinations and improve outcomes over time.
________________________________________
What You Will Do (Responsibilities)
• Design and build ingestion + ETL pipelines for messy, real-world enterprise content:
o PDFs (including scanned), DOCX, PPTX, XLSX/CSV, and shared drives
• Extract and structure knowledge from documents:
o Entity extraction (suppliers, terms, clauses, line items, dates, obligations)
o Normalization + metadata tagging (source, confidence, time, owner, category)
• Build the retrieval layer that powers agents:
o Vector search / embeddings, hybrid search, reranking
o Knowledge graph or graph-style linking across documents (contracts <-> POs <-> invoices <-> policies)
• Implement reliability + traceability:
o Grounding and citations (source-backed answers)
o Validation checks (consistency, schema constraints, confidence thresholds)
o Evaluation harnesses to measure retrieval quality and agent accuracy
• Create reusable "knowledge products" for scale:
o Canonical schemas, reusable templates, curated reference chunks
o Domain ontologies / taxonomies for procurement and S2P
• Partner cross-functionally:
o Collaborate with agent/runtime engineers, domain SMEs, and product teams
o Translate procurement workflows into data models + system requirements
________________________________________
What Success Looks Like (Outcomes)
• Agents produce high-precision outputs grounded in source documents, with citations.
• Cross-document reasoning works (for example, invoice exceptions are correctly matched against the relevant contract terms).
• Large datasets (million-row spreadsheets) are handled without dumping everything into RAG: you create summaries, aggregates, and structured stores that are fast and reliable.
• The knowledge layer becomes reusable across clients/projects, not a one-off.
________________________________________
Required Qualifications
• 5+ years building data platforms, pipelines, or knowledge systems in production
• Strong backend/data engineering skills in Python (preferred) and/or JVM/TypeScript
• Experience designing schemas/data models for complex business domains
• Hands-on experience with at least one of the following:
o Vector databases / embedding stores (for example, pgvector, Pinecone, Weaviate, Milvus, Elasticsearch vector)
o Search systems (Elasticsearch/OpenSearch/Solr) and hybrid retrieval
• Proven ability to handle enterprise data constraints:
o Messy formats, access control, lineage, auditability, quality checks
• Comfortable working in ambiguity and building v1 -> v2 systems quickly without sacrificing correctness
________________________________________
Nice to Have (Strong Plus)
• Knowledge graph experience (Neo4j / RDF / property graphs / graph tables)
• Document AI experience (OCR, layout-aware parsing, table extraction)
• RAG production experience (chunking strategies, metadata filters, rerankers, evaluation sets)
• LLM/agent integration experience (tool calling, agent workflows, guardrails)
• Procurement / S2P domain familiarity (contracts, POs, invoices, suppliers, compliance)
________________________________________
Example Tech Stack (Flexible)
• Data + ETL: Python, Airflow/Dagster/Prefect, dbt
• Storage: Postgres (+pgvector), S3/GCS, Snowflake/BigQuery (optional)
• Search/Retrieval: OpenSearch/Elasticsearch, vector DB, rerankers
• Graphs: Neo4j or graph-enabled DB patterns
• LLM orchestration: LangChain/LlamaIndex or internal framework
• Infra: AWS/GCP, Docker/Kubernetes, CI/CD