I have a collection of purely numerical data and I want to turn those rows and columns into clear, decision-driving stories. The plan is to generate the narratives, then let a large language model act as an independent judge that scores those stories against insights produced by more traditional statistical analysis.

What I need from you

First, help me tighten the problem statement so the research goals are unambiguous. From there, design and code an end-to-end pipeline—Python is fine—that:

• ingests numerical data,
• produces narrative text (prompt-engineered or template-based, whichever yields stronger results),
• feeds both the narrative and the raw statistics into an LLM “judge,”
• captures the judge’s decisions alongside classical metrics (accuracy, MAE, R², or similar), and
• outputs a concise statistical report showing where the LLM agrees or disagrees with the baseline.

Automation matters. I want the entire judging cycle triggered by a single command or API call, so that new data drops straight through the process without manual work. A short README that lets me reproduce the results locally will be the final checkpoint.

Acceptance criteria

1. A refined problem statement delivered as a living document.
2. Reproducible code (Python, pandas, scikit-learn, LangChain/OpenAI or similar) that runs on sample data I provide.
3. A metrics table and a visual summary that quantify the LLM judge’s agreement with the traditional analysis.
4. One-click (or single-command) execution proving the automation.
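To make the template-based narrative option concrete, here is a minimal sketch of how per-column summary statistics could be turned into plain-English sentences. The DataFrame, column names, and sentence template are all hypothetical placeholders, not a prescribed design:

```python
import pandas as pd

def narrate(df: pd.DataFrame) -> str:
    """Turn per-column summary statistics into template-based sentences."""
    sentences = []
    for col in df.select_dtypes("number").columns:
        s = df[col]
        # One templated sentence per numeric column; templates are illustrative.
        sentences.append(
            f"{col} averages {s.mean():.2f} (min {s.min():.2f}, "
            f"max {s.max():.2f}), with a standard deviation of {s.std():.2f}."
        )
    return " ".join(sentences)

# Hypothetical sample data standing in for the real dataset.
df = pd.DataFrame({"revenue": [10.0, 12.5, 9.8, 14.2],
                   "churn": [0.05, 0.04, 0.06, 0.03]})
print(narrate(df))
```

A prompt-engineered variant would replace the f-string templates with an LLM call; keeping both behind the same `narrate` signature makes the two approaches easy to swap and compare.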
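The judging step could record, for each narrative, the classical metrics next to the judge's verdict in one row, which is what the final agreement report would aggregate. A minimal sketch follows; `judge_stub` is a hypothetical placeholder where the real LLM API call (LangChain/OpenAI or similar) would go:

```python
from sklearn.metrics import mean_absolute_error, r2_score

def classical_metrics(y_true, y_pred):
    """Baseline statistical scores for the underlying prediction task."""
    return {"mae": mean_absolute_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred)}

def judge_stub(narrative: str, stats: dict) -> dict:
    """Placeholder for the LLM judge; a real version would prompt an LLM
    with both the narrative and the raw stats. Threshold is illustrative."""
    return {"verdict": "agree" if stats["r2"] > 0.5 else "disagree",
            "score": 4}

def agreement_row(narrative, y_true, y_pred):
    """One record pairing classical metrics with the judge's decision."""
    stats = classical_metrics(y_true, y_pred)
    return {**stats, **judge_stub(narrative, stats)}

row = agreement_row("Revenue trends upward.", [1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
print(row)
```

Collecting these rows into a DataFrame gives the metrics table directly, and a simple cross-tabulation of `verdict` against binned `r2` yields the agree/disagree summary.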
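For the single-command requirement, one plausible shape is a small CLI entry point that takes the data path and an output location, then runs the whole cycle. The program name, flags, and defaults below are assumptions for illustration only:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI so the full ingest -> narrate -> judge -> report cycle
    runs from one command; names and flags are hypothetical."""
    p = argparse.ArgumentParser(
        prog="judge-pipeline",
        description="Run the narrative-vs-statistics judging cycle end to end.")
    p.add_argument("data", help="path to the input CSV of numerical data")
    p.add_argument("--out", default="report.json",
                   help="where to write the metrics report")
    return p

# Parsing an explicit argv list here so the sketch runs without a shell.
args = build_parser().parse_args(["sales.csv", "--out", "run1.json"])
print(args.data, args.out)
```

In use this would be `judge-pipeline sales.csv --out run1.json`; wrapping the same function in a FastAPI or Flask route would cover the API-call trigger with no change to the pipeline itself.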