Technical AI Output Reviewer

Client: AI | Published: 01.02.2026

I’m expanding the Bespoke Labs review team and need a seasoned software engineer to judge the quality of code-centric tasks, agentic benchmarks, and reinforcement-learning environments that feed directly into frontier AI research. Your main job will be to open up the model’s work, think like a senior engineer, and decide whether the solution is correct, efficient, and reproducible.

Day-to-day you’ll pull assignments from our queue, spin up the provided repo or Colab, and work through automated and manual checks. When something fails, you’ll pinpoint the issue, add concise reviewer notes, and push a resolution verdict that downstream researchers can trust (a rough sketch of what that might look like is below).

Most of the work happens in Python with a healthy dose of shell tooling, git, and containerised test harnesses; any extra machine-learning intuition is appreciated. Availability matters: I’m aiming for 30–40 hours each week so tasks turn around quickly enough for our partners at OpenThoughts and Terminal Bench.

All work is hourly, paid bi-weekly, and the contract is open-ended: we iterate on new datasets every month and want reviewers who can grow with us. If you’re ready to apply a strong software-engineering mindset to the cutting edge of AI evaluation, send a brief note about your most relevant projects and the earliest date you can start. I’ll share a sample task and we’ll take it from there.
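To give a flavour of the deliverable, here is a minimal sketch in Python of how a reviewer might run a task’s automated checks and assemble a verdict. Everything here is hypothetical: the real queue, verdict schema, and check commands are provided with each assignment, and none of these names come from our actual tooling.

```python
# Hypothetical sketch only: the actual queue, schema, and check commands
# are provided with each assignment; these names are illustrative.
import subprocess
from dataclasses import dataclass, field


@dataclass
class ReviewVerdict:
    task_id: str                      # assignment identifier from the queue
    passed: bool                      # did the model's solution hold up?
    notes: list[str] = field(default_factory=list)  # concise reviewer notes


def run_checks(task_id: str, commands: list[list[str]]) -> ReviewVerdict:
    """Run each automated check; record a short note for anything that fails."""
    verdict = ReviewVerdict(task_id=task_id, passed=True)
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            verdict.passed = False
            # Keep notes concise: the command plus a trimmed stderr excerpt.
            verdict.notes.append(
                f"{' '.join(cmd)} failed: {result.stderr.strip()[:200]}"
            )
    return verdict


if __name__ == "__main__":
    # e.g. run the task's test suite inside the provided environment
    print(run_checks("task-0001", [["python", "-m", "pytest", "-q"]]))
```

In practice the manual half of the review (reading the model’s code, judging efficiency and reproducibility) is where the senior-engineer judgment comes in; the automated pass above just surfaces the obvious failures first.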