We are looking for an AI Red Teaming Engineer / Vibe Coding Engineer to work directly on our internal safety evaluation platform for AI models, with a focus on model-to-model testing (no real users involved). Your role will be to design and run adversarial evaluations against AI systems, simulate realistic personas and conversations, and help us detect and categorize safety failures (guardrail breaks, emotional risks, harmful content, etc.) in AI products designed for children and families. You will also collaborate on improving the underlying platform and evaluation pipelines.

Responsibilities:
- Design model-to-model red teaming tests targeting:
  - Unhealthy emotional attachment behaviors
  - Harmful or unsafe content
  - Manipulative or coercive responses
  - Other child-safety-related risks
- Create and refine prompts, scenarios, and personas to probe model weaknesses.
- Use LLM-based judging / evals to score and categorize outputs.
- Document failure patterns and safety findings in a clear, structured format.
- Collaborate with the engineering team to iterate on evaluation pipelines and tooling.

Tech & skills (nice to have, not all mandatory):
- Strong experience with LLM prompt engineering and adversarial testing.
- Python for scripting evaluations and analysis.
- Experience with TypeScript/JavaScript, React, or similar front-end stacks.
- Familiarity with platforms like Replit or other cloud-based dev environments.
- Experience in AI safety, trust & safety, or content moderation is a plus.

What we’re looking for:
- You have prior hands-on experience red teaming LLMs or building evaluation pipelines.
- You can think creatively about personas, edge cases, and emergent harms.
- You can clearly explain safety findings to both technical and non-technical audiences.
- You are reliable, communicative, and comfortable working async with clear deliverables.

To apply, please answer briefly:
1. How would you design a model-to-model test to detect unhealthy emotional attachment behaviors in AI systems used by children? Mention specific signals or failure patterns you’d look for.
2. What is one limitation of LLM-based judging you’ve personally encountered, and how did you mitigate it?
3. When writing safety findings for non-technical audiences (e.g., parents), what do you avoid including, and why?
4. Confirm: Are you comfortable working in an online IDE (e.g., Replit or similar)? Can you work with an existing TypeScript/React or similar codebase if needed?
5. Share a link to your GitHub/portfolio or relevant case studies.

Please start your proposal with the word “SAFEGUARD” so I know you read the description.