AI Coding Agent Evaluation & RL Environments Engineer
At Mindrift (Toloka AI) I design adversarial evaluation tasks for Claude Opus 4.6 inside simulated company environments, complete with Python and TypeScript repositories, Jira tickets, documentation, and Slack messages. I craft prompts calibrated so the agent fails roughly half the time, then write automated end-to-end validation in pytest, including system tests and AST-based code verification. After each agent run, I analyze the diffs and transcripts in depth, extracting concrete evidence of genuine reasoning failures that feeds directly into Opus's training pipeline.