AI Code Agent Evaluation & Preference Labeling (Anthropic / Mercor)
Evaluated and compared AI code agents (LLM-based) on software engineering tasks across Python, Rust, Go, Java, TypeScript, and other languages. Provided detailed behavioral analysis, transcript-referenced feedback, and preference ratings used to generate RLHF training data for frontier AI models. Maintained rigorous quality standards, spending 90+ minutes per evaluation and recording 3+ behavioral observations per submission.
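A minimal sketch of the kind of pairwise preference record such an evaluation might produce, assuming a simple in-house schema. All names here (`PreferenceRecord`, `Observation`, the rubric dimensions, the 1-7 rating scale) are hypothetical illustrations, not Anthropic's or Mercor's actual labeling format:

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical schema: field names and rubric dimensions are illustrative,
# not the actual Anthropic/Mercor labeling format.

@dataclass
class Observation:
    """One behavioral observation, anchored to a transcript line."""
    transcript_line: int  # line in the agent's transcript being cited
    dimension: str        # e.g. "correctness", "tool use", "code quality"
    note: str             # free-text description of the behavior

@dataclass
class PreferenceRecord:
    """One pairwise comparison between two agent submissions on a task."""
    task_id: str
    preferred: Literal["A", "B", "tie"]
    rating: int  # strength of preference, e.g. on a 1-7 scale
    observations_a: list[Observation] = field(default_factory=list)
    observations_b: list[Observation] = field(default_factory=list)

    def is_complete(self) -> bool:
        # Quality bar from the description above: 3+ observations per submission.
        return len(self.observations_a) >= 3 and len(self.observations_b) >= 3


record = PreferenceRecord(
    task_id="rust-cli-refactor-017",
    preferred="A",
    rating=5,
    observations_a=[
        Observation(112, "correctness", "Handles the empty-input edge case."),
        Observation(208, "tool use", "Re-runs the test suite after each change."),
        Observation(341, "code quality", "Extracts duplicated parsing logic."),
    ],
    observations_b=[
        Observation(54, "correctness", "Panics on empty input; path never tested."),
        Observation(130, "tool use", "Edits files without re-running tests."),
        Observation(290, "code quality", "Leaves dead code and unused imports."),
    ],
)
assert record.is_complete()
```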