AI Model Evaluator – Benchmark Project
As an AI Model Evaluator on the Benchmark Project at Turing, I developed ideal reference solutions for tasks used in frontier AI model evaluations. My responsibilities included systematically assessing outputs from frontier systems such as OpenAI's GPT models, Anthropic's Claude, and Google's Gemini for correctness, reasoning quality, and instruction following, and providing structured, actionable feedback to improve model quality.
• Evaluated AI model responses for accuracy and completeness.
• Benchmarked frontier language models on reasoning and adherence to prompts.
• Developed reference task solutions and rubric-based assessment frameworks (a minimal sketch follows this list).
• Identified model weaknesses and hallucinations.
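To illustrate the kind of rubric-based scoring this work involved, here is a minimal Python sketch. The criterion names, weights, model name, and task id are hypothetical placeholders introduced for illustration only; they are not the Benchmark Project's actual schema or tooling.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified rubric: criteria and weights are illustrative.
RUBRIC = {
    "correctness": 0.5,            # agreement with the reference solution
    "reasoning": 0.3,              # sound, step-by-step justification
    "instruction_following": 0.2,  # honors all prompt constraints
}

@dataclass
class Evaluation:
    """One graded model response, with per-criterion scores in [0, 1]."""
    model: str
    task_id: str
    scores: dict = field(default_factory=dict)
    notes: str = ""

    def weighted_score(self) -> float:
        # Weighted sum over rubric criteria; missing criteria count as 0.
        return sum(w * self.scores.get(c, 0.0) for c, w in RUBRIC.items())

def flag_weaknesses(ev: Evaluation, threshold: float = 0.5) -> list[str]:
    """Return criteria scored below the threshold, e.g. to surface
    hallucination-prone or instruction-violating responses for feedback."""
    return [c for c in RUBRIC if ev.scores.get(c, 0.0) < threshold]

if __name__ == "__main__":
    ev = Evaluation(
        model="frontier-model-a",   # placeholder model name
        task_id="math-047",         # placeholder task id
        scores={"correctness": 1.0, "reasoning": 0.8,
                "instruction_following": 0.4},
        notes="Correct answer, but ignored the required output format.",
    )
    print(f"{ev.task_id}: {ev.weighted_score():.2f}")  # -> math-047: 0.82
    print("weak criteria:", flag_weaknesses(ev))       # -> ['instruction_following']
```

Weighting correctness most heavily mirrors the emphasis on comparison against ideal reference solutions, while the flagging helper reflects how weak reasoning or instruction-following could be surfaced as structured feedback.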