Senior AI Data Scientist & Evaluator (Contract)
Responsible for rubric-based evaluation of responses generated by large language models using the Glue Sail methodology. Conducted extensive annotation and quality scoring of AI agent outputs, focusing on correctness, coherence, and safety. Produced calibrated assessments and contributed to multiple AI evaluation and annotation projects for Outlier AI, Scale AI, and Mindrift.
• Applied detailed rubrics to assess instruction-following and multi-step reasoning capabilities
• Developed and administered STEM Q&A prompts to identify LLM failure modes
• Participated in side-by-side and single-model content evaluation, including conversational persona assessments
• Designed atomic rubrics subsequently adopted across annotation workflows