LLM Evaluation & Robustness Engineer
Evaluated large language model outputs for correctness, reasoning quality, factual accuracy, and instruction following. Designed and executed structured evaluation pipelines to assess model performance and reliability, and provided comprehensive, high-quality feedback and explanations to improve model and AI system efficacy.
• Assessed model responses for safety, consistency, and domain fit.
• Developed benchmarking protocols for reasoning and summarization tasks.
• Identified model weaknesses, including hallucinations and reasoning gaps.
• Collaborated with teams to align model outputs with expected behaviors.