AI Evaluation Analyst (RWS)
In this role, I performed expert evaluation of large language model (LLM) outputs, assessing accuracy, relevance, and behavioral alignment. I designed and refined evaluation rubrics tailored to domain-specific tasks and policy compliance, and provided analytical reporting that informed improvements to model behavior and benchmarking processes.

• Reviewed generated text outputs for logical inconsistencies, hallucinations, and prompt ambiguity.
• Collaborated with research teams to define and refine structured evaluation criteria.
• Used remote evaluation tools and followed stringent documentation protocols.
• Supported ongoing benchmarking and human-in-the-loop feedback cycles.