Human Evaluation
The project focused on training and evaluating LLMs and agent-based systems with high-quality human-labeled data, with the goal of improving reasoning, instruction following, and reliability. I worked on both single-turn responses and multi-step agent workflows in support of RLHF and model-evaluation efforts.

For single-turn work, I labeled model outputs for accuracy, hallucination risk, instruction adherence, reasoning quality, and safety, and ranked candidate responses by preference. For agent workflows, I reviewed full traces, assessing task decomposition, tool usage, state management, and error handling.

The work spanned hundreds to thousands of examples across iterative review cycles. Quality was maintained through strict annotation guidelines, consistency checks between annotators, spot audits, and escalation of high-severity issues, so the resulting training data was reliable and production-ready.
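
To make the single-turn labeling dimensions concrete, the sketch below shows one way the annotation and preference records could be structured. The field names, score scale, and dataclass layout are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    """Issue severity; HIGH items were escalated for further review."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass
class ResponseAnnotation:
    """Labels for one single-turn model response (hypothetical fields)."""
    example_id: str
    accuracy: int               # rubric score, e.g. 1-5 (assumed scale)
    hallucination_risk: int
    instruction_adherence: int
    reasoning_quality: int
    safety: Severity
    notes: str = ""

@dataclass
class PreferencePair:
    """Pairwise preference judgment supporting RLHF-style training data."""
    prompt_id: str
    chosen_id: str    # preferred response
    rejected_id: str  # less preferred response
    rationale: str = ""
```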
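
Agent-trace reviews followed the same pattern, but over whole workflows rather than single responses. A minimal record for such a review, assuming a per-trace checklist over the four dimensions named above, might look like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentTraceReview:
    """Review of one multi-step agent trace (hypothetical fields)."""
    trace_id: str
    task_decomposition_ok: bool   # steps cover the task without gaps
    tool_usage_ok: bool           # right tools called with valid arguments
    state_management_ok: bool     # context carried correctly between steps
    error_handling_ok: bool       # failures detected and recovered from
    first_failing_step: Optional[int] = None
    notes: str = ""
```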
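
Consistency between annotators can be quantified with chance-corrected agreement. The snippet below is a small, self-contained sketch of Cohen's kappa over two annotators' labels; it illustrates the kind of consistency check used, not the project's exact tooling or thresholds.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired, non-empty labels"
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently with their own marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Example: agreement on a 'safety' label for five spot-audited items.
print(cohens_kappa(["low", "high", "low", "medium", "low"],
                   ["low", "high", "low", "low", "low"]))
```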