Multi-turn LLM Evaluation and Annotation (LMArena V2 – Scientific Domains)
I worked on AI training data tasks involving multi-turn evaluation and annotation of large language model (LLM) responses for the LMArena V2 project, covering the life, physical, and social sciences. My role involved reviewing and comparing model responses across conversations of up to six turns, assessing consistency, reasoning quality, factual accuracy, and adherence to instructions throughout the interaction.

I ranked responses on clarity, relevance, and correctness, flagging issues such as hallucinations, logical inconsistencies, and breakdowns in multi-step reasoning, and I applied detailed evaluation guidelines to keep annotations consistent and reliable across tasks. The work required strong analytical thinking, attention to detail, and the ability to track context over extended conversations; the resulting annotations fed directly into reinforcement learning from human feedback (RLHF), improving model performance.
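The per-turn scoring and ranking workflow described above could be captured in a record like the following. This is a minimal sketch under my own assumptions: the field names, `TurnEvaluation` schema, issue labels, and scoring/penalty rule are hypothetical illustrations, not the actual LMArena V2 format or rubric.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class Issue(Enum):
    # Failure modes flagged during review (illustrative labels)
    HALLUCINATION = "hallucination"
    LOGICAL_INCONSISTENCY = "logical_inconsistency"
    REASONING_BREAKDOWN = "multi_step_reasoning_breakdown"

@dataclass
class TurnEvaluation:
    """Scores for one model response at one turn (1-5 scale, hypothetical)."""
    turn: int                              # 1..6 in a multi-turn task
    clarity: int
    relevance: int
    correctness: int
    issues: List[Issue] = field(default_factory=list)

def overall_score(turns: List[TurnEvaluation]) -> float:
    """Average the three criteria across turns; each flagged issue subtracts a penalty."""
    if not turns:
        return 0.0
    raw = sum(t.clarity + t.relevance + t.correctness for t in turns) / (3 * len(turns))
    penalty = 0.5 * sum(len(t.issues) for t in turns)
    return max(0.0, raw - penalty)

def rank_models(evals: Dict[str, List[TurnEvaluation]]) -> List[str]:
    """Rank model names best-first by overall score, as in a side-by-side comparison."""
    return sorted(evals, key=lambda m: overall_score(evals[m]), reverse=True)

# Example: model A is consistent across turns; model B hallucinates at turn 1.
model_a = [TurnEvaluation(1, 5, 5, 4), TurnEvaluation(2, 4, 5, 5)]
model_b = [TurnEvaluation(1, 4, 4, 3, issues=[Issue.HALLUCINATION])]
ranking = rank_models({"A": model_a, "B": model_b})  # ["A", "B"]
```

A structured record like this is what makes annotations comparable across annotators and tasks: the criteria are explicit, and issue flags carry a defined cost rather than an ad-hoc judgment.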